metarl.np.algos.off_policy_rl_algorithm module

This module implements OffPolicyRLAlgorithm, a base class for off-policy RL algorithms.

class OffPolicyRLAlgorithm(env_spec, policy, qf, replay_buffer, *, use_target=False, discount=0.99, steps_per_epoch=20, max_path_length=None, max_eval_path_length=None, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000, rollout_batch_size=1, reward_scale=1.0, smooth_return=True, exploration_policy=None)

Bases: metarl.np.algos.rl_algorithm.RLAlgorithm

Base class for off-policy RL algorithms.

Off-policy algorithms such as DQN and DDPG can inherit from it.

Parameters:

env_spec (EnvSpec) – Environment specification.
policy (metarl.np.policies.Policy) – Policy.
qf (object) – The Q-value network.
replay_buffer (metarl.replay_buffer.ReplayBuffer) – Replay buffer.
use_target (bool) – Whether to use a target network.
discount (float) – Discount factor for the cumulative return.
steps_per_epoch (int) – Number of train_once calls per epoch.
max_path_length (int) – Maximum path length. The episode terminates when the trajectory length reaches max_path_length.
max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
n_train_steps (int) – Number of training steps.
buffer_batch_size (int) – Batch size for replay buffer.
min_buffer_size (int) – The minimum buffer size for replay buffer.
rollout_batch_size (int) – Rollout batch size.
reward_scale (float) – Reward scale.
smooth_return (bool) – Whether to smooth the return.
exploration_policy (metarl.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
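
As an illustration of how discount and reward_scale could enter a return computation, here is a minimal sketch. The helper discounted_return is hypothetical and not part of metarl, and applying reward_scale as a per-step multiplier on the reward is an assumption made for the example.

```python
def discounted_return(rewards, discount=0.99, reward_scale=1.0):
    """Return sum over t of discount**t * (reward_scale * rewards[t]).

    Hypothetical helper for illustration only; not part of metarl.
    """
    total = 0.0
    # Accumulate from the last step backwards, so each step needs
    # only one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward_scale * reward + discount * total
    return total


example = discounted_return([1.0, 1.0, 1.0], discount=0.5)
```

With discount=1.0 and reward_scale=1.0 this reduces to the plain sum of rewards, i.e. the undiscounted return.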

init_opt()

Initialize the optimization procedure.

If using TensorFlow, this may include declaring all the variables and compiling functions.

log_diagnostics(paths)

Log diagnostic information on current paths.

Parameters:	paths (list[dict]) – A list of collected paths.

process_samples(itr, paths)

Return processed sample data based on the collected paths.

Parameters:

itr (int) – Iteration number.
paths (list[dict]) – A list of collected paths.

Returns:

Processed sample data, with keys:

undiscounted_returns (list[float])
success_history (list[float])
complete (list[bool])

Return type:

dict
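
The shape of the returned dict can be sketched as follows. This is not the library's implementation: the function name and the assumed path layout (keys 'rewards', 'dones', and 'success') are hypothetical, and the real method may filter or weight paths differently.

```python
def process_samples_sketch(paths):
    """Summarize collected paths into the three documented keys.

    Assumed path layout (hypothetical): each path is a dict with
    'rewards' (list of floats), 'dones' (list of bools), and
    'success' (list of 0/1 flags, one per step).
    """
    return {
        # Sum of raw rewards per path, with no discounting.
        'undiscounted_returns': [float(sum(p['rewards'])) for p in paths],
        # Fraction of steps flagged as successful in each path.
        'success_history': [
            sum(p['success']) / len(p['success']) for p in paths
        ],
        # A path counts as complete if its final step is terminal.
        'complete': [bool(p['dones'][-1]) for p in paths],
    }


sample_paths = [
    {'rewards': [1.0, 2.0], 'dones': [False, True], 'success': [0, 1]},
    {'rewards': [0.5], 'dones': [False], 'success': [0]},
]
summary = process_samples_sketch(sample_paths)
```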

train(runner)

Obtain samplers and start actual training for each epoch.

Parameters:	runner (LocalRunner) – The runner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:	The average return in the last epoch cycle.
Return type:	float
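
The relationship between epochs, steps_per_epoch, and train_once can be sketched as a nested loop. This is a rough outline under stated assumptions, not the actual train() body: StubRunner stands in for LocalRunner, and its obtain_samples method and step_itr attribute are assumptions made for the example. Only step_epochs() is named in the description above.

```python
class StubRunner:
    """Hypothetical stand-in for LocalRunner, for illustration only."""

    def __init__(self, n_epochs):
        self.n_epochs = n_epochs
        self.step_itr = 0

    def step_epochs(self):
        # The real runner also handles snapshotting; this stub only counts.
        for epoch in range(self.n_epochs):
            yield epoch

    def obtain_samples(self, itr):
        # One fake path with a constant reward per step.
        return [{'rewards': [1.0, 1.0, 1.0]}]


def train_sketch(runner, train_once, steps_per_epoch=20, reward_scale=1.0):
    """Call train_once steps_per_epoch times per epoch; return its last result."""
    last_return = None
    for _ in runner.step_epochs():
        for _ in range(steps_per_epoch):
            paths = runner.obtain_samples(runner.step_itr)
            # Assumption: rewards are scaled before the training step.
            for path in paths:
                path['rewards'] = [reward_scale * r for r in path['rewards']]
            last_return = train_once(runner.step_itr, paths)
            runner.step_itr += 1
    return last_return


calls = []


def fake_train_once(itr, paths):
    # Record the iteration number and report the mean path return.
    calls.append(itr)
    return sum(sum(p['rewards']) for p in paths) / len(paths)


result = train_sketch(StubRunner(n_epochs=2), fake_train_once,
                      steps_per_epoch=3, reward_scale=2.0)
```

Here two epochs with steps_per_epoch=3 produce six train_once calls in total.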

train_once(itr, paths)

Perform one step of policy optimization given one batch of samples.

Parameters:

itr (int) – Iteration number.
paths (list[dict]) – A list of collected paths.
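
One way a subclass might combine n_train_steps, buffer_batch_size, and min_buffer_size inside train_once is sketched below. Everything here is an assumption for illustration: the buffer is a plain list rather than a metarl ReplayBuffer, the 'transitions' path key and the optimize callback are hypothetical, and concrete algorithms may order or gate these steps differently.

```python
import random


def train_once_sketch(buffer, paths, optimize, *, n_train_steps=50,
                      buffer_batch_size=64, min_buffer_size=10000):
    """Store new transitions, then run up to n_train_steps updates.

    buffer is a plain list and optimize a callback; both are hypothetical.
    Returns the number of updates actually performed.
    """
    for path in paths:
        buffer.extend(path['transitions'])
    updates = 0
    for _ in range(n_train_steps):
        # Skip all updates until the buffer holds enough transitions.
        if len(buffer) < min_buffer_size:
            break
        # Sample a minibatch without replacement, capped at buffer size.
        batch = random.sample(buffer, min(buffer_batch_size, len(buffer)))
        optimize(batch)
        updates += 1
    return updates


buffer, batches = [], []
new_paths = [{'transitions': list(range(5))}]
first = train_once_sketch(buffer, new_paths, batches.append,
                          n_train_steps=3, buffer_batch_size=4,
                          min_buffer_size=8)
second = train_once_sketch(buffer, new_paths, batches.append,
                           n_train_steps=3, buffer_batch_size=4,
                           min_buffer_size=8)
```

The first call only fills the buffer, because five transitions are below min_buffer_size=8. The second call brings the buffer to ten transitions and performs all three updates.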