metarl.np.algos package

Reinforcement learning algorithms which use NumPy as a numerical backend.

class RLAlgorithm

Bases: abc.ABC

Base class for all the algorithms.

Note

If the field sampler_cls exists, it will be used by LocalRunner.setup to initialize a sampler.

train(runner)

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
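
The subclassing pattern can be sketched as follows. This is a minimal illustration built from stand-ins, not the real metarl classes: RLAlgorithmSketch only mirrors the abstract train(runner) interface described above, and FakeRunner is a hypothetical object whose step_epochs() simply yields epoch indices. The real LocalRunner also handles snapshotting and sampler control, which are not modeled here.

```python
import abc


class RLAlgorithmSketch(abc.ABC):
    """Stand-in for the abstract base class; not the real metarl class."""

    @abc.abstractmethod
    def train(self, runner):
        """Run training using services provided by the runner."""


class FakeRunner:
    """Hypothetical runner: step_epochs() only yields epoch indices."""

    def __init__(self, n_epochs):
        self.n_epochs = n_epochs

    def step_epochs(self):
        for epoch in range(self.n_epochs):
            yield epoch


class CountingAlgo(RLAlgorithmSketch):
    """Toy algorithm that records which epochs it was stepped through."""

    def __init__(self):
        self.seen_epochs = []

    def train(self, runner):
        # The runner drives the epoch loop; the algorithm does its work inside.
        for epoch in runner.step_epochs():
            self.seen_epochs.append(epoch)
        return len(self.seen_epochs)


algo = CountingAlgo()
n_trained = algo.train(FakeRunner(n_epochs=3))
```

Because train is declared abstract, instantiating RLAlgorithmSketch directly raises TypeError; only subclasses that implement train can be created.
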
class CEM(env_spec, policy, baseline, n_samples, discount=0.99, max_path_length=500, init_std=1, best_frac=0.05, extra_std=1.0, extra_decay_time=100)

Bases: metarl.np.algos.rl_algorithm.RLAlgorithm

Cross Entropy Method.

CEM works by iteratively optimizing a Gaussian distribution over policy parameters.

In each epoch, CEM does the following:
  1. Sample n_samples policies from a Gaussian distribution of mean cur_mean and std cur_std.
  2. Do rollouts for each policy.
  3. Update cur_mean and cur_std by doing Maximum Likelihood Estimation over the n_best top policies in terms of return.
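
The three steps above can be sketched in NumPy on a toy problem. This is an illustrative sketch, not the class's implementation: the function cem_sketch, the diagonal Gaussian, and the quadratic objective (which stands in for rollout returns) are all assumptions made for the example, and the hyperparameter values are chosen for the toy problem rather than taken from the defaults.

```python
import numpy as np


def cem_sketch(objective, dim, n_samples=50, best_frac=0.2, init_std=1.0,
               n_epochs=30, seed=0):
    """Cross-entropy method over a flat parameter vector (toy version)."""
    rng = np.random.default_rng(seed)
    cur_mean = np.zeros(dim)
    cur_std = np.full(dim, init_std)
    n_best = max(1, int(n_samples * best_frac))
    for _ in range(n_epochs):
        # 1. Sample candidate parameter vectors from the current Gaussian.
        params = cur_mean + cur_std * rng.standard_normal((n_samples, dim))
        # 2. Score each candidate (this stands in for doing rollouts).
        returns = np.array([objective(p) for p in params])
        # 3. Refit mean and std to the n_best highest-return candidates,
        #    which is the maximum likelihood fit for a diagonal Gaussian.
        best = params[np.argsort(returns)[-n_best:]]
        cur_mean = best.mean(axis=0)
        cur_std = best.std(axis=0)
    return cur_mean


# Toy objective: the return is highest when the parameters equal the target.
target = np.array([0.5, -0.5, 0.25])
found = cem_sketch(lambda p: -np.sum((p - target) ** 2), dim=3)
```
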
Parameters:
  • env_spec (metarl.envs.EnvSpec) – Environment specification.
  • policy (metarl.np.policies.Policy) – Action policy.
  • baseline (metarl.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
  • n_samples (int) – Number of policies sampled in one epoch.
  • discount (float) – Environment reward discount.
  • max_path_length (int) – Maximum length of a single rollout.
  • best_frac (float) – Fraction of the sampled policies, ranked by return, that are kept as the n_best policies for the update.
  • init_std (float) – Initial std for policy param distribution.
  • extra_std (float) – Decaying std added to param distribution.
  • extra_decay_time (float) – Epochs that it takes to decay extra std.
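
One way the extra_std and extra_decay_time parameters could interact is sketched below. The schedule shown (extra variance decaying linearly to zero over extra_decay_time epochs, then added to the current variance) is an assumption for illustration; this page does not state the exact formula, and sampling_std is a hypothetical helper name.

```python
import numpy as np


def sampling_std(cur_std, extra_std, extra_decay_time, epoch):
    """Std used for sampling at a given epoch, under an assumed schedule."""
    # Assumption: the extra variance shrinks linearly and reaches zero
    # once epoch >= extra_decay_time.
    extra_var_mult = max(1.0 - epoch / extra_decay_time, 0.0)
    return np.sqrt(np.square(cur_std) + np.square(extra_std) * extra_var_mult)


std_start = sampling_std(0.5, 1.0, 100, epoch=0)   # full extra std applied
std_half = sampling_std(0.5, 1.0, 100, epoch=50)   # half of the extra variance
std_end = sampling_std(0.5, 1.0, 100, epoch=100)   # extra std fully decayed
```

Under this assumed schedule the sampling distribution starts wider than cur_std, which encourages early exploration, and narrows to exactly cur_std once the decay time has elapsed.
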
train(runner)

Initialize variables and start training.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
train_once(itr, paths)

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:The average return of the epoch cycle.
Return type:float

class CMAES(env_spec, policy, baseline, n_samples, discount=0.99, max_path_length=500, sigma0=1.0)

Bases: metarl.np.algos.rl_algorithm.RLAlgorithm

Covariance Matrix Adaptation Evolution Strategy.

Note

The CMA-ES method can hardly learn a successful policy, even for simple tasks. It is maintained here only for consistency with the original rllab paper.

Parameters:
  • env_spec (metarl.envs.EnvSpec) – Environment specification.
  • policy (metarl.np.policies.Policy) – Action policy.
  • baseline (metarl.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
  • n_samples (int) – Number of policies sampled in one epoch.
  • discount (float) – Environment reward discount.
  • max_path_length (int) – Maximum length of a single rollout.
  • sigma0 (float) – Initial std for param distribution.
train(runner)

Initialize variables and start training.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
train_once(itr, paths)

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:The average return in the last epoch cycle.
Return type:float

class MetaRLAlgorithm

Bases: metarl.np.algos.rl_algorithm.RLAlgorithm, abc.ABC

Base class for Meta-RL Algorithms.

adapt_policy(exploration_policy, exploration_trajectories)

Produce a policy adapted for a task.

Parameters:
  • exploration_policy (metarl.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
  • exploration_trajectories (metarl.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.
Returns:A policy adapted to the task represented by the exploration_trajectories.
Return type:metarl.Policy

get_exploration_policy()

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns:The policy used to obtain samples that are later used for meta-RL adaptation.
Return type:metarl.Policy
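
The calling protocol for these two methods can be sketched with toy stand-ins. Everything in this example is hypothetical: ConstantPolicy, MeanMatchingMetaAlgo, and explore are invented for illustration, and plain nested lists stand in for metarl.TrajectoryBatch. Only the order of calls (retrieve an exploration policy once per task, explore, adapt, then stop using the exploration policy) follows the description above.

```python
class ConstantPolicy:
    """Hypothetical policy that always returns the same action."""

    def __init__(self, action):
        self.action = action

    def get_action(self, observation):
        return self.action


class MeanMatchingMetaAlgo:
    """Toy meta-RL algorithm; only its two-method protocol mirrors the docs."""

    def get_exploration_policy(self):
        # Return a fresh policy object for each task.
        return ConstantPolicy(action=0.0)

    def adapt_policy(self, exploration_policy, exploration_trajectories):
        # Toy adaptation rule: act with the mean observation seen so far.
        observations = [obs for traj in exploration_trajectories
                        for obs in traj]
        return ConstantPolicy(action=sum(observations) / len(observations))


def explore(policy, task_offset, n_trajectories=2, horizon=3):
    """Hypothetical environment: observation = task offset + action."""
    return [[task_offset + policy.get_action(None) for _ in range(horizon)]
            for _ in range(n_trajectories)]


algo = MeanMatchingMetaAlgo()
adapted_actions = []
for task_offset in (1.0, -2.0):
    exploration_policy = algo.get_exploration_policy()  # one per task
    trajectories = explore(exploration_policy, task_offset)
    adapted = algo.adapt_policy(exploration_policy, trajectories)
    del exploration_policy  # not to be reused after adapt_policy
    adapted_actions.append(adapted.get_action(None))
```
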
class NOP

Bases: metarl.np.algos.rl_algorithm.RLAlgorithm

NOP (no optimization performed) policy search algorithm.

init_opt()

Initialize the optimization procedure.

optimize_policy(paths)

Optimize the policy using the samples.

Parameters:paths (list[dict]) – A list of collected paths.
train(runner)

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
class OffPolicyRLAlgorithm(env_spec, policy, qf, replay_buffer, *, use_target=False, discount=0.99, steps_per_epoch=20, max_path_length=None, max_eval_path_length=None, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000, rollout_batch_size=1, reward_scale=1.0, smooth_return=True, exploration_policy=None)

Bases: metarl.np.algos.rl_algorithm.RLAlgorithm

Base class for off-policy RL algorithms.

Off-policy algorithms such as DQN and DDPG can inherit from it.

Parameters:
  • env_spec (EnvSpec) – Environment specification.
  • policy (metarl.np.policies.Policy) – Policy.
  • qf (object) – The Q-value network.
  • replay_buffer (metarl.replay_buffer.ReplayBuffer) – Replay buffer.
  • use_target (bool) – Whether to use a target network.
  • discount (float) – Discount factor for the cumulative return.
  • steps_per_epoch (int) – Number of train_once calls per epoch.
  • max_path_length (int) – Maximum path length. The episode will terminate when length of trajectory reaches max_path_length.
  • max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
  • n_train_steps (int) – Training steps.
  • buffer_batch_size (int) – Batch size for replay buffer.
  • min_buffer_size (int) – The minimum buffer size for replay buffer.
  • rollout_batch_size (int) – Rollout batch size.
  • reward_scale (float) – Reward scale.
  • smooth_return (bool) – Whether to smooth the return.
  • exploration_policy (metarl.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
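
As a small worked example of the discount parameter, the discounted cumulative return of a reward sequence r_0, r_1, ... is the sum of discount**t * r_t. The helper below is a hypothetical illustration of that standard formula and is not part of this class.

```python
import numpy as np


def discounted_return(rewards, discount):
    """Sum of discount**t * rewards[t] over a single path."""
    rewards = np.asarray(rewards, dtype=float)
    weights = discount ** np.arange(len(rewards))
    return float(np.sum(weights * rewards))


# With discount=0.5: 1.0 + 0.5 * 1.0 + 0.25 * 1.0 = 1.75
ret = discounted_return([1.0, 1.0, 1.0], discount=0.5)
```
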
init_opt()

Initialize the optimization procedure.

If using TensorFlow, this may include declaring all the variables and compiling functions.

log_diagnostics(paths)

Log diagnostic information on current paths.

Parameters:paths (list[dict]) – A list of collected paths.
process_samples(itr, paths)

Return processed sample data based on the collected paths.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:Processed sample data, with keys:
  • undiscounted_returns (list[float])
  • success_history (list[float])
  • complete (list[bool])
Return type:dict
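
A rough sketch of how two of these keys might be derived from a list of paths follows. The path layout is an assumption: each path is taken to be a dict with hypothetical 'rewards' and 'dones' entries, which this page does not specify. The success_history key is left out because its definition is not given here.

```python
def summarize_paths(paths):
    """Build a partial summary dict from paths (assumed key layout)."""
    return {
        # Plain sum of rewards per path, with no discounting applied.
        'undiscounted_returns': [float(sum(p['rewards'])) for p in paths],
        # A path counts as complete if its final step is flagged done.
        'complete': [bool(p['dones'][-1]) for p in paths],
    }


paths = [
    {'rewards': [1.0, 2.0], 'dones': [False, True]},
    {'rewards': [0.5], 'dones': [False]},
]
summary = summarize_paths(paths)
```
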

train(runner)

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
train_once(itr, paths)

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.