# Unity ML-Agents Gym Wrapper

A common way in which machine learning researchers interact with simulation
environments is via a wrapper provided by OpenAI called `gym`. For more
information on the gym interface, see [here](https://github.com/openai/gym).

We provide a gym wrapper and instructions for using it with existing machine
learning algorithms which utilize gym. Our wrapper provides interfaces on top of
our `UnityEnvironment` class, which is the default way of interfacing with a
Unity environment via Python.

## Installation

The gym wrapper is part of the `mlagents_envs` package. Please refer to the
[mlagents_envs installation instructions](ML-Agents-Envs-README.md).

## Using the Gym Wrapper

The gym interface is available from `mlagents_envs.envs.unity_gym_env`. To
launch an environment from the root of the project repository use:

```python
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

env = UnityToGymWrapper(unity_env, uint8_visual, flatten_branched, allow_multiple_obs)
```

- `unity_env` refers to the Unity environment to be wrapped.

- `uint8_visual` refers to whether to output visual observations as `uint8`
  values (0-255). Many common Gym environments (e.g. Atari) do this. By default
  they will be floats (0.0-1.0). Defaults to `False`.

- `flatten_branched` will flatten a branched discrete action space into a Gym
  `Discrete`. Otherwise, it will be converted into a `MultiDiscrete`. Defaults to
  `False`.

- `allow_multiple_obs` will return a list of observations. The first elements
  contain the visual observations and the last element contains the array of
  vector observations. If `False`, the environment returns a single array
  (containing a single visual observation, if present, otherwise the vector
  observation). Defaults to `False`.

- `action_space_seed` is the optional seed for action sampling. If non-None, it
  will be used to set the random seed on created `gym.Space` instances.

The returned environment `env` will function as a gym environment.
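
As a minimal sketch (not taken from the ML-Agents examples), interacting with
the wrapped environment follows the usual gym loop. The executable path below is
a placeholder you need to fill in, and `uint8_visual=True` assumes the
environment produces visual observations:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# <path-to-environment> is a placeholder for the path to your built Unity executable.
unity_env = UnityEnvironment(<path-to-environment>)
env = UnityToGymWrapper(unity_env, uint8_visual=True)

obs = env.reset()
for _ in range(1000):
    # Sample a random action from the wrapped action space and step the environment.
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()
```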

## Limitations

- It is only possible to use an environment with a **single** Agent.
- By default, the first visual observation is provided as the `observation`, if
  present. Otherwise, vector observations are provided. You can receive all
  visual and vector observations by using the `allow_multiple_obs=True` option in
  the gym parameters. If set to `True`, you will receive a list of `observation`
  instead of only one.
- The `TerminalSteps` or `DecisionSteps` output from the environment can still
  be accessed from the `info` provided by `env.step(action)` (see the sketch
  after this list).
- Stacked vector observations are not supported.
- Environment registration for use with `gym.make()` is currently not supported.
- Calling `env.render()` will not render a new frame of the environment. It will
  return the latest visual observation if using visual observations.
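
Continuing from the earlier example, the sketch below shows one way to read that
step data back out of `info`. The `"step"` key is an assumption about the
wrapper's info dict; check the wrapper source for your installed version:

```python
from mlagents_envs.base_env import DecisionSteps, TerminalSteps

obs, reward, done, info = env.step(env.action_space.sample())

# The raw ML-Agents step data is assumed to be passed through under the
# "step" key of the info dict returned by the wrapper.
step_info = info.get("step")
if isinstance(step_info, TerminalSteps):
    print("Episode ended; interrupted:", step_info.interrupted[0])
elif isinstance(step_info, DecisionSteps):
    print("Agent is requesting its next decision.")
```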

## Running OpenAI Baselines Algorithms

OpenAI provides a set of open source, maintained, and tested Reinforcement
Learning algorithms called [Baselines](https://github.com/openai/baselines).

Using the provided Gym wrapper, it is possible to train ML-Agents environments
using these algorithms. This requires the creation of custom training scripts to
launch each algorithm. In most cases these scripts can be created by making
slight modifications to the ones provided for Atari and Mujoco environments.

These examples were tested with baselines version 0.1.6.

### Example - DQN Baseline

In order to train an agent to play the `GridWorld` environment using the
Baselines DQN algorithm, you first need to install the baselines package using
pip:

```
pip install git+https://github.com/openai/baselines
```

Next, create a file called `train_unity.py`. Then create an `/envs/` directory
and build the environment to that directory. For more information on
building Unity environments, see
[here](../docs/Learning-Environment-Executable.md). Note that because of
limitations of the DQN baseline, the environment must have a single visual
observation, a single discrete action and a single Agent in the scene.
Add the following code to the `train_unity.py` file:

```python
import gym

from baselines import deepq
from baselines import logger

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper


def main():
    # Replace <path-to-environment> with the path to your built Unity executable.
    unity_env = UnityEnvironment(<path-to-environment>)
    env = UnityToGymWrapper(unity_env, uint8_visual=True)
    logger.configure('./logs')  # Change to log in a different directory
    act = deepq.learn(
        env,
        "cnn",  # For visual inputs
        lr=2.5e-4,
        total_timesteps=1000000,
        buffer_size=50000,
        exploration_fraction=0.05,
        exploration_final_eps=0.1,
        print_freq=20,
        train_freq=5,
        learning_starts=20000,
        target_network_update_freq=50,
        gamma=0.99,
        prioritized_replay=False,
        checkpoint_freq=1000,
        checkpoint_path='./logs',  # Change to save model in a different directory
        dueling=True
    )
    print("Saving model to unity_model.pkl")
    act.save("unity_model.pkl")


if __name__ == '__main__':
    main()
```

To start the training process, run the following from the directory containing
`train_unity.py`:

```sh
python -m train_unity
```

### Other Algorithms

Other algorithms in the Baselines repository can be run using scripts similar to
the examples from the baselines package. In most cases, the primary changes
needed to use a Unity environment are to import `UnityToGymWrapper`, and to
replace the environment creation code, typically `gym.make()`, with a call to
`UnityToGymWrapper(unity_environment)`, passing the environment as input.
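
For example, a script that previously built its environment with `gym.make()`
might be adapted along these lines (a rough sketch; the executable path is a
placeholder):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

# Before: env = gym.make("PongNoFrameskip-v4")
# After: wrap a built Unity executable instead of creating a registered gym env.
unity_env = UnityEnvironment(<path-to-environment>)
env = UnityToGymWrapper(unity_env, uint8_visual=True)
```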

A typical rule of thumb is that for vision-based environments, modification
should be done to Atari training scripts, and for vector observation
environments, modification should be done to Mujoco scripts.

Some algorithms will make use of `make_env()` or `make_mujoco_env()` functions.
You can define a similar function for Unity environments. An example of such a
method using the PPO2 baseline:

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.bench import Monitor
from baselines import logger
import baselines.ppo2.ppo2 as ppo2

import os

try:
    from mpi4py import MPI
except ImportError:
    MPI = None


def make_unity_env(env_directory, num_env, visual, start_index=0):
    """
    Create a wrapped, monitored Unity environment.
    """

    def make_env(rank, use_visual=True):  # pylint: disable=C0111
        def _thunk():
            unity_env = UnityEnvironment(env_directory, base_port=5000 + rank)
            env = UnityToGymWrapper(unity_env, uint8_visual=True)
            env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
            return env

        return _thunk

    if visual:
        return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
    else:
        rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
        return DummyVecEnv([make_env(rank, use_visual=False)])


def main():
    # Replace <path-to-environment> with the path to your built Unity executable.
    env = make_unity_env(<path-to-environment>, 4, True)
    ppo2.learn(
        network="mlp",
        env=env,
        total_timesteps=100000,
        lr=1e-3,
    )


if __name__ == '__main__':
    main()
```

## Run Google Dopamine Algorithms

Google provides a framework [Dopamine](https://github.com/google/dopamine), and
implementations of algorithms, e.g. DQN, Rainbow, and the C51 variant of
Rainbow. Using the Gym wrapper, we can run Unity environments using Dopamine.

First, after installing the Gym wrapper, clone the Dopamine repository.

```
git clone https://github.com/google/dopamine
```

Then, follow the appropriate install instructions as specified on
[Dopamine's homepage](https://github.com/google/dopamine). Note that the
Dopamine guide specifies using a virtualenv. If you choose to do so, make sure
the `mlagents_envs` package is also installed within the same virtualenv as
Dopamine.

### Adapting Dopamine's Scripts

First, open `dopamine/atari/run_experiment.py`. Alternatively, copy the entire
`atari` folder, and name it something else (e.g. `unity`). If you choose the
copy approach, be sure to change the package names in the import statements in
`train.py` to your new directory.

Within `run_experiment.py`, we will need to make changes to which environment is
instantiated, just as in the Baselines example. At the top of the file, insert

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
```

to import the Gym Wrapper. Navigate to the `create_atari_environment` method in
the same file, and switch to instantiating a Unity environment by replacing the
method with the following code.

```python
game_version = 'v0' if sticky_actions else 'v4'
full_game_name = '{}NoFrameskip-{}'.format(game_name, game_version)
unity_env = UnityEnvironment(<path-to-environment>)
env = UnityToGymWrapper(unity_env, uint8_visual=True)
return env
```

`<path-to-environment>` is the path to your built Unity executable. For more
information on building Unity environments, see
[here](../docs/Learning-Environment-Executable.md), and note the Limitations
section below.

Note that we are not using the preprocessor from Dopamine, as it uses many
Atari-specific calls. Furthermore, frame-skipping can be done from within Unity,
rather than on the Python side.

### Limitations

Since Dopamine is designed around variants of DQN, it is only compatible with
discrete action spaces, and specifically the Discrete Gym space. For
environments that use branched discrete action spaces, you can enable the
`flatten_branched` parameter in `UnityToGymWrapper`, which treats each
combination of branched actions as a separate action.
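
As a rough sketch (the executable path is a placeholder), enabling
`flatten_branched` produces a single `Discrete` space whose size is the product
of the branch sizes:

```python
from gym.spaces import Discrete

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

unity_env = UnityEnvironment(<path-to-environment>)

# Flatten a branched discrete action space (e.g. branches of sizes 3 and 2)
# into a single Discrete(6) space that DQN-style agents can act in.
env = UnityToGymWrapper(unity_env, uint8_visual=True, flatten_branched=True)
assert isinstance(env.action_space, Discrete)
print(env.action_space.n)
```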

Furthermore, when building your environments, ensure that your Agent is using
visual observations with greyscale enabled, and that the dimensions of the
visual observations are 84 by 84 (matching the parameter found in `dqn_agent.py`
and `rainbow_agent.py`). Dopamine's agents currently do not automatically adapt
to the observation dimensions or number of channels.
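
Before starting a long run, it can be worth confirming that the wrapped
environment exposes the observation shape Dopamine expects. A quick check,
assuming a channel-last greyscale visual observation, might look like:

```python
# Sanity-check the observation shape expected by Dopamine's agents:
# (84, 84, 1) corresponds to an 84x84 greyscale visual observation.
obs_shape = env.observation_space.shape
assert obs_shape == (84, 84, 1), f"Unexpected observation shape: {obs_shape}"
```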

### Hyperparameters

The hyperparameters provided by Dopamine are tailored to the Atari games, and
you will likely need to adjust them for ML-Agents environments. Here is a sample
`dopamine/agents/rainbow/configs/rainbow.gin` file that is known to work with
a simple GridWorld.

```python
import dopamine.agents.rainbow.rainbow_agent
import dopamine.unity.run_experiment
import dopamine.replay_memory.prioritized_replay_buffer
import gin.tf.external_configurables

RainbowAgent.num_atoms = 51
RainbowAgent.stack_size = 1
RainbowAgent.vmax = 10.
RainbowAgent.gamma = 0.99
RainbowAgent.update_horizon = 3
RainbowAgent.min_replay_history = 20000 # agent steps
RainbowAgent.update_period = 5
RainbowAgent.target_update_period = 50 # agent steps
RainbowAgent.epsilon_train = 0.1
RainbowAgent.epsilon_eval = 0.01
RainbowAgent.epsilon_decay_period = 50000 # agent steps
RainbowAgent.replay_scheme = 'prioritized'
RainbowAgent.tf_device = '/cpu:0' # use '/cpu:*' for non-GPU version
RainbowAgent.optimizer = @tf.train.AdamOptimizer()

tf.train.AdamOptimizer.learning_rate = 0.00025
tf.train.AdamOptimizer.epsilon = 0.0003125

Runner.game_name = "Unity" # any name can be used here
Runner.sticky_actions = False
Runner.num_iterations = 200
Runner.training_steps = 10000 # agent steps
Runner.evaluation_steps = 500 # agent steps
Runner.max_steps_per_episode = 27000 # agent steps

WrappedPrioritizedReplayBuffer.replay_capacity = 1000000
WrappedPrioritizedReplayBuffer.batch_size = 32
```

This example assumed you copied `atari` to a separate folder named `unity`.
Replace `unity` in `import dopamine.unity.run_experiment` with the folder you
copied your `run_experiment.py` and `train.py` files to. If you directly
modified the existing files, then use `atari` here.

### Starting a Run

You can now run Dopamine as you would normally:

```
python -um dopamine.unity.train \
  --agent_name=rainbow \
  --base_dir=/tmp/dopamine \
  --gin_files='dopamine/agents/rainbow/configs/rainbow.gin'
```

Again, we assume that you've copied `atari` into a separate folder. Remember to
replace `unity` with the directory you copied your files into. If you edited the
Atari files directly, this should be `atari`.

### Example: GridWorld

As a baseline, here are rewards over time for the three algorithms provided with
Dopamine as run on the GridWorld example environment. All Dopamine (DQN,
Rainbow, C51) runs were done with the same epsilon, epsilon decay, replay
history, training steps, and buffer settings as specified above. Note that the
first 20000 steps are used to pre-fill the training buffer, and no learning
happens.

We provide results from our PPO implementation and the DQN from Baselines as
reference. Note that all runs used the same greyscale GridWorld as Dopamine. For
PPO, `num_layers` was set to 2, and all other hyperparameters are the default
for GridWorld in `config/ppo/GridWorld.yaml`. For Baselines DQN, the provided
hyperparameters in the previous section are used. Note that Baselines implements
certain features (e.g. dueling-Q) that are not enabled in Dopamine DQN.

![Dopamine on GridWorld](images/dopamine_gridworld_plot.png "Dopamine on GridWorld")