# Example Learning Environments

<img src="../images/example-envs.png" align="middle" width="3000"/>

The Unity ML-Agents Toolkit includes an expanding set of example environments
that highlight the various features of the toolkit. These environments can also
serve as templates for new environments or as ways to test new ML algorithms.
Environments are located in `Project/Assets/ML-Agents/Examples` and summarized
below.

For the environments that highlight specific features of the toolkit, we
provide pre-trained model files and the training config files that enable you
to train the scenes yourself. The environments that are designed to serve as
challenges for researchers do not have accompanying pre-trained model files or
training configs and are marked as _Optional_ below.

This page only overviews the example environments we provide. To learn more
about how to design and build your own environments, see our
[Making a New Learning Environment](Learning-Environment-Create-New.md) page. If
you would like to contribute environments, please see our
[contribution guidelines](CONTRIBUTING.md) page.

## Basic

- Set-up: A linear movement task where the agent must move left or right toward
  rewarding states.
- Goal: Move to the state with the highest reward.
- Agents: The environment contains one agent.
- Agent Reward Function:
  - -0.01 at each step.
  - +0.1 for arriving at the suboptimal state.
  - +1.0 for arriving at the optimal state.
- Behavior Parameters:
  - Vector Observation space: One variable corresponding to the current state.
  - Actions: 1 discrete action branch with 3 actions (move left, do nothing,
    move right).
  - Visual Observations: None
- Float Properties: None
- Benchmark Mean Reward: 0.93
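
The reward structure above can be sketched as a small function. This is an
illustrative snippet, not the actual environment code (which lives in the Unity
C# Agent); the function and parameter names are our own:

```python
def basic_step_reward(reached_suboptimal: bool, reached_optimal: bool) -> float:
    """Per-step reward for Basic: -0.01 each step, +0.1 at the suboptimal
    state, +1.0 at the optimal state."""
    reward = -0.01
    if reached_suboptimal:
        reward += 0.1
    if reached_optimal:
        reward += 1.0
    return reward
```

If the optimal state is, say, seven steps away, the episode return is
1.0 − 7 × 0.01 = 0.93, which lines up with the benchmark above.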

## 3DBall: 3D Balance Ball

![3D Balance Ball](images/balance.png)

- Set-up: A balance-ball task, where the agent balances the ball on its head.
- Goal: The agent must balance the ball on its head for as long as possible.
- Agents: The environment contains 12 agents of the same kind, all using the
  same Behavior Parameters.
- Agent Reward Function:
  - +0.1 for every step the ball remains on the agent's head.
  - -1.0 if the ball falls off.
- Behavior Parameters:
  - Vector Observation space: 8 variables corresponding to the rotation of the
    agent cube and the position and velocity of the ball.
  - Vector Observation space (Hard Version): 5 variables corresponding to the
    rotation of the agent cube and the position of the ball.
  - Actions: 2 continuous actions, with one value corresponding to X-rotation
    and the other to Z-rotation.
  - Visual Observations: Third-person view from the upper-front of the agent.
    Use the `Visual3DBall` scene.
- Float Properties: Three
  - scale: Specifies the scale of the ball in the 3 dimensions (equal across
    the three dimensions)
    - Default: 1
    - Recommended Minimum: 0.2
    - Recommended Maximum: 5
  - gravity: Magnitude of gravity
    - Default: 9.81
    - Recommended Minimum: 4
    - Recommended Maximum: 105
  - mass: Specifies the mass of the ball
    - Default: 1
    - Recommended Minimum: 0.1
    - Recommended Maximum: 20
- Benchmark Mean Reward: 100
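
The float properties above are reset parameters you can randomize between the
recommended bounds for domain randomization. A minimal sketch of sampling them,
assuming uniform ranges taken from the table (in practice the values are set
through the training config's `environment_parameters` section or the Python
`EnvironmentParametersChannel`):

```python
import random

# Recommended ranges from the 3DBall float properties table above.
RANGES = {"scale": (0.2, 5.0), "gravity": (4.0, 105.0), "mass": (0.1, 20.0)}

def sample_reset_params(rng: random.Random) -> dict:
    """Draw one value per reset parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
```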

## GridWorld

![GridWorld](images/gridworld.png)

- Set-up: A multi-goal version of the grid-world task. The scene contains an
  agent, goals, and obstacles.
- Goal: The agent must navigate the grid to the appropriate goal while avoiding
  the obstacles.
- Agents: The environment contains nine agents with the same Behavior
  Parameters.
- Agent Reward Function:
  - -0.01 for every step.
  - +1.0 if the agent navigates to the correct goal (episode ends).
  - -1.0 if the agent navigates to an incorrect goal (episode ends).
- Behavior Parameters:
  - Vector Observation space: None
  - Actions: 1 discrete action branch with 5 actions, corresponding to movement
    in the cardinal directions or not moving. Note that for this environment,
    [action masking](Learning-Environment-Design-Agents.md#masking-discrete-actions)
    is turned on by default (this option can be toggled using the
    `Mask Actions` checkbox within the `trueAgent` GameObject). The trained
    model file provided was generated with action masking turned on.
  - Visual Observations: One corresponding to a top-down view of the GridWorld.
  - Goal Signal: A one-hot vector corresponding to which color is the correct
    goal for the agent.
- Float Properties: Three, corresponding to grid size, number of green goals,
  and number of red goals.
- Benchmark Mean Reward: 0.8
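
The goal signal above is a standard one-hot encoding: a vector with a single
1.0 at the index of the correct goal color. A minimal sketch (the helper name
is ours, for illustration only):

```python
def one_hot_goal(goal_index: int, num_goal_types: int) -> list:
    """One-hot encoding of the correct goal, as in the GridWorld goal signal."""
    signal = [0.0] * num_goal_types
    signal[goal_index] = 1.0
    return signal
```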

## Push Block

![Push](images/push.png)

- Set-up: A platforming environment where the agent can push a block around.
- Goal: The agent must push the block to the goal.
- Agents: The environment contains one agent.
- Agent Reward Function:
  - -0.0025 for every step.
  - +1.0 if the block touches the goal.
- Behavior Parameters:
  - Vector Observation space: (Continuous) 70 variables corresponding to 14
    ray-casts each detecting one of three possible objects (wall, goal, or
    block).
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn
    clockwise and counterclockwise, move along four different face directions,
    or do nothing.
- Float Properties: Four
  - block_scale: Scale of the block along the x and z dimensions
    - Default: 2
    - Recommended Minimum: 0.5
    - Recommended Maximum: 4
  - dynamic_friction: Coefficient of friction for the ground material acting on
    moving objects
    - Default: 0
    - Recommended Minimum: 0
    - Recommended Maximum: 1
  - static_friction: Coefficient of friction for the ground material acting on
    stationary objects
    - Default: 0
    - Recommended Minimum: 0
    - Recommended Maximum: 1
  - block_drag: Effect of air resistance on the block
    - Default: 0.5
    - Recommended Minimum: 0
    - Recommended Maximum: 2000
- Benchmark Mean Reward: 4.5
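
The 70-variable count above follows the toolkit's ray-cast observation layout:
each ray contributes a one-hot entry per detectable tag plus a "missed" flag
and a normalized hit distance. This per-ray layout is our reading of the
numbers in these sections, not an official formula, but the arithmetic checks
out against several environments here:

```python
def ray_obs_size(num_rays: int, num_tags: int, stacks: int = 1) -> int:
    """Observation size of a ray-cast sensor: per ray, one entry per
    detectable tag, plus 1 'missed' flag and 1 hit distance, times the
    number of stacked observations."""
    return num_rays * (num_tags + 2) * stacks

# Push Block: 14 rays x (3 tags + 2) = 70 variables
```

The same arithmetic reproduces the Soccer Twos forward ray-casts below:
11 rays × (6 tags + 2) × 3 stacks = 264 state dimensions.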

## Wall Jump

![Wall](images/wall.png)

- Set-up: A platforming environment where the agent can jump over a wall.
- Goal: The agent must use the block to scale the wall and reach the goal.
- Agents: The environment contains one agent linked to two different Models.
  The Policy the agent is linked to changes depending on the height of the
  wall. The change of Policy is done in the WallJumpAgent class.
- Agent Reward Function:
  - -0.0005 for every step.
  - +1.0 if the agent touches the goal.
  - -1.0 if the agent falls off the platform.
- Behavior Parameters:
  - Vector Observation space: Size of 74, corresponding to 14 ray-casts each
    detecting 4 possible objects, plus the global position of the agent and
    whether or not the agent is grounded.
  - Actions: 4 discrete action branches:
    - Forward Motion (3 possible actions: Forward, Backwards, No Action)
    - Rotation (3 possible actions: Rotate Left, Rotate Right, No Action)
    - Side Motion (3 possible actions: Left, Right, No Action)
    - Jump (2 possible actions: Jump, No Action)
  - Visual Observations: None
- Float Properties: Four
- Benchmark Mean Reward (Big & Small Wall): 0.8

## Crawler

![Crawler](images/crawler.png)

- Set-up: A creature with 4 arms and 4 forearms.
- Goal: The agent must move its body toward the goal direction without falling.
- Agents: The environment contains 10 agents with the same Behavior Parameters.
- Agent Reward Function (independent): The reward function is geometric,
  meaning the reward at each step is the product of the individual rewards
  rather than their sum; this pushes the agent to maximize all of the rewards
  instead of only the easiest ones.
  - Body velocity matches goal velocity (normalized between (0,1)).
  - Head direction alignment with goal direction (normalized between (0,1)).
- Behavior Parameters:
  - Vector Observation space: 172 variables corresponding to the position,
    rotation, velocity, and angular velocity of each limb, plus the
    acceleration and angular acceleration of the body.
  - Actions: 20 continuous actions, corresponding to target rotations for the
    joints.
  - Visual Observations: None
- Float Properties: None
- Benchmark Mean Reward: 3000
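
The geometric reward described above multiplies the normalized terms instead of
summing them, so a near-zero score on either term drags the whole step reward
toward zero. A hedged sketch of the idea (the shaping itself lives in the C#
environment code):

```python
def geometric_reward(velocity_match: float, heading_alignment: float) -> float:
    """Product of normalized (0,1) reward terms, as described for the
    Crawler, Worm, and Walker environments."""
    return velocity_match * heading_alignment
```

With a sum, the agent could score well by maximizing only the easier term; the
product forces it to attend to both at once.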

## Worm

![Worm](images/worm.png)

- Set-up: A worm with a head and 3 body segments.
- Goal: The agent must move its body toward the goal direction.
- Agents: The environment contains 10 agents with the same Behavior Parameters.
- Agent Reward Function (independent): The reward function is geometric,
  meaning the reward at each step is the product of the individual rewards
  rather than their sum; this pushes the agent to maximize all of the rewards
  instead of only the easiest ones.
  - Body velocity matches goal velocity (normalized between (0,1)).
  - Body direction alignment with goal direction (normalized between (0,1)).
- Behavior Parameters:
  - Vector Observation space: 64 variables corresponding to the position,
    rotation, velocity, and angular velocity of each limb, plus the
    acceleration and angular acceleration of the body.
  - Actions: 9 continuous actions, corresponding to target rotations for the
    joints.
  - Visual Observations: None
- Float Properties: None
- Benchmark Mean Reward: 800

## Food Collector

![Collector](images/foodCollector.png)

- Set-up: A multi-agent environment where agents compete to collect food.
- Goal: The agents must learn to collect as many green food spheres as possible
  while avoiding red spheres.
- Agents: The environment contains 5 agents with the same Behavior Parameters.
- Agent Reward Function (independent):
  - +1 for interaction with green spheres.
  - -1 for interaction with red spheres.
- Behavior Parameters:
  - Vector Observation space: 53 corresponding to the velocity of the agent
    (2), whether the agent is frozen and/or has shot its laser (2), plus a
    grid-based perception of objects around the agent's forward direction (40
    by 40 with 6 different categories).
  - Actions:
    - 3 continuous actions corresponding to Forward Motion, Side Motion, and
      Rotation.
    - 1 discrete action branch for the laser with 2 possible actions,
      corresponding to Shoot Laser or No Action.
  - Visual Observations (Optional): First-person camera per agent, plus one
    vector flag representing the frozen state of the agent. This scene uses a
    combination of vector and visual observations, and training will not
    succeed without the frozen vector flag. Use the `VisualFoodCollector`
    scene.
- Float Properties: Two
  - laser_length: Length of the laser used by the agent
    - Default: 1
    - Recommended Minimum: 0.2
    - Recommended Maximum: 7
  - agent_scale: Specifies the scale of the agent in the 3 dimensions (equal
    across the three dimensions)
    - Default: 1
    - Recommended Minimum: 0.5
    - Recommended Maximum: 5
- Benchmark Mean Reward: 10

## Hallway

![Hallway](images/hallway.png)

- Set-up: Environment where the agent needs to find information in a room,
  remember it, and use it to move to the correct goal.
- Goal: Move to the goal that corresponds to the color of the block in the
  room.
- Agents: The environment contains one agent.
- Agent Reward Function (independent):
  - +1 for moving to the correct goal.
  - -0.1 for moving to the incorrect goal.
  - -0.0003 existential penalty.
- Behavior Parameters:
  - Vector Observation space: 30 corresponding to local ray-casts detecting
    objects, goals, and walls.
  - Actions: 1 discrete action branch with 4 actions, corresponding to agent
    rotation and forward/backward movement.
- Float Properties: None
- Benchmark Mean Reward: 0.7
- To train this environment, you can enable curiosity by adding the `curiosity`
  reward signal in `config/ppo/Hallway.yaml`.
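
Enabling curiosity amounts to adding a `curiosity` entry under `reward_signals`
in the trainer config. A sketch of the shape of that change; the `strength` and
`gamma` values here are illustrative placeholders, so check the shipped
`config/ppo/Hallway.yaml` for the tuned hyperparameters:

```yaml
behaviors:
  Hallway:
    trainer_type: ppo
    # ...existing hyperparameters and network settings...
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        gamma: 0.99
        strength: 0.02
```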

## Soccer Twos

![SoccerTwos](images/soccer.png)

- Set-up: Environment where four agents compete in a 2 vs 2 toy soccer game.
- Goal: Get the ball into the opponent's goal while preventing the ball from
  entering your own goal.
- Agents: The environment contains two different Multi Agent Groups with two
  agents in each. Behavior Parameters: SoccerTwos.
- Agent Reward Function (dependent):
  - (1 - `accumulated time penalty`) when the ball enters the opponent's goal.
    The `accumulated time penalty` is incremented by (1 / `MaxStep`) every
    fixed update and is reset to 0 at the beginning of an episode.
  - -1 when the ball enters the team's own goal.
- Behavior Parameters:
  - Vector Observation space: 336 corresponding to 11 ray-casts forward
    distributed over 120 degrees and 3 ray-casts backward distributed over 90
    degrees, each detecting 6 possible object types, along with the object's
    distance. The forward ray-casts contribute 264 state dimensions and the
    backward ray-casts 72 state dimensions over three observation stacks.
  - Actions: 3 discrete branched actions corresponding to forward, backward,
    and sideways movement, as well as rotation.
  - Visual Observations: None
- Float Properties: Two
  - ball_scale: Specifies the scale of the ball in the 3 dimensions (equal
    across the three dimensions)
    - Default: 7.5
    - Recommended Minimum: 4
    - Recommended Maximum: 10
  - gravity: Magnitude of gravity
    - Default: 9.81
    - Recommended Minimum: 6
    - Recommended Maximum: 20
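
The scoring reward above decays with episode length, so faster goals are worth
more. A small sketch of that bookkeeping (names are ours; the real logic runs
in the environment's fixed-update loop):

```python
def goal_reward(steps_elapsed: int, max_step: int) -> float:
    """Reward for scoring in Soccer Twos: 1 minus the accumulated time
    penalty, which grows by 1/MaxStep each fixed update."""
    accumulated_time_penalty = steps_elapsed / max_step
    return 1.0 - accumulated_time_penalty
```

Scoring immediately is worth the full +1, while scoring at the very end of the
episode is worth almost nothing.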

## Strikers Vs. Goalie

![StrikersVsGoalie](images/strikersvsgoalie.png)

- Set-up: Environment where three agents compete in a 2 vs 1 soccer variant.
- Goal:
  - Striker: Get the ball into the opponent's goal.
  - Goalie: Keep the ball out of the goal.
- Agents: The environment contains two different Multi Agent Groups, one with
  two Strikers and the other with one Goalie. Behavior Parameters: Striker,
  Goalie.
- Striker Agent Reward Function (dependent):
  - +1 when the ball enters the opponent's goal.
  - -0.001 existential penalty.
- Goalie Agent Reward Function (dependent):
  - -1 when the ball enters the goal.
  - +0.001 existential bonus.
- Behavior Parameters:
  - Striker Vector Observation space: 294 corresponding to 11 ray-casts forward
    distributed over 120 degrees and 3 ray-casts backward distributed over 90
    degrees, each detecting 5 possible object types, along with the object's
    distance. The forward ray-casts contribute 231 state dimensions and the
    backward ray-casts 63 state dimensions over three observation stacks.
  - Striker Actions: 3 discrete branched actions corresponding to forward,
    backward, and sideways movement, as well as rotation.
  - Goalie Vector Observation space: 738 corresponding to 41 ray-casts
    distributed over 360 degrees, each detecting 4 possible object types, along
    with the object's distance, over three observation stacks.
  - Goalie Actions: 3 discrete branched actions corresponding to forward,
    backward, and sideways movement, as well as rotation.
  - Visual Observations: None
- Float Properties: Two
  - ball_scale: Specifies the scale of the ball in the 3 dimensions (equal
    across the three dimensions)
    - Default: 7.5
    - Recommended Minimum: 4
    - Recommended Maximum: 10
  - gravity: Magnitude of gravity
    - Default: 9.81
    - Recommended Minimum: 6
    - Recommended Maximum: 20

## Walker

![Walker](images/walker.png)

- Set-up: Physics-based humanoid agents with 26 degrees of freedom. These DOFs
  correspond to articulation of the following body parts: hips, chest, spine,
  head, thighs, shins, feet, arms, forearms, and hands.
- Goal: The agent must move its body toward the goal direction without falling.
- Agents: The environment contains 10 independent agents with the same Behavior
  Parameters.
- Agent Reward Function (independent): The reward function is geometric,
  meaning the reward at each step is the product of the individual rewards
  rather than their sum; this pushes the agent to maximize all of the rewards
  instead of only the easiest ones.
  - Body velocity matches goal velocity (normalized between (0,1)).
  - Head direction alignment with goal direction (normalized between (0,1)).
- Behavior Parameters:
  - Vector Observation space: 243 variables corresponding to the position,
    rotation, velocity, and angular velocity of each limb, along with the goal
    direction.
  - Actions: 39 continuous actions, corresponding to target rotations and
    strength applicable to the joints.
  - Visual Observations: None
- Float Properties: Four
  - gravity: Magnitude of gravity
    - Default: 9.81
    - Recommended Minimum:
    - Recommended Maximum:
  - hip_mass: Mass of the hip component of the walker
    - Default: 8
    - Recommended Minimum: 7
    - Recommended Maximum: 28
  - chest_mass: Mass of the chest component of the walker
    - Default: 8
    - Recommended Minimum: 3
    - Recommended Maximum: 20
  - spine_mass: Mass of the spine component of the walker
    - Default: 8
    - Recommended Minimum: 3
    - Recommended Maximum: 20
- Benchmark Mean Reward: 2500

## Pyramids

![Pyramids](images/pyramids.png)

- Set-up: Environment where the agent needs to press a button to spawn a
  pyramid, then navigate to the pyramid, knock it over, and move to the gold
  brick at the top.
- Goal: Move to the golden brick on top of the spawned pyramid.
- Agents: The environment contains one agent.
- Agent Reward Function (independent):
  - +2 for moving to the golden brick (minus 0.001 per step).
- Behavior Parameters:
  - Vector Observation space: 148 corresponding to local ray-casts detecting
    the switch, bricks, golden brick, and walls, plus a variable indicating the
    switch state.
  - Actions: 1 discrete action branch with 4 actions, corresponding to agent
    rotation and forward/backward movement.
- Float Properties: None
- Benchmark Mean Reward: 1.75

## Match 3

![Match 3](images/match3.png)

- Set-up: Simple match-3 game. Matched pieces are removed, and the remaining
  pieces drop down. New pieces are spawned randomly at the top, with a chance
  of being "special".
- Goal: Maximize score from matching pieces.
- Agents: The environment contains several independent agents.
- Agent Reward Function (independent):
  - +0.01 for each normal piece cleared. Special pieces are worth 2x or 3x.
- Behavior Parameters:
  - None
  - Observations and actions are defined with a sensor and actuator,
    respectively.
- Float Properties: None
- Benchmark Mean Reward:
  - 39.5 for visual observations
  - 38.5 for vector observations
  - 34.2 for a simple heuristic (pick a random valid move)
  - 37.0 for a greedy heuristic (pick the highest-scoring valid move)
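
The clearing reward above can be sketched as follows; the grouping of cleared
pieces into normal/2x/3x counts is our own framing of the description:

```python
def clear_reward(num_normal: int, num_2x: int = 0, num_3x: int = 0) -> float:
    """0.01 per normal piece cleared; special pieces are worth 2x or 3x that."""
    return 0.01 * (num_normal + 2 * num_2x + 3 * num_3x)
```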

## Sorter

![Sorter](images/sorter.png)

- Set-up: The Agent is in a circular room with numbered tiles. The values of
  the tiles are random between 1 and 20. The tiles present in the room are
  randomized at each episode. When the Agent visits a tile, it turns green.
- Goal: Visit all the tiles in ascending order.
- Agents: The environment contains a single Agent.
- Agent Reward Function:
  - -0.0002 existential penalty.
  - +1 for visiting the right tile.
  - -1 for visiting the wrong tile.
- Behavior Parameters:
  - Vector Observations: 4; 2 floats for position and 2 floats for orientation.
  - Variable Length Observations: Between 1 and 20 entities (one for each
    tile), each with 23 observations: the first 20 are a one-hot encoding of
    the value of the tile, the 21st and 22nd represent the position of the tile
    relative to the Agent, and the 23rd is `1` if the tile was visited and `0`
    otherwise.
  - Actions: 3 discrete branched actions corresponding to forward, backward,
    and sideways movement, as well as rotation.
- Float Properties: One
  - num_tiles: The maximum number of tiles to sample.
    - Default: 2
    - Recommended Minimum: 1
    - Recommended Maximum: 20
- Benchmark Mean Reward: Depends on the number of tiles.
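
Each tile entity in the variable-length observation packs a one-hot tile value,
a relative position, and a visited flag into one vector. A sketch of that
layout (the function name and argument names are ours):

```python
def encode_tile(value: int, rel_x: float, rel_z: float, visited: bool) -> list:
    """Per-tile entity observation for Sorter: 20 one-hot entries for the
    tile value (1-20), 2 for position relative to the agent, 1 visited flag."""
    one_hot = [0.0] * 20
    one_hot[value - 1] = 1.0
    return one_hot + [rel_x, rel_z, 1.0 if visited else 0.0]
```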

## Cooperative Push Block

![CoopPushBlock](images/cooperative_pushblock.png)

- Set-up: Similar to Push Block, the agents are in an area with blocks that
  need to be pushed into a goal. Small blocks can be pushed by one agent and
  are worth +1, medium blocks require two agents to push in and are worth +2,
  and large blocks require all 3 agents to push and are worth +3.
- Goal: Push all blocks into the goal.
- Agents: The environment contains three Agents in a Multi Agent Group.
- Agent Reward Function:
  - -0.0001 existential penalty, as a group reward.
  - +1, +2, or +3 for pushing in a block, added as a group reward.
- Behavior Parameters:
  - Observation space: A single Grid Sensor with separate tags for each block
    size, the goal, the walls, and other agents.
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn
    clockwise and counterclockwise, move along four different face directions,
    or do nothing.
- Float Properties: None
- Benchmark Mean Reward: 11 (Group Reward)

## Dungeon Escape

![DungeonEscape](images/dungeon_escape.png)

- Set-up: Agents are trapped in a dungeon with a dragon and must work together
  to escape. To retrieve the key, one of the agents must find and slay the
  dragon, sacrificing itself to do so. The dragon drops a key for the others to
  use. The other agents can then pick up this key and unlock the dungeon door.
  If the agents take too long, the dragon escapes through a portal and the
  environment resets.
- Goal: Unlock the dungeon door and leave.
- Agents: The environment contains three Agents in a Multi Agent Group and one
  Dragon, which moves in a predetermined pattern.
- Agent Reward Function:
  - +1 group reward if any agent successfully unlocks the door and leaves the
    dungeon.
- Behavior Parameters:
  - Observation space: A Ray Perception Sensor with separate tags for the
    walls, other agents, the door, the key, the dragon, and the dragon's
    portal. A single Vector Observation indicates whether the agent is holding
    a key.
  - Actions: 1 discrete action branch with 7 actions, corresponding to turn
    clockwise and counterclockwise, move along four different face directions,
    or do nothing.
- Float Properties: None
- Benchmark Mean Reward: 1.0 (Group Reward)