# Table of Contents

* [mlagents.trainers.optimizer.torch\_optimizer](#mlagents.trainers.optimizer.torch_optimizer)
  * [TorchOptimizer](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer)
    * [create\_reward\_signals](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals)
    * [get\_trajectory\_value\_estimates](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates)
* [mlagents.trainers.optimizer.optimizer](#mlagents.trainers.optimizer.optimizer)
  * [Optimizer](#mlagents.trainers.optimizer.optimizer.Optimizer)
    * [update](#mlagents.trainers.optimizer.optimizer.Optimizer.update)

<a name="mlagents.trainers.optimizer.torch_optimizer"></a>
# mlagents.trainers.optimizer.torch\_optimizer

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer"></a>
## TorchOptimizer Objects

```python
class TorchOptimizer(Optimizer)
```

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals"></a>
#### create\_reward\_signals

```python
 | create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
```

Create reward signals.

**Arguments**:

- `reward_signal_configs`: Reward signal configuration, mapping each `RewardSignalType` to its `RewardSignalSettings`.

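For illustration, a minimal sketch of calling this method on an already-constructed concrete `TorchOptimizer` (e.g. the PPO or SAC optimizer); the helper name and the gamma/strength values are illustrative only:

```python
from typing import Dict

from mlagents.trainers.optimizer.torch_optimizer import TorchOptimizer
from mlagents.trainers.settings import RewardSignalSettings, RewardSignalType


def configure_rewards(optimizer: TorchOptimizer) -> None:
    # Illustrative configuration: a single extrinsic (environment) reward signal.
    configs: Dict[RewardSignalType, RewardSignalSettings] = {
        RewardSignalType.EXTRINSIC: RewardSignalSettings(gamma=0.99, strength=1.0)
    }
    # Creates one reward provider per entry and stores it on the optimizer.
    optimizer.create_reward_signals(configs)
```

In the built-in trainers this call is made from the concrete optimizer's constructor, driven by the `reward_signals` section of the trainer settings.
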
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates"></a>
#### get\_trajectory\_value\_estimates

```python
 | get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
```

Get value estimates and memories for a trajectory, in batch form.

**Arguments**:

- `batch`: An AgentBuffer that consists of a trajectory.
- `next_obs`: The next observation (after the trajectory). Used for bootstrapping
if this is not a terminal trajectory.
- `done`: Set to true if this is a terminal trajectory.
- `agent_id`: Agent ID of the agent that this trajectory belongs to.

**Returns**:

A Tuple of the value estimates as a Dict of [name, np.ndarray(trajectory_len)],
the final value estimate as a Dict of [name, float], and optionally (if using memories)
an AgentBufferField of initial critic memories to be used during update.

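For context, the built-in trainers call this once per completed trajectory before computing returns and advantages. The helper below is only a sketch; it assumes a `Trajectory` from `mlagents.trainers.trajectory` and an already-constructed concrete optimizer:

```python
from typing import Dict, Optional, Tuple

import numpy as np

from mlagents.trainers.buffer import AgentBufferField
from mlagents.trainers.optimizer.torch_optimizer import TorchOptimizer
from mlagents.trainers.trajectory import Trajectory


def trajectory_value_estimates(
    optimizer: TorchOptimizer, trajectory: Trajectory
) -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]:
    # Convert the trajectory into its batch (AgentBuffer) form.
    batch = trajectory.to_agentbuffer()
    return optimizer.get_trajectory_value_estimates(
        batch,
        trajectory.next_obs,
        # Only treat the trajectory as terminal if the episode actually ended;
        # an interrupted (max-step) trajectory should still be bootstrapped.
        done=trajectory.done_reached and not trajectory.interrupted,
        agent_id=trajectory.agent_id,
    )
```

In the PPO trainer, for example, the per-step estimates and the final (bootstrap) value estimate feed the return and advantage computation for that trajectory.
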
<a name="mlagents.trainers.optimizer.optimizer"></a>
# mlagents.trainers.optimizer.optimizer

<a name="mlagents.trainers.optimizer.optimizer.Optimizer"></a>
## Optimizer Objects

```python
class Optimizer(abc.ABC)
```

Creates loss functions and auxiliary networks (e.g. Q or Value) needed for training.
Provides methods to update the Policy.

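Concrete optimizers (for example, the PPO and SAC torch optimizers) subclass this interface. The sketch below is purely illustrative; `NullOptimizer` and its statistic names are hypothetical and only show what a subclass must provide:

```python
from typing import Dict

from mlagents.trainers.buffer import AgentBuffer
from mlagents.trainers.optimizer.optimizer import Optimizer


class NullOptimizer(Optimizer):
    """Hypothetical do-nothing optimizer illustrating the abstract interface."""

    def update(self, batch: AgentBuffer, num_sequences: int) -> Dict[str, float]:
        # A real optimizer would compute losses over the minibatch, step its
        # torch optimizers, and return the resulting statistics.
        return {"Losses/Policy Loss": 0.0, "Losses/Value Loss": 0.0}
```
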
<a name="mlagents.trainers.optimizer.optimizer.Optimizer.update"></a>
#### update

```python
 | @abc.abstractmethod
 | update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
```

Update the Policy based on the batch that was passed in.

**Arguments**:

- `batch`: AgentBuffer that contains the minibatch of data used for this update.
- `num_sequences`: Number of recurrent sequences found in the minibatch.

**Returns**:

A Dict containing statistics (name, value) from the update (e.g. loss).

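During training, a trainer typically drains minibatches from its update buffer, passes each one to `update`, and reports the returned (name, value) statistics. A rough sketch under those assumptions; the helper name and batching scheme are illustrative, not the trainers' exact code:

```python
from typing import Dict, List

from mlagents.trainers.buffer import AgentBuffer
from mlagents.trainers.optimizer.optimizer import Optimizer


def run_update_pass(
    optimizer: Optimizer, buffer: AgentBuffer, batch_size: int, sequence_length: int
) -> Dict[str, List[float]]:
    stats: Dict[str, List[float]] = {}
    # Number of recurrent sequences contained in each minibatch.
    num_sequences = max(batch_size // sequence_length, 1)
    # Walk the buffer in contiguous minibatches and collect the reported stats.
    for start in range(0, buffer.num_experiences - batch_size + 1, batch_size):
        update_stats = optimizer.update(
            buffer.make_mini_batch(start, start + batch_size), num_sequences
        )
        for name, value in update_stats.items():
            stats.setdefault(name, []).append(value)
    return stats
```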