# Table of Contents

* [mlagents.trainers.optimizer.torch\_optimizer](#mlagents.trainers.optimizer.torch_optimizer)
  * [TorchOptimizer](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer)
    * [create\_reward\_signals](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals)
    * [get\_trajectory\_value\_estimates](#mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates)
* [mlagents.trainers.optimizer.optimizer](#mlagents.trainers.optimizer.optimizer)
  * [Optimizer](#mlagents.trainers.optimizer.optimizer.Optimizer)
    * [update](#mlagents.trainers.optimizer.optimizer.Optimizer.update)

<a name="mlagents.trainers.optimizer.torch_optimizer"></a>
# mlagents.trainers.optimizer.torch\_optimizer

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer"></a>
## TorchOptimizer Objects

```python
class TorchOptimizer(Optimizer)
```

<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.create_reward_signals"></a>
#### create\_reward\_signals

```python
 | create_reward_signals(reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]) -> None
```

Create reward signals.

**Arguments**:

- `reward_signal_configs`: Reward signal configuration, mapping each `RewardSignalType` to its `RewardSignalSettings`.

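For illustration, a minimal sketch of calling this method on an already-constructed concrete `TorchOptimizer` (e.g. the PPO or SAC optimizer); the helper name and the gamma/strength values are illustrative only:

```python
from typing import Dict

from mlagents.trainers.optimizer.torch_optimizer import TorchOptimizer
from mlagents.trainers.settings import RewardSignalSettings, RewardSignalType


def configure_rewards(optimizer: TorchOptimizer) -> None:
    # Illustrative configuration: a single extrinsic (environment) reward signal.
    configs: Dict[RewardSignalType, RewardSignalSettings] = {
        RewardSignalType.EXTRINSIC: RewardSignalSettings(gamma=0.99, strength=1.0)
    }
    # Creates one reward provider per entry and stores it on the optimizer.
    optimizer.create_reward_signals(configs)
```

In the built-in trainers this call is made from the concrete optimizer's constructor, driven by the `reward_signals` section of the trainer settings.
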
<a name="mlagents.trainers.optimizer.torch_optimizer.TorchOptimizer.get_trajectory_value_estimates"></a>
#### get\_trajectory\_value\_estimates

```python
 | get_trajectory_value_estimates(batch: AgentBuffer, next_obs: List[np.ndarray], done: bool, agent_id: str = "") -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]
```

Get value estimates and memories for a trajectory, in batch form.

**Arguments**:

- `batch`: An AgentBuffer that consists of a trajectory.
- `next_obs`: The next observation (after the trajectory). Used for bootstrapping
if this is not a terminal trajectory.
- `done`: Set to true if this is a terminal trajectory.
- `agent_id`: Agent ID of the agent that this trajectory belongs to.

**Returns**:

A Tuple of the value estimates as a Dict of [name, np.ndarray(trajectory_len)],
the final value estimate as a Dict of [name, float], and optionally (if using memories)
an AgentBufferField of initial critic memories to be used during update.

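For context, the built-in trainers call this once per completed trajectory before computing returns and advantages. The helper below is only a sketch; it assumes a `Trajectory` from `mlagents.trainers.trajectory` and an already-constructed concrete optimizer:

```python
from typing import Dict, Optional, Tuple

import numpy as np

from mlagents.trainers.buffer import AgentBufferField
from mlagents.trainers.optimizer.torch_optimizer import TorchOptimizer
from mlagents.trainers.trajectory import Trajectory


def trajectory_value_estimates(
    optimizer: TorchOptimizer, trajectory: Trajectory
) -> Tuple[Dict[str, np.ndarray], Dict[str, float], Optional[AgentBufferField]]:
    # Convert the trajectory into its batch (AgentBuffer) form.
    batch = trajectory.to_agentbuffer()
    return optimizer.get_trajectory_value_estimates(
        batch,
        trajectory.next_obs,
        # Only treat the trajectory as terminal if the episode actually ended;
        # an interrupted (max-step) trajectory should still be bootstrapped.
        done=trajectory.done_reached and not trajectory.interrupted,
        agent_id=trajectory.agent_id,
    )
```

In the PPO trainer, for example, the per-step estimates and the final (bootstrap) value estimate feed the return and advantage computation for that trajectory.
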
<a name="mlagents.trainers.optimizer.optimizer"></a>
# mlagents.trainers.optimizer.optimizer

<a name="mlagents.trainers.optimizer.optimizer.Optimizer"></a>
## Optimizer Objects

```python
class Optimizer(abc.ABC)
```

Creates loss functions and auxiliary networks (e.g. Q or Value) needed for training.
Provides methods to update the Policy.

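Concrete optimizers (for example, the PPO and SAC torch optimizers) subclass this interface. The sketch below is purely illustrative; `NullOptimizer` and its statistic names are hypothetical and only show what a subclass must provide:

```python
from typing import Dict

from mlagents.trainers.buffer import AgentBuffer
from mlagents.trainers.optimizer.optimizer import Optimizer


class NullOptimizer(Optimizer):
    """Hypothetical do-nothing optimizer illustrating the abstract interface."""

    def update(self, batch: AgentBuffer, num_sequences: int) -> Dict[str, float]:
        # A real optimizer would compute losses over the minibatch, step its
        # torch optimizers, and return the resulting statistics.
        return {"Losses/Policy Loss": 0.0, "Losses/Value Loss": 0.0}
```
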
<a name="mlagents.trainers.optimizer.optimizer.Optimizer.update"></a>
#### update

```python
 | @abc.abstractmethod
 | update(batch: AgentBuffer, num_sequences: int) -> Dict[str, float]
```

Update the Policy based on the batch that was passed in.

**Arguments**:

- `batch`: AgentBuffer that contains the minibatch of data used for this update.
- `num_sequences`: Number of recurrent sequences found in the minibatch.

**Returns**:

A Dict containing statistics (name, value) from the update (e.g. loss).

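During training, a trainer typically drains minibatches from its update buffer, passes each one to `update`, and reports the returned (name, value) statistics. A rough sketch under those assumptions; the helper name and batching scheme are illustrative, not the trainers' exact code:

```python
from typing import Dict, List

from mlagents.trainers.buffer import AgentBuffer
from mlagents.trainers.optimizer.optimizer import Optimizer


def run_update_pass(
    optimizer: Optimizer, buffer: AgentBuffer, batch_size: int, sequence_length: int
) -> Dict[str, List[float]]:
    stats: Dict[str, List[float]] = {}
    # Number of recurrent sequences contained in each minibatch.
    num_sequences = max(batch_size // sequence_length, 1)
    # Walk the buffer in contiguous minibatches and collect the reported stats.
    for start in range(0, buffer.num_experiences - batch_size + 1, batch_size):
        update_stats = optimizer.update(
            buffer.make_mini_batch(start, start + batch_size), num_sequences
        )
        for name, value in update_stats.items():
            stats.setdefault(name, []).append(value)
    return stats
```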