algorembrant committed (verified) · Commit 3e7f2be · Parent: 8744e5e

Update README.md

Files changed (1): README.md (+86 −80)
---
title: Reinforcement Learning Graphical Representations
date: 2026-04-08
category: Reinforcement Learning
description: A comprehensive gallery of 72 standard RL components and their graphical presentations.
license: mit
---

# Reinforcement Learning Graphical Representations

This repository contains a full set of 72 visualizations representing foundational concepts, algorithms, and advanced topics in Reinforcement Learning.

| Category | Component | Illustration | Details | Context |
|----------|-----------|--------------|---------|---------|
| **MDP & Environment** | **Agent-Environment Interaction Loop** | ![Illustration](graphs/agent_environment_interaction_loop.png) | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
| **MDP & Environment** | **Markov Decision Process (MDP) Tuple** | ![Illustration](graphs/markov_decision_process_mdp_tuple.png) | (S, A, P, R, γ) with transition dynamics P(s′\|s,a) and reward function R(s,a,s′) | All MDP-based methods |
| **MDP & Environment** | **State Transition Graph** | ![Illustration](graphs/state_transition_graph.png) | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| **MDP & Environment** | **Trajectory / Episode Sequence** | ![Illustration](graphs/trajectory_episode_sequence.png) | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| **MDP & Environment** | **Continuous State/Action Space Visualization** | ![Illustration](graphs/continuous_state_action_space_visualization.png) | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| **MDP & Environment** | **Reward Function / Landscape** | ![Illustration](graphs/reward_function_landscape.png) | Scalar reward as function of state/action | All algorithms; especially reward shaping |
| **MDP & Environment** | **Discount Factor (γ) Effect** | ![Illustration](graphs/discount_factor_effect.png) | How future rewards are weighted | All discounted MDPs |
| **Value & Policy** | **State-Value Function V(s)** | ![Illustration](graphs/state_value_function_v_s.png) | Expected return from state s under policy π | Value-based methods |
| **Value & Policy** | **Action-Value Function Q(s,a)** | ![Illustration](graphs/action_value_function_q_s_a.png) | Expected return from state-action pair | Q-learning family |
| **Value & Policy** | **Policy π(s) or π(a\|s)** | ![Illustration](graphs/policy_s_or_a.png) | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | Policy-based and actor-critic methods |
| **Value & Policy** | **Advantage Function A(s,a)** | ![Illustration](graphs/advantage_function_a_s_a.png) | Q(s,a) − V(s) | A2C, PPO, SAC, TD3 |
| **Value & Policy** | **Optimal Value Function V* / Q*** | ![Illustration](graphs/optimal_value_function_v_q.png) | Solution to Bellman optimality | Value iteration, Q-learning |
| **Dynamic Programming** | **Policy Evaluation Backup** | ![Illustration](graphs/policy_evaluation_backup.png) | Iterative update of V using Bellman expectation | Policy iteration |
| **Dynamic Programming** | **Policy Improvement** | ![Illustration](graphs/policy_improvement.png) | Greedy policy update over Q | Policy iteration |
| **Dynamic Programming** | **Value Iteration Backup** | ![Illustration](graphs/value_iteration_backup.png) | Update using Bellman optimality | Value iteration |
| **Dynamic Programming** | **Policy Iteration Full Cycle** | ![Illustration](graphs/policy_iteration_full_cycle.png) | Evaluation → Improvement loop | Classic DP methods |
| **Monte Carlo** | **Monte Carlo Backup** | ![Illustration](graphs/monte_carlo_backup.png) | Update using full episode return G_t | First-visit / every-visit MC |
| **Monte Carlo** | **Monte Carlo Tree (MCTS)** | ![Illustration](graphs/monte_carlo_tree_mcts.png) | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
| **Monte Carlo** | **Importance Sampling Ratio** | ![Illustration](graphs/importance_sampling_ratio.png) | Off-policy correction ρ = π(a\|s) / b(a\|s) | Off-policy Monte Carlo |
| **Temporal Difference** | **TD(0) Backup** | ![Illustration](graphs/td_0_backup.png) | Bootstrapped update using R + γV(s′) | TD learning |
| **Temporal Difference** | **Bootstrapping (general)** | ![Illustration](graphs/bootstrapping_general.png) | Using estimated future value instead of full return | All TD methods |
| **Temporal Difference** | **n-step TD Backup** | ![Illustration](graphs/n_step_td_backup.png) | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| **Temporal Difference** | **TD(λ) & Eligibility Traces** | ![Illustration](graphs/td_eligibility_traces.png) | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| **Temporal Difference** | **SARSA Update** | ![Illustration](graphs/sarsa_update.png) | On-policy TD control | SARSA |
| **Temporal Difference** | **Q-Learning Update** | ![Illustration](graphs/q_learning_update.png) | Off-policy TD control | Q-learning, Deep Q-Network |
| **Temporal Difference** | **Expected SARSA** | ![Illustration](graphs/expected_sarsa.png) | Expectation over next action under policy | Expected SARSA |
| **Temporal Difference** | **Double Q-Learning / Double DQN** | ![Illustration](graphs/double_q_learning_double_dqn.png) | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| **Temporal Difference** | **Dueling DQN Architecture** | ![Illustration](graphs/dueling_dqn_architecture.png) | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| **Temporal Difference** | **Prioritized Experience Replay** | ![Illustration](graphs/prioritized_experience_replay.png) | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| **Temporal Difference** | **Rainbow DQN Components** | ![Illustration](graphs/rainbow_dqn_components.png) | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| **Function Approximation** | **Linear Function Approximation** | ![Illustration](graphs/linear_function_approximation.png) | Feature vector φ(s) → wᵀφ(s) | Tabular linear FA |
| **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** | ![Illustration](graphs/neural_network_layers_mlp_cnn_rnn_transformer.png) | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| **Function Approximation** | **Computation Graph / Backpropagation Flow** | ![Illustration](graphs/computation_graph_backpropagation_flow.png) | Gradient flow through network | All deep RL |
| **Function Approximation** | **Target Network** | ![Illustration](graphs/target_network.png) | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| **Policy Gradients** | **Policy Gradient Theorem** | ![Illustration](graphs/policy_gradient_theorem.png) | ∇_θ J(θ) = E[∇_θ log π(a\|s) Q^π(s,a)] | Flow diagram from reward → log-prob → gradient |
| **Policy Gradients** | **REINFORCE Update** | ![Illustration](graphs/reinforce_update.png) | Monte-Carlo policy gradient | REINFORCE |
| **Policy Gradients** | **Baseline / Advantage Subtraction** | ![Illustration](graphs/baseline_advantage_subtraction.png) | Subtract b(s) to reduce variance | All modern PG |
| **Policy Gradients** | **Trust Region (TRPO)** | ![Illustration](graphs/trust_region_trpo.png) | KL-divergence constraint on policy update | TRPO |
| **Policy Gradients** | **Proximal Policy Optimization (PPO)** | ![Illustration](graphs/proximal_policy_optimization_ppo.png) | Clipped surrogate objective | PPO, PPO-Clip |
| **Actor-Critic** | **Actor-Critic Architecture** | ![Illustration](graphs/actor_critic_architecture.png) | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** | ![Illustration](graphs/advantage_actor_critic_a2c_a3c.png) | Synchronous/asynchronous multi-worker | A2C/A3C |
| **Actor-Critic** | **Soft Actor-Critic (SAC)** | ![Illustration](graphs/soft_actor_critic_sac.png) | Entropy-regularized policy + twin critics | SAC |
| **Actor-Critic** | **Twin Delayed DDPG (TD3)** | ![Illustration](graphs/twin_delayed_ddpg_td3.png) | Twin critics + delayed policy + target smoothing | TD3 |
| **Exploration** | **ε-Greedy Strategy** | ![Illustration](graphs/greedy_strategy.png) | Probability ε of random action | DQN family |
| **Exploration** | **Softmax / Boltzmann Exploration** | ![Illustration](graphs/softmax_boltzmann_exploration.png) | Temperature τ in softmax | Softmax policies |
| **Exploration** | **Upper Confidence Bound (UCB)** | ![Illustration](graphs/upper_confidence_bound_ucb.png) | Optimism in face of uncertainty | UCB1, bandits |
| **Exploration** | **Intrinsic Motivation / Curiosity** | ![Illustration](graphs/intrinsic_motivation_curiosity.png) | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
| **Exploration** | **Entropy Regularization** | ![Illustration](graphs/entropy_regularization.png) | Bonus term αH(π) | SAC, maximum-entropy RL |
| **Hierarchical RL** | **Options Framework** | ![Illustration](graphs/options_framework.png) | High-level policy over options (temporally extended actions) | Option-Critic |
| **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** | ![Illustration](graphs/feudal_networks_hierarchical_actor_critic.png) | Manager-worker hierarchy | Feudal RL |
| **Hierarchical RL** | **Skill Discovery** | ![Illustration](graphs/skill_discovery.png) | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| **Model-Based RL** | **Learned Dynamics Model** | ![Illustration](graphs/learned_dynamics_model.png) | Learned transition model P̂(s′\|s,a) | Separate model network diagram (often RNN or transformer) |
| **Model-Based RL** | **Model-Based Planning** | ![Illustration](graphs/model_based_planning.png) | Rollouts inside learned model | MuZero, DreamerV3 |
| **Model-Based RL** | **Imagination-Augmented Agents (I2A)** | ![Illustration](graphs/imagination_augmented_agents_i2a.png) | Imagination module + policy | I2A |
| **Offline RL** | **Offline Dataset** | ![Illustration](graphs/offline_dataset.png) | Fixed batch of trajectories | BC, CQL, IQL |
| **Offline RL** | **Conservative Q-Learning (CQL)** | ![Illustration](graphs/conservative_q_learning_cql.png) | Penalty on out-of-distribution actions | CQL |
| **Multi-Agent RL** | **Multi-Agent Interaction Graph** | ![Illustration](graphs/multi_agent_interaction_graph.png) | Agents communicating or competing | MARL, MADDPG |
| **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** | ![Illustration](graphs/centralized_training_decentralized_execution_ctde.png) | Shared critic during training | QMIX, VDN, MADDPG |
| **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** | ![Illustration](graphs/cooperative_competitive_payoff_matrix.png) | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| **Inverse RL / IRL** | **Reward Inference** | ![Illustration](graphs/reward_inference.png) | Infer reward from expert demonstrations | IRL, GAIL |
| **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** | ![Illustration](graphs/generative_adversarial_imitation_learning_gail.png) | Discriminator vs. policy generator | GAIL, AIRL |
| **Meta-RL** | **Meta-RL Architecture** | ![Illustration](graphs/meta_rl_architecture.png) | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| **Meta-RL** | **Task Distribution Visualization** | ![Illustration](graphs/task_distribution_visualization.png) | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| **Advanced / Misc** | **Experience Replay Buffer** | ![Illustration](graphs/experience_replay_buffer.png) | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
| **Advanced / Misc** | **State Visitation / Occupancy Measure** | ![Illustration](graphs/state_visitation_occupancy_measure.png) | Frequency of visiting each state | All algorithms (analysis) |
| **Advanced / Misc** | **Learning Curve** | ![Illustration](graphs/learning_curve.png) | Average episodic return vs. episodes / steps | Standard performance reporting |
| **Advanced / Misc** | **Regret / Cumulative Regret** | ![Illustration](graphs/regret_cumulative_regret.png) | Sub-optimality accumulated | Bandits and online RL |
| **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** | ![Illustration](graphs/attention_mechanisms_transformers_in_rl.png) | Attention weights | Decision Transformer, Trajectory Transformer |
| **Advanced / Misc** | **Diffusion Policy** | ![Illustration](graphs/diffusion_policy.png) | Denoising diffusion process for action generation | Diffusion-RL policies |
| **Advanced / Misc** | **Graph Neural Networks for RL** | ![Illustration](graphs/graph_neural_networks_for_rl.png) | Node/edge message passing | Graph RL, relational RL |
| **Advanced / Misc** | **World Model / Latent Space** | ![Illustration](graphs/world_model_latent_space.png) | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| **Advanced / Misc** | **Convergence Analysis Plots** | ![Illustration](graphs/convergence_analysis_plots.png) | Error / value change over iterations | DP, TD, value iteration |
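To connect the diagrams to running code, the Q-Learning Update and ε-Greedy Strategy rows above can be sketched in a few lines of tabular Python. The 5-state chain MDP, its hyperparameters, and all names below are illustrative assumptions for this sketch, not files or settings from this repository:

```python
import random
from collections import defaultdict

# Tabular Q-learning with epsilon-greedy exploration on a toy 5-state
# chain MDP (hypothetical example): move left/right, reward 1 at the end.
N_STATES, ACTIONS = 5, [0, 1]          # action 0: left, action 1: right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # step size, discount, exploration rate

def step(s, a):
    """Environment transition: clamp to the chain; terminal at the right end."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = defaultdict(float)  # Q(s, a), implicitly zero-initialized

def epsilon_greedy(s):
    """With probability EPSILON act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s2, r, done = step(s, a)
        # Q-learning (off-policy TD control): bootstrap on max_a' Q(s', a')
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) * (not done)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # once learning converges, the greedy policy points right: [1, 1, 1, 1]
```

The same loop structure underlies the SARSA Update row: replacing the `max` in the target with the Q-value of the action the policy actually takes next turns this off-policy update into the on-policy one.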