Update README.md
---
title: Reinforcement Learning Graphical Representations
date: 2026-04-08
category: Reinforcement Learning
description: A comprehensive gallery of 72 standard RL components and their graphical representations.
license: mit
---

# Reinforcement Learning Graphical Representations

This repository contains a full set of 72 visualizations covering foundational concepts, algorithms, and advanced topics in Reinforcement Learning.

| Category | Component | Illustration | Details | Context |
|----------|-----------|--------------|---------|---------|
| **MDP & Environment** | **Agent-Environment Interaction Loop** |  | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | All RL algorithms |
| **MDP & Environment** | **Markov Decision Process (MDP) Tuple** |  | (S, A, P, R, γ) with transition dynamics P(s′\|s,a) and reward function R(s,a,s′) | All RL algorithms |
| **MDP & Environment** | **State Transition Graph** |  | Full probabilistic transitions between discrete states | Gridworld, Taxi, Cliff Walking |
| **MDP & Environment** | **Trajectory / Episode Sequence** |  | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Monte Carlo, episodic tasks |
| **MDP & Environment** | **Continuous State/Action Space Visualization** |  | High-dimensional spaces (e.g., robot joints, pixel inputs) | Continuous-control tasks (MuJoCo, PyBullet) |
| **MDP & Environment** | **Reward Function / Landscape** |  | Scalar reward as function of state/action | All algorithms; especially reward shaping |
| **MDP & Environment** | **Discount Factor (γ) Effect** |  | How future rewards are weighted | All discounted MDPs |
| **Value & Policy** | **State-Value Function V(s)** |  | Expected return from state s under policy π | Value-based methods |
| **Value & Policy** | **Action-Value Function Q(s,a)** |  | Expected return from state-action pair | Q-learning family |
| **Value & Policy** | **Policy π(s) or π(a\|s)** |  | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | Policy-based methods |
| **Value & Policy** | **Advantage Function A(s,a)** |  | Q(s,a) – V(s) | A2C, PPO, SAC, TD3 |
| **Value & Policy** | **Optimal Value Function V* / Q*** |  | Solution to Bellman optimality | Value iteration, Q-learning |
| **Dynamic Programming** | **Policy Evaluation Backup** |  | Iterative update of V using Bellman expectation | Policy iteration |
| **Dynamic Programming** | **Policy Improvement** |  | Greedy policy update over Q | Policy iteration |
| **Dynamic Programming** | **Value Iteration Backup** |  | Update using Bellman optimality | Value iteration |
| **Dynamic Programming** | **Policy Iteration Full Cycle** |  | Evaluation → Improvement loop | Classic DP methods |
| **Monte Carlo** | **Monte Carlo Backup** |  | Update using full episode return G_t | First-visit / every-visit MC |
| **Monte Carlo** | **Monte Carlo Tree Search (MCTS)** |  | Search tree with selection, expansion, simulation, backprop | AlphaGo, AlphaZero |
| **Monte Carlo** | **Importance Sampling Ratio** |  | Off-policy correction ρ = π(a\|s) / b(a\|s) | Off-policy Monte Carlo |
| **Temporal Difference** | **TD(0) Backup** |  | Bootstrapped update using R + γV(s′) | TD learning |
| **Temporal Difference** | **Bootstrapping (general)** |  | Using estimated future value instead of full return | All TD methods |
| **Temporal Difference** | **n-step TD Backup** |  | Multi-step return G_t^{(n)} | n-step TD, TD(λ) |
| **Temporal Difference** | **TD(λ) & Eligibility Traces** |  | Decaying trace z_t for credit assignment | TD(λ), SARSA(λ), Q(λ) |
| **Temporal Difference** | **SARSA Update** |  | On-policy TD control | SARSA |
| **Temporal Difference** | **Q-Learning Update** |  | Off-policy TD control | Q-learning, Deep Q-Network |
| **Temporal Difference** | **Expected SARSA** |  | Expectation over next action under policy | Expected SARSA |
| **Temporal Difference** | **Double Q-Learning / Double DQN** |  | Two separate Q estimators to reduce overestimation | Double DQN, TD3 |
| **Temporal Difference** | **Dueling DQN Architecture** |  | Separate streams for state value V(s) and advantage A(s,a) | Dueling DQN |
| **Temporal Difference** | **Prioritized Experience Replay** |  | Importance sampling of transitions by TD error | Prioritized DQN, Rainbow |
| **Temporal Difference** | **Rainbow DQN Components** |  | All extensions combined (Double, Dueling, PER, etc.) | Rainbow DQN |
| **Function Approximation** | **Linear Function Approximation** |  | Feature vector φ(s) → wᵀφ(s) | Tabular → linear FA |
| **Function Approximation** | **Neural Network Layers (MLP, CNN, RNN, Transformer)** |  | Full deep network for value/policy | DQN, A3C, PPO, Decision Transformer |
| **Function Approximation** | **Computation Graph / Backpropagation Flow** |  | Gradient flow through network | All deep RL |
| **Function Approximation** | **Target Network** |  | Frozen copy of Q-network for stability | DQN, DDQN, SAC, TD3 |
| **Policy Gradients** | **Policy Gradient Theorem** |  | ∇_θ J(θ) = E[∇_θ log π(a\|s) · Q^π(s,a)] | Flow diagram from reward → log-prob → gradient |
| **Policy Gradients** | **REINFORCE Update** |  | Monte-Carlo policy gradient | REINFORCE |
| **Policy Gradients** | **Baseline / Advantage Subtraction** |  | Subtract b(s) to reduce variance | All modern PG |
| **Policy Gradients** | **Trust Region (TRPO)** |  | KL-divergence constraint on policy update | TRPO |
| **Policy Gradients** | **Proximal Policy Optimization (PPO)** |  | Clipped surrogate objective | PPO, PPO-Clip |
| **Actor-Critic** | **Actor-Critic Architecture** |  | Separate or shared actor (policy) + critic (value) networks | A2C, A3C, SAC, TD3 |
| **Actor-Critic** | **Advantage Actor-Critic (A2C/A3C)** |  | Synchronous/asynchronous multi-worker | A2C/A3C |
| **Actor-Critic** | **Soft Actor-Critic (SAC)** |  | Entropy-regularized policy + twin critics | SAC |
| **Actor-Critic** | **Twin Delayed DDPG (TD3)** |  | Twin critics + delayed policy + target smoothing | TD3 |
| **Exploration** | **ε-Greedy Strategy** |  | Probability ε of random action | DQN family |
| **Exploration** | **Softmax / Boltzmann Exploration** |  | Temperature τ in softmax | Softmax policies |
| **Exploration** | **Upper Confidence Bound (UCB)** |  | Optimism in face of uncertainty | UCB1, bandits |
| **Exploration** | **Intrinsic Motivation / Curiosity** |  | Prediction error as intrinsic reward | ICM, RND, Curiosity-driven RL |
| **Exploration** | **Entropy Regularization** |  | Bonus term αH(π) | SAC, maximum-entropy RL |
| **Hierarchical RL** | **Options Framework** |  | High-level policy over options (temporally extended actions) | Option-Critic |
| **Hierarchical RL** | **Feudal Networks / Hierarchical Actor-Critic** |  | Manager-worker hierarchy | Feudal RL |
| **Hierarchical RL** | **Skill Discovery** |  | Unsupervised emergence of reusable skills | DIAYN, VALOR |
| **Model-Based RL** | **Learned Dynamics Model** |  | Learned transition model P̂(s′\|s,a) | Separate model network diagram (often RNN or transformer) |
| **Model-Based RL** | **Model-Based Planning** |  | Rollouts inside learned model | MuZero, DreamerV3 |
| **Model-Based RL** | **Imagination-Augmented Agents (I2A)** |  | Imagination module + policy | I2A |
| **Offline RL** | **Offline Dataset** |  | Fixed batch of trajectories | BC, CQL, IQL |
| **Offline RL** | **Conservative Q-Learning (CQL)** |  | Penalty on out-of-distribution actions | CQL |
| **Multi-Agent RL** | **Multi-Agent Interaction Graph** |  | Agents communicating or competing | MARL, MADDPG |
| **Multi-Agent RL** | **Centralized Training Decentralized Execution (CTDE)** |  | Shared critic during training | QMIX, VDN, MADDPG |
| **Multi-Agent RL** | **Cooperative / Competitive Payoff Matrix** |  | Joint reward for multiple agents | Prisoner's Dilemma, multi-agent gridworlds |
| **Inverse RL / IRL** | **Reward Inference** |  | Infer reward from expert demonstrations | IRL, GAIL |
| **Inverse RL / IRL** | **Generative Adversarial Imitation Learning (GAIL)** |  | Discriminator vs. policy generator | GAIL, AIRL |
| **Meta-RL** | **Meta-RL Architecture** |  | Outer loop (meta-policy) + inner loop (task adaptation) | MAML for RL, RL² |
| **Meta-RL** | **Task Distribution Visualization** |  | Multiple MDPs sampled from meta-distribution | Meta-RL benchmarks |
| **Advanced / Misc** | **Experience Replay Buffer** |  | Stored (s,a,r,s′,done) tuples | DQN and all off-policy deep RL |
| **Advanced / Misc** | **State Visitation / Occupancy Measure** |  | Frequency of visiting each state | All algorithms (analysis) |
| **Advanced / Misc** | **Learning Curve** |  | Average episodic return vs. episodes / steps | Standard performance reporting |
| **Advanced / Misc** | **Regret / Cumulative Regret** |  | Sub-optimality accumulated | Bandits and online RL |
| **Advanced / Misc** | **Attention Mechanisms (Transformers in RL)** |  | Attention weights | Decision Transformer, Trajectory Transformer |
| **Advanced / Misc** | **Diffusion Policy** |  | Denoising diffusion process for action generation | Diffusion-RL policies |
| **Advanced / Misc** | **Graph Neural Networks for RL** |  | Node/edge message passing | Graph RL, relational RL |
| **Advanced / Misc** | **World Model / Latent Space** |  | Encoder-decoder dynamics in latent space | Dreamer, PlaNet |
| **Advanced / Misc** | **Convergence Analysis Plots** |  | Error / value change over iterations | DP, TD, value iteration |
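As a concrete companion to the discount-factor entry above, the short sketch below values the same ten-step reward stream under a myopic and a far-sighted γ. The function name `discounted_return` and the constant reward stream are illustrative choices, not part of the gallery.

```python
# Effect of the discount factor: the same reward stream valued under two gammas.
def discounted_return(rewards, gamma):
    # G = r_1 + gamma*r_2 + gamma^2*r_3 + ...  (accumulated backwards)
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0] * 10                     # ten steps of reward 1
print(discounted_return(rewards, 0.5))   # ~2.0: near-term rewards dominate
print(discounted_return(rewards, 0.99))  # ~9.6: close to the undiscounted sum of 10
```

The smaller γ effectively shortens the agent's planning horizon, which is exactly what the γ-effect visualization in the table depicts.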
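Several tabular entries above (the agent-environment interaction loop, ε-greedy exploration, and the Q-learning update) combine into one minimal sketch. The 5-state chain environment below is a hypothetical stand-in for the gridworlds in the gallery, and all constants are illustrative.

```python
import random

# Hypothetical 5-state chain MDP: states 0..4, action 1 moves right, action 0
# moves left (clamped at the edges); reaching state 4 pays +1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def epsilon_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])  # random tie-break

for _ in range(500):                     # agent-environment interaction loop
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s2, r, done = step(s, a)
        # Q-learning (off-policy TD) target: bootstrap on the best next action
        target = r if done else r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
print(greedy)  # greedy policy moves right in every non-terminal state
```

The same loop with the target replaced by `r + GAMMA * Q[(s2, a2_sampled)]` would give SARSA, which is the on-policy counterpart shown in the adjacent row.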
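The REINFORCE update from the policy-gradient rows can likewise be sketched on a toy problem. The 2-armed bandit and the softmax-over-preferences parameterization here are assumptions chosen for brevity, not part of the gallery.

```python
import math
import random

# Hypothetical 2-armed bandit: arm 1 pays +1, arm 0 pays 0.  The policy is a
# softmax over preferences H; for a sampled action a, grad log pi(a) = onehot(a) - pi.
random.seed(0)
H = [0.0, 0.0]
ALPHA = 0.1

def softmax(h):
    z = [math.exp(x - max(h)) for x in h]
    total = sum(z)
    return [x / total for x in z]

for _ in range(2000):
    pi = softmax(H)
    a = 0 if random.random() < pi[0] else 1
    r = 1.0 if a == 1 else 0.0
    # REINFORCE (no baseline): H <- H + alpha * r * grad log pi(a)
    for i in range(len(H)):
        H[i] += ALPHA * r * ((1.0 if i == a else 0.0) - pi[i])

print(round(softmax(H)[1], 3))  # probability of the rewarding arm, close to 1
```

Subtracting a baseline b from r before the update, as in the baseline/advantage row above, leaves the gradient unbiased while reducing its variance.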
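The experience replay buffer entry corresponds to a very small data structure. This sketch (the class name `ReplayBuffer` is illustrative) stores transitions in a bounded deque and samples uniform minibatches, as in DQN-style training; prioritized replay would instead weight samples by TD error.

```python
import random
from collections import deque

# Minimal experience replay buffer: stores (s, a, r, s2, done) tuples in a
# bounded deque and samples uniform minibatches for off-policy updates.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=100)
for t in range(10):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(4)
print(len(batch))  # 4
```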