| Category | Component | Detailed Description | Common Graphical Presentation | Typical Algorithms / Contexts |
|---|---|---|---|---|
| MDP & Environment | Agent-Environment Interaction Loop | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | Circular flowchart or block diagram with arrows (S → A → R, S′) | All RL algorithms |
| MDP & Environment | Markov Decision Process (MDP) Tuple | (S, A, P, R, γ) with transition dynamics and reward function | Directed graph (nodes = states, labeled edges = actions with P(s′|s,a) and R(s,a,s′)) | Foundational theory, all model-based methods |
| MDP & Environment | State Transition Graph | Full probabilistic transitions between discrete states | Graph diagram with probability-weighted arrows | Gridworld, Taxi, Cliff Walking |
| MDP & Environment | Trajectory / Episode Sequence | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Linear timeline or chain diagram | Monte Carlo, episodic tasks |
| MDP & Environment | Continuous State/Action Space Visualization | High-dimensional spaces (e.g., robot joints, pixel inputs) | 2D/3D scatter plots, density heatmaps, or manifold projections | Continuous-control tasks (MuJoCo, PyBullet) |
| MDP & Environment | Reward Function / Landscape | Scalar reward as function of state/action | 3D surface plot, contour plot, or heatmap | All algorithms; especially reward shaping |
| MDP & Environment | Discount Factor (γ) Effect | How future rewards are weighted | Line plot of geometric decay series or cumulative return curves for different γ | All discounted MDPs |
| Value & Policy | State-Value Function V(s) | Expected return from state s under policy π | Heatmap (gridworld), 3D surface plot, or contour plot | Value-based methods |
| Value & Policy | Action-Value Function Q(s,a) | Expected return from state-action pair | Q-table (discrete) or heatmap per action; 3D surface for continuous | Q-learning family |
| Value & Policy | Policy π(s) or π(a|s) | Stochastic or deterministic mapping | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | All policy-based methods |
| Value & Policy | Advantage Function A(s,a) | Q(s,a) – V(s) | Comparative bar/heatmap or signed surface plot | A2C, A3C, PPO, GAE-based methods |
| Value & Policy | Optimal Value Function V* / Q* | Solution to Bellman optimality | Heatmap or surface with arrows showing greedy policy | Value iteration, Q-learning |
| Dynamic Programming | Policy Evaluation Backup | Iterative update of V using Bellman expectation | Backup diagram (current state points to all successor states with probabilities) | Policy iteration |
| Dynamic Programming | Policy Improvement | Greedy policy update over Q | Arrow diagram showing before/after policy on grid | Policy iteration |
| Dynamic Programming | Value Iteration Backup | Update using Bellman optimality | Single backup diagram (max over actions) | Value iteration |
| Dynamic Programming | Policy Iteration Full Cycle | Evaluation → Improvement loop | Multi-step flowchart or convergence plot (error vs iterations) | Classic DP methods |
| Monte Carlo | Monte Carlo Backup | Update using full episode return G_t | Backup diagram (leaf node = actual return G_t) | First-visit / every-visit MC |
| Monte Carlo | Monte Carlo Tree Search (MCTS) | Search tree built by selection, expansion, simulation, and backup of values | Full tree diagram with visit counts and value bars | AlphaGo, AlphaZero |
| Monte Carlo | Importance Sampling Ratio | Off-policy correction ρ = π(a|s)/b(a|s) | Flow diagram showing weight multiplication along trajectory | Off-policy MC |
| Temporal Difference | TD(0) Backup | Bootstrapped update using R + γV(s′) | One-step backup diagram | TD learning |
| Temporal Difference | Bootstrapping (general) | Using estimated future value instead of full return | Layered backup diagram showing estimate ← estimate | All TD methods |
| Temporal Difference | n-step TD Backup | Multi-step return G_t^{(n)} | Multi-step backup diagram with n arrows | n-step TD, TD(λ) |
| Temporal Difference | TD(λ) & Eligibility Traces | Decaying trace z_t for credit assignment | Trace-decay curve or accumulating/replacing trace diagram | TD(λ), SARSA(λ), Q(λ) |
| Temporal Difference | SARSA Update | On-policy TD control | Backup diagram identical to TD but using next action from current policy | SARSA |
| Temporal Difference | Q-Learning Update | Off-policy TD control | Backup diagram using max_a′ Q(s′,a′) | Q-learning, Deep Q-Network |
| Temporal Difference | Expected SARSA | Expectation over next action under policy | Backup diagram with weighted sum over actions | Expected SARSA |
| Temporal Difference | Double Q-Learning / Double DQN | Two separate Q estimators to reduce overestimation | Dual-network backup diagram | Double DQN, TD3 |
| Temporal Difference | Dueling DQN Architecture | Separate streams for state value V(s) and advantage A(s,a) | Neural net diagram with two heads merging into Q | Dueling DQN |
| Temporal Difference | Prioritized Experience Replay | Importance sampling of transitions by TD error | Priority queue diagram or histogram of priorities | Prioritized DQN, Rainbow |
| Temporal Difference | Rainbow DQN Components | All extensions combined (Double, Dueling, PER, etc.) | Composite architecture diagram | Rainbow DQN |
| Function Approximation | Linear Function Approximation | Feature vector φ(s) → wᵀφ(s) | Weight vector diagram or basis function plots | Tabular → linear FA |
| Function Approximation | Neural Network Layers (MLP, CNN, RNN, Transformer) | Full deep network for value/policy | Layer-by-layer architecture diagram with activation shapes | DQN, A3C, PPO, Decision Transformer |
| Function Approximation | Computation Graph / Backpropagation Flow | Gradient flow through network | Directed acyclic graph (DAG) of operations | All deep RL |
| Function Approximation | Target Network | Frozen copy of Q-network for stability | Dual-network diagram with periodic copy arrow | DQN, DDQN, SAC, TD3 |
| Policy Gradients | Policy Gradient Theorem | ∇_θ J(θ) = E[∇_θ log π_θ(a|s) ⋅ Q^π(s,a)] | Flow diagram from reward → log-prob → gradient | REINFORCE, PG methods |
| Policy Gradients | REINFORCE Update | Monte-Carlo policy gradient | Full-trajectory gradient flow diagram | REINFORCE |
| Policy Gradients | Baseline / Advantage Subtraction | Subtract b(s) to reduce variance | Diagram comparing raw return vs. advantage-scaled gradient | All modern PG |
| Policy Gradients | Trust Region (TRPO) | KL-divergence constraint on policy update | Constraint boundary diagram or trust-region circle | TRPO |
| Policy Gradients | Proximal Policy Optimization (PPO) | Clipped surrogate objective | Clip function plot (min/max bounds) | PPO, PPO-Clip |
| Actor-Critic | Actor-Critic Architecture | Separate or shared actor (policy) + critic (value) networks | Dual-network diagram with shared backbone option | A2C, A3C, SAC, TD3 |
| Actor-Critic | Advantage Actor-Critic (A2C/A3C) | Synchronous (A2C) or asynchronous (A3C) multi-worker training | Multi-threaded diagram with global parameter server | A2C, A3C |
| Actor-Critic | Soft Actor-Critic (SAC) | Entropy-regularized policy + twin critics | Architecture with entropy bonus term shown as extra input | SAC |
| Actor-Critic | Twin Delayed DDPG (TD3) | Twin critics + delayed policy + target smoothing | Three-network diagram (actor + two critics) | TD3 |
| Exploration | ε-Greedy Strategy | Probability ε of random action | Decay curve plot (ε vs. episodes) | DQN family |
| Exploration | Softmax / Boltzmann Exploration | Temperature τ in softmax | Temperature decay curve or probability surface | Softmax policies |
| Exploration | Upper Confidence Bound (UCB) | Optimism in face of uncertainty | Confidence bound bars on action values | UCB1, bandits |
| Exploration | Intrinsic Motivation / Curiosity | Prediction error as intrinsic reward | Separate intrinsic reward module diagram | ICM, RND, Curiosity-driven RL |
| Exploration | Entropy Regularization | Bonus term αH(π) | Entropy plot or bonus curve | SAC, maximum-entropy RL |
| Hierarchical RL | Options Framework | High-level policy over options (temporally extended actions) | Hierarchical diagram with option policy layer | Option-Critic |
| Hierarchical RL | Feudal Networks / Hierarchical Actor-Critic | Manager-worker hierarchy | Multi-level network diagram | Feudal RL |
| Hierarchical RL | Skill Discovery | Unsupervised emergence of reusable skills | Skill embedding space visualization | DIAYN, VALOR |
| Model-Based RL | Learned Dynamics Model | P̂(s′|s,a) or world model | Separate model network diagram (often RNN or transformer) | Dyna, MBPO, Dreamer |
| Model-Based RL | Model-Based Planning | Rollouts inside learned model | Tree or rollout diagram inside model | MuZero, DreamerV3 |
| Model-Based RL | Imagination-Augmented Agents (I2A) | Imagination module + policy | Imagination rollout diagram | I2A |
| Offline RL | Offline Dataset | Fixed batch of trajectories | Replay buffer diagram (no interaction arrow) | BC, CQL, IQL |
| Offline RL | Conservative Q-Learning (CQL) | Penalty on out-of-distribution actions | Q-value regularization diagram | CQL |
| Multi-Agent RL | Multi-Agent Interaction Graph | Agents communicating or competing | Graph with nodes = agents, edges = communication | MARL, MADDPG |
| Multi-Agent RL | Centralized Training, Decentralized Execution (CTDE) | Shared critic during training | Dual-view diagram (central critic vs. local actors) | QMIX, VDN, MADDPG |
| Multi-Agent RL | Cooperative / Competitive Payoff Matrix | Joint reward for multiple agents | Heatmap matrix of joint rewards | Prisoner's Dilemma, multi-agent gridworlds |
| Inverse RL / IRL | Reward Inference | Infer reward from expert demonstrations | Demonstration trajectory → inferred reward heatmap | IRL, GAIL |
| Inverse RL / IRL | Generative Adversarial Imitation Learning (GAIL) | Discriminator vs. policy generator | GAN-style diagram adapted for trajectories | GAIL, AIRL |
| Meta-RL | Meta-RL Architecture | Outer loop (meta-policy) + inner loop (task adaptation) | Nested loop diagram | MAML for RL, RL² |
| Meta-RL | Task Distribution Visualization | Multiple MDPs sampled from meta-distribution | Grid of task environments or embedding space | Meta-RL benchmarks |
| Advanced / Misc | Experience Replay Buffer | Stored (s,a,r,s′,done) tuples | FIFO queue or prioritized sampling diagram | DQN and all off-policy deep RL |
| Advanced / Misc | State Visitation / Occupancy Measure | Frequency of visiting each state | Heatmap or density plot | All algorithms (analysis) |
| Advanced / Misc | Learning Curve | Average episodic return vs. episodes / steps | Line plot with confidence bands | Standard performance reporting |
| Advanced / Misc | Regret / Cumulative Regret | Sub-optimality accumulated | Cumulative sum plot | Bandits and online RL |
| Advanced / Misc | Attention Mechanisms (Transformers in RL) | Attention weights | Attention heatmap or token highlighting | Decision Transformer, Trajectory Transformer |
| Advanced / Misc | Diffusion Policy | Denoising diffusion process for action generation | Step-by-step denoising trajectory diagram | Diffusion-RL policies |
| Advanced / Misc | Graph Neural Networks for RL | Node/edge message passing | Graph convolution diagram | Graph RL, relational RL |
| Advanced / Misc | World Model / Latent Space | Encoder-decoder dynamics in latent space | Encoder → latent → decoder diagram | Dreamer, PlaNet |
| Advanced / Misc | Convergence Analysis Plots | Error / value change over iterations | Log-scale convergence curves | DP, TD, value iteration |
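To make the "Discount Factor (γ) Effect" and "Trajectory / Episode Sequence" rows concrete, here is a minimal sketch (the function name and reward sequence are illustrative) of how γ reweights the same episode's rewards:

```python
# Illustrative sketch: computing the discounted return G_0 = sum_t gamma^t * r_t
# for a fixed reward sequence under two different discount factors.

def discounted_return(rewards, gamma):
    """Backward accumulation: G = r + gamma * G, starting from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))   # undiscounted sum -> 4.0
print(discounted_return(rewards, 0.5))   # 1 + 0.5 + 0.25 + 0.125 -> 1.875
```

Plotting `discounted_return` for several γ values over increasing horizons reproduces the geometric-decay curves the row describes.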
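The Bellman optimality backup in the "Value Iteration Backup" row can be sketched on a toy MDP; the two-state transition table below is invented purely for illustration:

```python
# Minimal value iteration on a made-up 2-state, 2-action MDP.
# P[s][a] = list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(200):  # repeated Bellman optimality backups: V <- max_a E[r + gamma V(s')]
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
# Always taking action 1 earns reward 1 per step, so V*(s) = 1 / (1 - gamma) = 10.
```

Logging `max |V_new - V_old|` per sweep gives exactly the log-scale convergence curve mentioned in the "Convergence Analysis Plots" row.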
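The "TD(λ) & Eligibility Traces" row corresponds to a short update rule; this is a hedged sketch with accumulating traces, where the step sizes and the two-state value table are illustrative:

```python
# One TD(lambda) step with accumulating eligibility traces:
# every recently visited state shares credit for the current TD error.
def td_lambda_step(V, z, s, r, s2, alpha=0.1, gamma=0.9, lam=0.8):
    delta = r + gamma * V[s2] - V[s]   # one-step TD error
    z[s] += 1.0                        # accumulate the trace for the current state
    for st in V:
        V[st] += alpha * delta * z[st] # update all states in proportion to their trace
        z[st] *= gamma * lam           # decay every trace by gamma * lambda

V = {0: 0.0, 1: 0.0}
z = {0: 0.0, 1: 0.0}
td_lambda_step(V, z, 0, 1.0, 1)
# V[0] -> 0.1 (alpha * delta * z), z[0] -> 0.72 (decayed by gamma * lam = 0.72)
```

The decaying `z` values are what the trace-decay curve in that row visualizes.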
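The "Q-Learning Update" and "ε-Greedy Strategy" rows combine into a few lines of code; this sketch uses a tiny invented two-state setup, with hyperparameters chosen only for illustration:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit greedily

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy target uses the max over next actions: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = [0, 1]
Q = {(s, a): 0.0 for s in range(2) for a in actions}
q_update(Q, 0, 1, 1.0, 1, actions)   # one transition (s=0, a=1, r=1, s'=1)
# Q[(0, 1)] moves alpha of the way toward the target: 0.1 * 1.0 = 0.1
```

Replacing `max(...)` in the target with the Q-value of the next action actually taken turns this into the SARSA update from the adjacent row.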
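The clip-function plot in the "Proximal Policy Optimization (PPO)" row comes from the clipped surrogate objective; here is a scalar sketch for a single (ratio, advantage) pair, whereas real implementations vectorize this over a batch:

```python
# PPO-Clip surrogate for one sample:
# ratio = pi_theta(a|s) / pi_theta_old(a|s), advantage = A_hat(s, a).
def ppo_clip_objective(ratio, advantage, eps=0.2):
    clipped = max(min(ratio, 1 + eps), 1 - eps)   # clamp ratio to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

print(ppo_clip_objective(1.5, 1.0))    # positive advantage, ratio clipped: 1.2
print(ppo_clip_objective(0.5, -1.0))   # negative advantage, ratio clipped: -0.8
```

The `min` makes the objective pessimistic: large ratio moves stop improving the objective, which is exactly the flat region shown in the usual clip plot.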
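The temperature τ in the "Softmax / Boltzmann Exploration" row controls a simple transformation of action values; a minimal sketch (function name and Q-values illustrative):

```python
import math

def boltzmann_probs(qvalues, tau):
    # Softmax over action values with temperature tau:
    # small tau approaches greedy, large tau approaches uniform.
    m = max(q / tau for q in qvalues)            # subtract the max for numerical stability
    exps = [math.exp(q / tau - m) for q in qvalues]
    z = sum(exps)
    return [e / z for e in exps]

print(boltzmann_probs([1.0, 2.0], tau=1.0))     # higher-valued action is more likely
print(boltzmann_probs([1.0, 2.0], tau=100.0))   # near-uniform at high temperature
```

Sweeping τ and plotting the resulting probabilities produces the probability surface the row describes.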
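Finally, the FIFO queue in the "Experience Replay Buffer" row maps directly onto a deque; a minimal sketch, with the capacity and field order as illustrative choices:

```python
import collections
import random

class ReplayBuffer:
    """Uniform-sampling FIFO replay buffer storing (s, a, r, s', done) tuples."""
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)  # oldest tuple evicted first
    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))
    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)     # uniform, without replacement
    def __len__(self):
        return len(self.buf)

rb = ReplayBuffer(capacity=2)
rb.push(0, 1, 1.0, 1, False)
rb.push(1, 0, 0.0, 0, False)
rb.push(0, 1, 1.0, 1, True)   # exceeds capacity, so the first tuple is evicted
print(len(rb))                 # -> 2
```

Prioritized replay (the "Prioritized Experience Replay" row) replaces the uniform `sample` with sampling proportional to TD error.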
This table collects the standard, widely published graphical components of reinforcement learning: foundational theory, classic algorithms, deep RL extensions, modern variants, and analysis tools. It draws on Sutton & Barto (2nd ed.), the major deep RL papers (DQN through DreamerV3), and common visualization practice in the literature, and aims to cover every component routinely shown as a backup diagram, flowchart, architecture sketch, heatmap, or plot. If you want the actual image/diagram for any row, or a deeper dive into one, just specify the row!