| Category | Component | Detailed Description | Common Graphical Presentation | Typical Algorithms / Contexts |
| --- | --- | --- | --- | --- |
| MDP & Environment | Agent-Environment Interaction Loop | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | Circular flowchart or block diagram with arrows (S → A → R, S′) | All RL algorithms |
| MDP & Environment | Markov Decision Process (MDP) | Tuple (S, A, P, R, γ) with transition dynamics and reward function | Directed graph (nodes = states, labeled edges = actions with P(s′\|s,a) and R(s,a,s′)) | Foundational theory, all model-based methods |
| MDP & Environment | State Transition Graph | Full probabilistic transitions between discrete states | Graph diagram with probability-weighted arrows | Gridworld, Taxi, Cliff Walking |
| MDP & Environment | Trajectory / Episode Sequence | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Linear timeline or chain diagram | Monte Carlo, episodic tasks |
| MDP & Environment | Continuous State/Action Space Visualization | High-dimensional spaces (e.g., robot joints, pixel inputs) | 2D/3D scatter plots, density heatmaps, or manifold projections | Continuous-control tasks (MuJoCo, PyBullet) |
| MDP & Environment | Reward Function / Landscape | Scalar reward as a function of state/action | 3D surface plot, contour plot, or heatmap | All algorithms; especially reward shaping |
| MDP & Environment | Discount Factor (γ) Effect | How future rewards are weighted | Line plot of geometric decay series or cumulative return curves for different γ | All discounted MDPs |
| Value & Policy | State-Value Function V(s) | Expected return from state s under policy π | Heatmap (gridworld), 3D surface plot, or contour plot | Value-based methods |
| Value & Policy | Action-Value Function Q(s,a) | Expected return from a state-action pair | Q-table (discrete) or heatmap per action; 3D surface for continuous spaces | Q-learning family |
| Value & Policy | Policy π(s) or π(a\|s) | Deterministic or stochastic mapping from states to actions | Arrow overlays on a grid (optimal policy), probability bar charts, or softmax heatmaps | All policy-based methods |
| Value & Policy | Advantage Function A(s,a) | Q(s,a) – V(s) | Comparative bar/heatmap or signed surface plot | A2C, PPO, SAC, TD3 |
| Value & Policy | Optimal Value Function V* / Q* | Solution of the Bellman optimality equation | Heatmap or surface with arrows showing the greedy policy | Value iteration, Q-learning |
| Dynamic Programming | Policy Evaluation Backup | Iterative update of V using the Bellman expectation equation | Backup diagram (current state points to all successor states with probabilities) | Policy iteration |
| Dynamic Programming | Policy Improvement | Greedy policy update over Q | Arrow diagram showing the before/after policy on a grid | Policy iteration |
| Dynamic Programming | Value Iteration Backup | Update using the Bellman optimality equation | Single backup diagram (max over actions) | Value iteration |
| Dynamic Programming | Policy Iteration Full Cycle | Evaluation → improvement loop | Multi-step flowchart or convergence plot (error vs. iterations) | Classic DP methods |
| Monte Carlo | Monte Carlo Backup | Update using the full episode return G_t | Backup diagram (leaf node = actual return G_t) | First-visit / every-visit MC |
| Monte Carlo | Monte Carlo Tree Search (MCTS) | Search tree built by selection, expansion, simulation, and backpropagation | Full tree diagram with visit counts and value bars | AlphaGo, AlphaZero |
| Monte Carlo | Importance Sampling Ratio | Off-policy correction ρ = π(a\|s)/b(a\|s) | Flow diagram showing weight multiplication along a trajectory | Off-policy MC |
| Temporal Difference | TD(0) Backup | Bootstrapped update using R + γV(s′) | One-step backup diagram | TD learning |
| Temporal Difference | Bootstrapping (general) | Using an estimated future value instead of the full return | Layered backup diagram showing estimate ← estimate | All TD methods |
| Temporal Difference | n-step TD Backup | Multi-step return G_t^{(n)} | Multi-step backup diagram with n arrows | n-step TD, TD(λ) |
| Temporal Difference | TD(λ) & Eligibility Traces | Decaying trace z_t for credit assignment | Trace-decay curve or accumulating/replacing trace diagram | TD(λ), SARSA(λ), Q(λ) |
| Temporal Difference | SARSA Update | On-policy TD control | Backup diagram identical to TD(0) but using the next action from the current policy | SARSA |
| Temporal Difference | Q-Learning Update | Off-policy TD control | Backup diagram using max_{a′} Q(s′,a′) | Q-learning, Deep Q-Network (DQN) |
| Temporal Difference | Expected SARSA | Expectation over the next action under the policy | Backup diagram with a weighted sum over actions | Expected SARSA |
| Temporal Difference | Double Q-Learning / Double DQN | Two separate Q estimators to reduce overestimation bias | Dual-network backup diagram | Double DQN, TD3 |
| Temporal Difference | Dueling DQN Architecture | Separate streams for state value V(s) and advantage A(s,a) | Neural-network diagram with two heads merging into Q | Dueling DQN |
| Temporal Difference | Prioritized Experience Replay | Sampling of transitions weighted by TD error | Priority-queue diagram or histogram of priorities | Prioritized DQN, Rainbow |
| Temporal Difference | Rainbow DQN Components | All DQN extensions combined (Double, Dueling, PER, etc.) | Composite architecture diagram | Rainbow DQN |
| Function Approximation | Linear Function Approximation | Feature vector φ(s) → wᵀφ(s) | Weight-vector diagram or basis-function plots | Tabular → linear FA |
| Function Approximation | Neural Network Layers (MLP, CNN, RNN, Transformer) | Full deep network for value/policy | Layer-by-layer architecture diagram with activation shapes | DQN, A3C, PPO, Decision Transformer |
| Function Approximation | Computation Graph / Backpropagation Flow | Gradient flow through the network | Directed acyclic graph (DAG) of operations | All deep RL |
| Function Approximation | Target Network | Frozen copy of the Q-network for stability | Dual-network diagram with a periodic copy arrow | DQN, DDQN, SAC, TD3 |
| Policy Gradients | Policy Gradient Theorem | ∇_θ J(θ) = E[∇_θ log π(a\|s) ⋅ Â] | Flow diagram from reward → log-prob → gradient | REINFORCE, PG methods |
| Policy Gradients | REINFORCE Update | Monte Carlo policy gradient | Full-trajectory gradient-flow diagram | REINFORCE |
| Policy Gradients | Baseline / Advantage Subtraction | Subtract b(s) to reduce variance | Diagram comparing raw-return vs. advantage-scaled gradients | All modern PG methods |
| Policy Gradients | Trust Region (TRPO) | KL-divergence constraint on the policy update | Constraint-boundary diagram or trust-region circle | TRPO |
| Policy Gradients | Proximal Policy Optimization (PPO) | Clipped surrogate objective | Clip-function plot (min/max bounds) | PPO, PPO-Clip |
| Actor-Critic | Actor-Critic Architecture | Separate or shared actor (policy) and critic (value) networks | Dual-network diagram with optional shared backbone | A2C, A3C, SAC, TD3 |
| Actor-Critic | Advantage Actor-Critic (A2C/A3C) | Synchronous/asynchronous multi-worker training | Multi-threaded diagram with a global parameter server | A2C/A3C |
| Actor-Critic | Soft Actor-Critic (SAC) | Entropy-regularized policy plus twin critics | Architecture diagram with the entropy bonus shown as an extra term | SAC |
| Actor-Critic | Twin Delayed DDPG (TD3) | Twin critics, delayed policy updates, and target-policy smoothing | Three-network diagram (actor + two critics) | TD3 |
| Exploration | ε-Greedy Strategy | Random action with probability ε, greedy action otherwise | Decay-curve plot (ε vs. episodes) | DQN family |
| Exploration | Softmax / Boltzmann Exploration | Temperature τ controls the randomness of the softmax over action values | Temperature-decay curve or probability surface | Softmax policies |
| Exploration | Upper Confidence Bound (UCB) | Optimism in the face of uncertainty | Confidence-bound bars on action values | UCB1, bandits |
| Exploration | Intrinsic Motivation / Curiosity | Prediction error used as intrinsic reward | Separate intrinsic-reward module diagram | ICM, RND, curiosity-driven RL |
| Exploration | Entropy Regularization | Bonus term αH(π) | Entropy plot or bonus curve | SAC, maximum-entropy RL |
| Hierarchical RL | Options Framework | High-level policy over options (temporally extended actions) | Hierarchical diagram with an option-policy layer | Option-Critic |
| Hierarchical RL | Feudal Networks / Hierarchical Actor-Critic | Manager-worker hierarchy | Multi-level network diagram | Feudal RL |
| Hierarchical RL | Skill Discovery | Unsupervised emergence of reusable skills | Skill-embedding-space visualization | DIAYN, VALOR |
| Model-Based RL | Learned Dynamics Model | P̂(s′\|s,a) or world model | Separate model-network diagram (often an RNN or transformer) | Dyna, MBPO, Dreamer |
| Model-Based RL | Model-Based Planning | Rollouts inside the learned model | Tree or rollout diagram inside the model | MuZero, DreamerV3 |
| Model-Based RL | Imagination-Augmented Agents (I2A) | Imagination module + policy | Imagination-rollout diagram | I2A |
| Offline RL | Offline Dataset | Fixed batch of trajectories | Replay-buffer diagram (no interaction arrow) | BC, CQL, IQL |
| Offline RL | Conservative Q-Learning (CQL) | Penalty on out-of-distribution actions | Q-value regularization diagram | CQL |
| Multi-Agent RL | Multi-Agent Interaction Graph | Agents communicating or competing | Graph with nodes = agents, edges = communication | MARL, MADDPG |
| Multi-Agent RL | Centralized Training, Decentralized Execution (CTDE) | Shared critic during training, local actors at execution | Dual-view diagram (central critic vs. local actors) | QMIX, VDN, MADDPG |
| Multi-Agent RL | Cooperative / Competitive Payoff Matrix | Joint reward for multiple agents | Heatmap matrix of joint rewards | Prisoner's Dilemma, multi-agent gridworlds |
| Inverse RL / IRL | Reward Inference | Infer the reward from expert demonstrations | Demonstration trajectory → inferred reward heatmap | IRL, GAIL |
| Inverse RL / IRL | Generative Adversarial Imitation Learning (GAIL) | Discriminator vs. policy generator | GAN-style diagram adapted for trajectories | GAIL, AIRL |
| Meta-RL | Meta-RL Architecture | Outer loop (meta-policy) + inner loop (task adaptation) | Nested-loop diagram | MAML for RL, RL² |
| Meta-RL | Task Distribution Visualization | Multiple MDPs sampled from a meta-distribution | Grid of task environments or embedding space | Meta-RL benchmarks |
| Advanced / Misc | Experience Replay Buffer | Stored (s, a, r, s′, done) tuples | FIFO-queue or prioritized-sampling diagram | DQN and all off-policy deep RL |
| Advanced / Misc | State Visitation / Occupancy Measure | Frequency of visiting each state | Heatmap or density plot | All algorithms (analysis) |
| Advanced / Misc | Learning Curve | Average episodic return vs. episodes/steps | Line plot with confidence bands | Standard performance reporting |
| Advanced / Misc | Regret / Cumulative Regret | Accumulated sub-optimality relative to the best action | Cumulative-sum plot | Bandits and online RL |
| Advanced / Misc | Attention Mechanisms (Transformers in RL) | Attention weights over trajectory tokens | Attention heatmap or token highlighting | Decision Transformer, Trajectory Transformer |
| Advanced / Misc | Diffusion Policy | Denoising diffusion process for action generation | Step-by-step denoising-trajectory diagram | Diffusion-RL policies |
| Advanced / Misc | Graph Neural Networks for RL | Node/edge message passing | Graph-convolution diagram | Graph RL, relational RL |
| Advanced / Misc | World Model / Latent Space | Encoder-decoder dynamics in latent space | Encoder → latent → decoder diagram | Dreamer, PlaNet |
| Advanced / Misc | Convergence Analysis Plots | Error / value change over iterations | Log-scale convergence curves | DP, TD, value iteration |

This table collects the standard, widely published graphical components of reinforcement learning: foundational theory, classic algorithms, deep RL extensions, modern variants, and analysis tools. It draws on Sutton & Barto (2nd ed.), the major deep RL papers (DQN through DreamerV3), and common visualization practice in the literature. Minimal code sketches for several of the rows (discounted return, value iteration, importance sampling, TD(0), the TD-control targets, eligibility traces, REINFORCE with a baseline, PPO clipping, exploration rules, and the replay buffer) follow below.
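To make the "Discount Factor (γ) Effect" row concrete, here is a minimal sketch (standard library only; the constant reward stream is invented for illustration) showing how γ weights future rewards in the return G₀:

```python
# Compute G_0 = sum_t gamma^t * r_{t+1} for a made-up constant reward stream.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):      # backward pass: G_t = r + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0] * 100                # hypothetical stream of +1 per step
for gamma in (0.0, 0.5, 0.9, 0.99):
    limit = 1.0 / (1.0 - gamma)      # infinite-horizon value of this stream
    print(f"gamma={gamma}: G_0 = {discounted_return(rewards, gamma):.2f} "
          f"(limit {limit:.2f})")
```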
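The "Value Iteration Backup" row is a few lines of code on a discrete MDP. The two-state MDP below is hypothetical, made up purely to have something to iterate on; the update itself is the standard Bellman optimality backup, and the last step reads off the greedy policy:

```python
# Hypothetical 2-state MDP invented for illustration; P[s][a] is a list of
# (probability, next_state, reward) outcome triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma, theta = 0.9, 1e-8
V = {s: 0.0 for s in P}

# Bellman optimality backup: V(s) <- max_a sum_s' p * (r + gamma * V(s'))
while True:
    delta = 0.0
    for s in P:
        v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in P[s].values())
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# Read off the greedy policy with respect to the converged V.
greedy = {s: max(P[s], key=lambda a, s=s: sum(p * (r + gamma * V[s2])
                                              for p, s2, r in P[s][a]))
          for s in P}
print(V, greedy)
```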
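For the "Importance Sampling Ratio" row, the off-policy weight of a trajectory is the running product of per-step ratios ρ = π(a|s)/b(a|s). The target policy `pi`, behavior policy `b`, and trajectory below are hypothetical stand-ins:

```python
# The trajectory weight is the product of per-step ratios pi(a|s) / b(a|s).
def importance_weight(trajectory, pi, b):
    w = 1.0
    for s, a in trajectory:
        w *= pi(s, a) / b(s, a)
    return w

pi = lambda s, a: 0.9 if a == "right" else 0.1   # near-greedy target policy
b = lambda s, a: 0.5                             # uniform behavior policy
traj = [("s0", "right"), ("s1", "right"), ("s2", "left")]
print(importance_weight(traj, pi, b))            # (0.9/0.5)**2 * (0.1/0.5) = 0.648
```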
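The "TD(0) Backup" and "Bootstrapping" rows reduce to one rule: nudge V(s) toward the bootstrapped target r + γV(s′). A minimal sketch, using the five-state random walk of Sutton & Barto (Example 6.2) as the environment:

```python
import random

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    target = r if done else r + gamma * V[s_next]   # bootstrapped target
    V[s] += alpha * (target - V[s])                 # step toward it

# Five-state random walk: start in the middle, step left/right uniformly,
# reward +1 only for exiting on the right.
V = [0.0] * 5
for _ in range(5000):
    s = 2
    while True:
        s_next = s + random.choice((-1, 1))
        done = s_next < 0 or s_next > 4
        r = 1.0 if s_next > 4 else 0.0
        td0_update(V, s, r, s_next if not done else 0, done)
        if done:
            break
        s = s_next

print([round(v, 2) for v in V])   # approaches the true values 1/6 .. 5/6
```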
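The "SARSA Update", "Q-Learning Update", and "Expected SARSA" rows all form a TD target from the same transition but evaluate the next state differently. The Q-values and ε-greedy policy below are toy numbers chosen so the three targets visibly differ:

```python
# Toy Q-table and transition; values are invented so the three targets differ.
Q = {"s1": {"left": 0.0, "right": 1.0}}
eps, gamma, r, a_next = 0.1, 0.9, 0.5, "left"

def sarsa_target():
    return r + gamma * Q["s1"][a_next]            # uses the action actually taken

def q_learning_target():
    return r + gamma * max(Q["s1"].values())      # greedy max over next actions

def pi(a):                                        # epsilon-greedy pi(a|s1)
    best = max(Q["s1"], key=Q["s1"].get)
    n = len(Q["s1"])
    return (1 - eps) + eps / n if a == best else eps / n

def expected_sarsa_target():                      # expectation under pi
    return r + gamma * sum(pi(a) * q for a, q in Q["s1"].items())

print("SARSA:", sarsa_target())                    # 0.5   (next action is 'left')
print("Q-learning:", q_learning_target())          # 1.4   (bootstraps on the max)
print("Expected SARSA:", expected_sarsa_target())  # 1.355 (weighted by pi)
```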
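A sketch of the accumulating traces behind the "TD(λ) & Eligibility Traces" row; the starting value estimates are arbitrary toy numbers:

```python
# Every visited state keeps a decaying eligibility z[i]; each TD error then
# updates all states in proportion to their eligibility.
def td_lambda_step(V, z, s, r, s_next, done, alpha=0.1, gamma=1.0, lam=0.9):
    delta = (r if done else r + gamma * V[s_next]) - V[s]   # one-step TD error
    z[s] += 1.0                                             # accumulating trace
    for i in range(len(V)):
        V[i] += alpha * delta * z[i]    # credit assignment along the trace
        z[i] *= gamma * lam             # geometric trace decay

V = [0.0, 0.0, 0.2, 0.6, 0.0]           # toy value estimates
z = [0.0] * 5
td_lambda_step(V, z, s=2, r=0.0, s_next=3, done=False)
print(V, z)                             # V[2] nudged toward V[3]; z[2] decayed to 0.9
```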
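The "Policy Gradient Theorem", "REINFORCE Update", and "Baseline / Advantage Subtraction" rows combine into a few lines when the policy is a softmax over preferences. The two-armed bandit below is a stand-in environment (real REINFORCE uses full episode returns); for a softmax policy, ∇_θ log π(a) is one-hot(a) minus π:

```python
import numpy as np

# Hypothetical 2-armed bandit as a stand-in environment; arm 1 pays more
# on average. theta holds softmax preferences over the two actions.
rng = np.random.default_rng(0)
theta = np.zeros(2)
baseline = 0.0                                   # running-mean baseline b

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                   # sample a ~ pi_theta
    G = rng.normal(1.0 if a == 1 else 0.0, 1.0)  # return of this one-step episode
    grad_log_pi = -probs                         # softmax score: one-hot(a) - pi
    grad_log_pi[a] += 1.0
    theta += 0.1 * grad_log_pi * (G - baseline)  # REINFORCE step with advantage
    baseline += 0.01 * (G - baseline)            # move baseline toward mean G

print(softmax(theta))                            # mass should concentrate on arm 1
```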
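For the "Proximal Policy Optimization (PPO)" row, the clipped surrogate is easiest to see numerically; the probability ratios and advantages below are made-up batch values:

```python
import numpy as np

# PPO-Clip surrogate for a batch of ratios r = pi_new(a|s) / pi_old(a|s)
# and advantage estimates A.
def ppo_clip_objective(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()   # pessimistic (min) bound

ratio = np.array([0.5, 0.9, 1.0, 1.3, 2.0])
adv = np.array([1.0, -1.0, 0.5, 1.0, 1.0])
print(ppo_clip_objective(ratio, adv))
# Large ratios with positive advantage are capped at 1 + eps, so the policy
# gains nothing from moving further than the clip range allows.
```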
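The "ε-Greedy Strategy", "Softmax / Boltzmann Exploration", and "Upper Confidence Bound (UCB)" rows, as three small action-selection rules over toy value estimates and visit counts:

```python
import math, random

# Toy action-value estimates Q and visit counts N shared by the three rules.
Q = [0.2, 0.5, 0.1]
N = [10, 2, 5]
t = sum(N)                                        # total pulls so far

def eps_greedy(eps):
    if random.random() < eps:
        return random.randrange(len(Q))           # explore uniformly
    return max(range(len(Q)), key=Q.__getitem__)  # otherwise exploit

def boltzmann(tau):
    prefs = [math.exp(q / tau) for q in Q]        # lower tau -> greedier
    return random.choices(range(len(Q)), weights=prefs)[0]

def ucb1(c=2.0):
    return max(range(len(Q)),                     # value + exploration bonus
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

eps = max(0.05, 1.0 * 0.995 ** 1000)              # annealed epsilon with a floor
print(eps_greedy(eps), boltzmann(tau=0.5), ucb1())
```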
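Finally, a sketch of the plumbing in the "Experience Replay Buffer" and "Target Network" rows: a FIFO buffer of transitions plus a periodically synced target copy. The "networks" here are plain parameter lists rather than real neural nets, just to show the data flow:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)   # old transitions fall off the left

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

online_params = [0.0, 1.0]                  # stand-in for the online Q-network
target_params = list(online_params)         # frozen copy used for TD targets

buffer = ReplayBuffer()
for t in range(100):
    buffer.push(s=t, a=0, r=1.0, s_next=t + 1, done=False)
    if t % 50 == 0:
        target_params = list(online_params)  # periodic hard copy every C steps

print(len(buffer.buf), buffer.sample(4)[0])
```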