| Category | Component | Detailed Description | Common Graphical Presentation | Typical Algorithms / Contexts |
|---|---|---|---|---|
| MDP & Environment | Agent-Environment Interaction Loop | Core cycle: observation of state → selection of action → environment transition → receipt of reward + next state | Circular flowchart or block diagram with arrows (S → A → R, S′) | All RL algorithms |
| MDP & Environment | Markov Decision Process (MDP) Tuple | (S, A, P, R, γ) with transition dynamics and reward function | Directed graph (nodes = states, labeled edges = actions with P(s′|s,a) and R(s,a,s′)) | Foundational theory, all model-based methods |
| MDP & Environment | State Transition Graph | Full probabilistic transitions between discrete states | Graph diagram with probability-weighted arrows | Gridworld, Taxi, Cliff Walking |
| MDP & Environment | Trajectory / Episode Sequence | Sequence of (s₀, a₀, r₁, s₁, …, s_T) | Linear timeline or chain diagram | Monte Carlo, episodic tasks |
| MDP & Environment | Continuous State/Action Space Visualization | High-dimensional spaces (e.g., robot joints, pixel inputs) | 2D/3D scatter plots, density heatmaps, or manifold projections | Continuous-control tasks (MuJoCo, PyBullet) |
| MDP & Environment | Reward Function / Landscape | Scalar reward as function of state/action | 3D surface plot, contour plot, or heatmap | All algorithms; especially reward shaping |
| MDP & Environment | Discount Factor (γ) Effect | How future rewards are weighted | Line plot of geometric decay series or cumulative return curves for different γ | All discounted MDPs |
| Value & Policy | State-Value Function V(s) | Expected return from state s under policy π | Heatmap (gridworld), 3D surface plot, or contour plot | Value-based methods |
| Value & Policy | Action-Value Function Q(s,a) | Expected return from state-action pair | Q-table (discrete) or heatmap per action; 3D surface for continuous | Q-learning family |
| Value & Policy | Policy π(s) or π(a|s) | Stochastic or deterministic mapping | Arrow overlays on grid (optimal policy), probability bar charts, or softmax heatmaps | All policy-based methods |
| Value & Policy | Advantage Function A(s,a) | Q(s,a) – V(s) | Comparative bar/heatmap or signed surface plot | A2C, A3C, PPO, GAE-based methods |
| Value & Policy | Optimal Value Function V* / Q* | Solution to Bellman optimality | Heatmap or surface with arrows showing greedy policy | Value iteration, Q-learning |
| Dynamic Programming | Policy Evaluation Backup | Iterative update of V using Bellman expectation | Backup diagram (current state points to all successor states with probabilities) | Policy iteration |
| Dynamic Programming | Policy Improvement | Greedy policy update over Q | Arrow diagram showing before/after policy on grid | Policy iteration |
| Dynamic Programming | Value Iteration Backup | Update using Bellman optimality | Single backup diagram (max over actions) | Value iteration |
| Dynamic Programming | Policy Iteration Full Cycle | Evaluation → Improvement loop | Multi-step flowchart or convergence plot (error vs iterations) | Classic DP methods |
| Monte Carlo | Monte Carlo Backup | Update using full episode return G_t | Backup diagram (leaf node = actual return G_t) | First-visit / every-visit MC |
| Monte Carlo | Monte Carlo Tree Search (MCTS) | Search tree built by selection, expansion, simulation, and backup of values | Full tree diagram with visit counts and value bars | AlphaGo, AlphaZero |
| Monte Carlo | Importance Sampling Ratio | Off-policy correction ρ = π(a|s)/b(a|s) | Flow diagram showing weight multiplication along trajectory | Off-policy MC |
| Temporal Difference | TD(0) Backup | Bootstrapped update using R + γV(s′) | One-step backup diagram | TD learning |
| Temporal Difference | Bootstrapping (general) | Using estimated future value instead of full return | Layered backup diagram showing estimate ← estimate | All TD methods |
| Temporal Difference | n-step TD Backup | Multi-step return G_t^{(n)} | Multi-step backup diagram with n arrows | n-step TD, TD(λ) |
| Temporal Difference | TD(λ) & Eligibility Traces | Decaying trace z_t for credit assignment | Trace-decay curve or accumulating/replacing trace diagram | TD(λ), SARSA(λ), Q(λ) |
| Temporal Difference | SARSA Update | On-policy TD control | Backup diagram identical to TD but using next action from current policy | SARSA |
| Temporal Difference | Q-Learning Update | Off-policy TD control | Backup diagram using max_a′ Q(s′,a′) | Q-learning, Deep Q-Network |
| Temporal Difference | Expected SARSA | Expectation over next action under policy | Backup diagram with weighted sum over actions | Expected SARSA |
| Temporal Difference | Double Q-Learning / Double DQN | Two separate Q estimators to reduce overestimation | Dual-network backup diagram | Double DQN, TD3 |
| Temporal Difference | Dueling DQN Architecture | Separate streams for state value V(s) and advantage A(s,a) | Neural net diagram with two heads merging into Q | Dueling DQN |
| Temporal Difference | Prioritized Experience Replay | Importance sampling of transitions by TD error | Priority queue diagram or histogram of priorities | Prioritized DQN, Rainbow |
| Temporal Difference | Rainbow DQN Components | All extensions combined (Double, Dueling, PER, etc.) | Composite architecture diagram | Rainbow DQN |
| Function Approximation | Linear Function Approximation | Feature vector φ(s) → wᵀφ(s) | Weight vector diagram or basis function plots | Tabular → linear FA |
| Function Approximation | Neural Network Layers (MLP, CNN, RNN, Transformer) | Full deep network for value/policy | Layer-by-layer architecture diagram with activation shapes | DQN, A3C, PPO, Decision Transformer |
| Function Approximation | Computation Graph / Backpropagation Flow | Gradient flow through network | Directed acyclic graph (DAG) of operations | All deep RL |
| Function Approximation | Target Network | Frozen copy of Q-network for stability | Dual-network diagram with periodic copy arrow | DQN, DDQN, SAC, TD3 |
| Policy Gradients | Policy Gradient Theorem | ∇_θ J(θ) = E[∇_θ log π_θ(a|s) ⋅ Q^π(s,a)] | Flow diagram from reward → log-prob → gradient | REINFORCE, PG methods |
| Policy Gradients | REINFORCE Update | Monte-Carlo policy gradient | Full-trajectory gradient flow diagram | REINFORCE |
| Policy Gradients | Baseline / Advantage Subtraction | Subtract b(s) to reduce variance | Diagram comparing raw return vs. advantage-scaled gradient | All modern PG |
| Policy Gradients | Trust Region (TRPO) | KL-divergence constraint on policy update | Constraint boundary diagram or trust-region circle | TRPO |
| Policy Gradients | Proximal Policy Optimization (PPO) | Clipped surrogate objective | Clip function plot (min/max bounds) | PPO, PPO-Clip |
| Actor-Critic | Actor-Critic Architecture | Separate or shared actor (policy) + critic (value) networks | Dual-network diagram with shared backbone option | A2C, A3C, SAC, TD3 |
| Actor-Critic | Advantage Actor-Critic (A2C/A3C) | Synchronous (A2C) or asynchronous (A3C) multi-worker training | Multi-threaded diagram with global parameter server | A2C, A3C |
| Actor-Critic | Soft Actor-Critic (SAC) | Entropy-regularized policy + twin critics | Architecture with entropy bonus term shown as extra input | SAC |
| Actor-Critic | Twin Delayed DDPG (TD3) | Twin critics + delayed policy + target smoothing | Three-network diagram (actor + two critics) | TD3 |
| Exploration | ε-Greedy Strategy | Probability ε of random action | Decay curve plot (ε vs. episodes) | DQN family |
| Exploration | Softmax / Boltzmann Exploration | Temperature τ in softmax | Temperature decay curve or probability surface | Softmax policies |
| Exploration | Upper Confidence Bound (UCB) | Optimism in face of uncertainty | Confidence bound bars on action values | UCB1, bandits |
| Exploration | Intrinsic Motivation / Curiosity | Prediction error as intrinsic reward | Separate intrinsic reward module diagram | ICM, RND, Curiosity-driven RL |
| Exploration | Entropy Regularization | Bonus term αH(π) | Entropy plot or bonus curve | SAC, maximum-entropy RL |
| Hierarchical RL | Options Framework | High-level policy over options (temporally extended actions) | Hierarchical diagram with option policy layer | Option-Critic |
| Hierarchical RL | Feudal Networks / Hierarchical Actor-Critic | Manager-worker hierarchy | Multi-level network diagram | Feudal RL |
| Hierarchical RL | Skill Discovery | Unsupervised emergence of reusable skills | Skill embedding space visualization | DIAYN, VALOR |
| Model-Based RL | Learned Dynamics Model | P̂(s′|s,a) or world model | Separate model network diagram (often RNN or transformer) | Dyna, MBPO, Dreamer |
| Model-Based RL | Model-Based Planning | Rollouts inside learned model | Tree or rollout diagram inside model | MuZero, DreamerV3 |
| Model-Based RL | Imagination-Augmented Agents (I2A) | Imagination module + policy | Imagination rollout diagram | I2A |
| Offline RL | Offline Dataset | Fixed batch of trajectories | Replay buffer diagram (no interaction arrow) | BC, CQL, IQL |
| Offline RL | Conservative Q-Learning (CQL) | Penalty on out-of-distribution actions | Q-value regularization diagram | CQL |
| Multi-Agent RL | Multi-Agent Interaction Graph | Agents communicating or competing | Graph with nodes = agents, edges = communication | MARL, MADDPG |
| Multi-Agent RL | Centralized Training, Decentralized Execution (CTDE) | Shared critic during training | Dual-view diagram (central critic vs. local actors) | QMIX, VDN, MADDPG |
| Multi-Agent RL | Cooperative / Competitive Payoff Matrix | Joint reward for multiple agents | Heatmap matrix of joint rewards | Prisoner's Dilemma, multi-agent gridworlds |
| Inverse RL / IRL | Reward Inference | Infer reward from expert demonstrations | Demonstration trajectory → inferred reward heatmap | IRL, GAIL |
| Inverse RL / IRL | Generative Adversarial Imitation Learning (GAIL) | Discriminator vs. policy generator | GAN-style diagram adapted for trajectories | GAIL, AIRL |
| Meta-RL | Meta-RL Architecture | Outer loop (meta-policy) + inner loop (task adaptation) | Nested loop diagram | MAML for RL, RL² |
| Meta-RL | Task Distribution Visualization | Multiple MDPs sampled from meta-distribution | Grid of task environments or embedding space | Meta-RL benchmarks |
| Advanced / Misc | Experience Replay Buffer | Stored (s,a,r,s′,done) tuples | FIFO queue or prioritized sampling diagram | DQN and all off-policy deep RL |
| Advanced / Misc | State Visitation / Occupancy Measure | Frequency of visiting each state | Heatmap or density plot | All algorithms (analysis) |
| Advanced / Misc | Learning Curve | Average episodic return vs. episodes / steps | Line plot with confidence bands | Standard performance reporting |
| Advanced / Misc | Regret / Cumulative Regret | Sub-optimality accumulated | Cumulative sum plot | Bandits and online RL |
| Advanced / Misc | Attention Mechanisms (Transformers in RL) | Attention weights | Attention heatmap or token highlighting | Decision Transformer, Trajectory Transformer |
| Advanced / Misc | Diffusion Policy | Denoising diffusion process for action generation | Step-by-step denoising trajectory diagram | Diffusion-RL policies |
| Advanced / Misc | Graph Neural Networks for RL | Node/edge message passing | Graph convolution diagram | Graph RL, relational RL |
| Advanced / Misc | World Model / Latent Space | Encoder-decoder dynamics in latent space | Encoder → latent → decoder diagram | Dreamer, PlaNet |
| Advanced / Misc | Convergence Analysis Plots | Error / value change over iterations | Log-scale convergence curves | DP, TD, value iteration |
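To make the "Discount Factor (γ) Effect" and "Trajectory / Episode Sequence" rows concrete, here is a minimal sketch (the function name and reward sequence are illustrative) of how γ reweights the same episode's rewards:

```python
# Illustrative sketch: computing the discounted return G_0 = sum_t gamma^t * r_t
# for a fixed reward sequence under two different discount factors.

def discounted_return(rewards, gamma):
    """Backward accumulation: G = r + gamma * G, starting from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))   # undiscounted sum -> 4.0
print(discounted_return(rewards, 0.5))   # 1 + 0.5 + 0.25 + 0.125 -> 1.875
```

Plotting `discounted_return` for several γ values over increasing horizons reproduces the geometric-decay curves the row describes.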
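The Bellman optimality backup in the "Value Iteration Backup" row can be sketched on a toy MDP; the two-state transition table below is invented purely for illustration:

```python
# Minimal value iteration on a made-up 2-state, 2-action MDP.
# P[s][a] = list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(200):  # repeated Bellman optimality backups: V <- max_a E[r + gamma V(s')]
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
# Always taking action 1 earns reward 1 per step, so V*(s) = 1 / (1 - gamma) = 10.
```

Logging `max |V_new - V_old|` per sweep gives exactly the log-scale convergence curve mentioned in the "Convergence Analysis Plots" row.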
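The "TD(λ) & Eligibility Traces" row corresponds to a short update rule; this is a hedged sketch with accumulating traces, where the step sizes and the two-state value table are illustrative:

```python
# One TD(lambda) step with accumulating eligibility traces:
# every recently visited state shares credit for the current TD error.
def td_lambda_step(V, z, s, r, s2, alpha=0.1, gamma=0.9, lam=0.8):
    delta = r + gamma * V[s2] - V[s]   # one-step TD error
    z[s] += 1.0                        # accumulate the trace for the current state
    for st in V:
        V[st] += alpha * delta * z[st] # update all states in proportion to their trace
        z[st] *= gamma * lam           # decay every trace by gamma * lambda

V = {0: 0.0, 1: 0.0}
z = {0: 0.0, 1: 0.0}
td_lambda_step(V, z, 0, 1.0, 1)
# V[0] -> 0.1 (alpha * delta * z), z[0] -> 0.72 (decayed by gamma * lam = 0.72)
```

The decaying `z` values are what the trace-decay curve in that row visualizes.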
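The "Q-Learning Update" and "ε-Greedy Strategy" rows combine into a few lines of code; this sketch uses a tiny invented two-state setup, with hyperparameters chosen only for illustration:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit greedily

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy target uses the max over next actions: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = [0, 1]
Q = {(s, a): 0.0 for s in range(2) for a in actions}
q_update(Q, 0, 1, 1.0, 1, actions)   # one transition (s=0, a=1, r=1, s'=1)
# Q[(0, 1)] moves alpha of the way toward the target: 0.1 * 1.0 = 0.1
```

Replacing `max(...)` in the target with the Q-value of the next action actually taken turns this into the SARSA update from the adjacent row.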
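The clip-function plot in the "Proximal Policy Optimization (PPO)" row comes from the clipped surrogate objective; here is a scalar sketch for a single (ratio, advantage) pair, whereas real implementations vectorize this over a batch:

```python
# PPO-Clip surrogate for one sample:
# ratio = pi_theta(a|s) / pi_theta_old(a|s), advantage = A_hat(s, a).
def ppo_clip_objective(ratio, advantage, eps=0.2):
    clipped = max(min(ratio, 1 + eps), 1 - eps)   # clamp ratio to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

print(ppo_clip_objective(1.5, 1.0))    # positive advantage, ratio clipped: 1.2
print(ppo_clip_objective(0.5, -1.0))   # negative advantage, ratio clipped: -0.8
```

The `min` makes the objective pessimistic: large ratio moves stop improving the objective, which is exactly the flat region shown in the usual clip plot.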
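The temperature τ in the "Softmax / Boltzmann Exploration" row controls a simple transformation of action values; a minimal sketch (function name and Q-values illustrative):

```python
import math

def boltzmann_probs(qvalues, tau):
    # Softmax over action values with temperature tau:
    # small tau approaches greedy, large tau approaches uniform.
    m = max(q / tau for q in qvalues)            # subtract the max for numerical stability
    exps = [math.exp(q / tau - m) for q in qvalues]
    z = sum(exps)
    return [e / z for e in exps]

print(boltzmann_probs([1.0, 2.0], tau=1.0))     # higher-valued action is more likely
print(boltzmann_probs([1.0, 2.0], tau=100.0))   # near-uniform at high temperature
```

Sweeping τ and plotting the resulting probabilities produces the probability surface the row describes.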
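Finally, the FIFO queue in the "Experience Replay Buffer" row maps directly onto a deque; a minimal sketch, with the capacity and field order as illustrative choices:

```python
import collections
import random

class ReplayBuffer:
    """Uniform-sampling FIFO replay buffer storing (s, a, r, s', done) tuples."""
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)  # oldest tuple evicted first
    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))
    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)     # uniform, without replacement
    def __len__(self):
        return len(self.buf)

rb = ReplayBuffer(capacity=2)
rb.push(0, 1, 1.0, 1, False)
rb.push(1, 0, 0.0, 0, False)
rb.push(0, 1, 1.0, 1, True)   # exceeds capacity, so the first tuple is evicted
print(len(rb))                 # -> 2
```

Prioritized replay (the "Prioritized Experience Replay" row) replaces the uniform `sample` with sampling proportional to TD error.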
This table collects the standard, widely published graphical components of reinforcement learning: foundational theory, classic algorithms, deep RL extensions, modern variants, and analysis tools. It draws on Sutton & Barto (2nd ed.), the major deep RL papers (DQN through DreamerV3), and common visualization practice in the literature, and aims to cover every component routinely shown as a backup diagram, flowchart, architecture sketch, heatmap, or plot. If you want the actual image/diagram for any row, or a deeper dive into one, just specify the row!