title: DT-Explorer
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
DT-Circuits: Mechanistic Interpretability for Decision Transformers
DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
Live Interactive Demo: DT-Explorer on Hugging Face Spaces
Table of Contents
- Core Objectives
- Technical Overview
- Capabilities
- Project Structure
- Installation and Usage
- Documentation
- Foundational Research & References
- Citation
- License
Core Objectives
- Map Information Flow: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
- Causal Verification: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
- Feature Decomposition: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
- Behavioral Control: Modify agent decisions at inference time by manipulating internal activations.
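To make the three token types concrete, here is a minimal sketch of how a reward-to-go sequence can be computed and interleaved with states and actions in the (RTG, State, Action) order used by Decision Transformers. The function names and the toy trajectory are illustrative assumptions, not taken from the DT-Circuits code base.

```python
from typing import List, Tuple


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted reward-to-go: rtg[t] is the sum of rewards from step t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg: List[float], states: list, actions: list) -> List[Tuple[str, object]]:
    """Build the per-timestep (rtg, state, action) token sequence."""
    tokens: List[Tuple[str, object]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", r), ("state", s), ("action", a)])
    return tokens


# Toy sparse-reward trajectory (hypothetical): reward only on the last step.
rewards = [0.0, 0.0, 1.0]
rtg = returns_to_go(rewards)
tokens = interleave(rtg, ["s0", "s1", "s2"], [2, 2, 5])
```

With a sparse terminal reward, every timestep carries the same reward-to-go, which is one reason attribution from RTG tokens is worth inspecting separately from state tokens.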
Technical Overview
The framework centers on HookedDT, a Decision Transformer implementation that exposes activation hooks and manages an activation cache.
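The hook-and-cache idea can be illustrated with a toy model in plain Python. This is a sketch only: `ToyHookedModel`, its hook-point names, and its two "layers" are hypothetical stand-ins and do not reflect the actual HookedDT API. It shows how a cache is filled by read-only hooks and how a hook that returns a value overwrites an activation, which is the basic move behind activation patching.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Hook = Callable[[Vector, str], Optional[Vector]]


class ToyHookedModel:
    """Toy stand-in for a hooked model with two named activation points."""

    HOOK_POINTS = ("layer0.out", "layer1.out")  # hypothetical names

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Hook]] = {name: [] for name in self.HOOK_POINTS}

    def add_hook(self, name: str, fn: Hook) -> None:
        self.hooks[name].append(fn)

    def reset_hooks(self) -> None:
        for fns in self.hooks.values():
            fns.clear()

    def _hook_point(self, name: str, value: Vector) -> Vector:
        # A hook may return None (read-only) or a replacement activation.
        for fn in self.hooks[name]:
            out = fn(value, name)
            if out is not None:
                value = out
        return value

    def forward(self, x: Vector) -> Vector:
        h = self._hook_point("layer0.out", [2.0 * v for v in x])
        return self._hook_point("layer1.out", [v + 1.0 for v in h])

    def run_with_cache(self, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}

        def save(value: Vector, name: str) -> None:
            cache[name] = list(value)  # copy so later edits cannot alter the cache

        for name in self.HOOK_POINTS:
            self.add_hook(name, save)
        try:
            out = self.forward(x)
        finally:
            for name in self.HOOK_POINTS:
                self.hooks[name].remove(save)
        return out, cache


model = ToyHookedModel()
clean_out, clean_cache = model.run_with_cache([1.0, 2.0])
corrupted_out = model.forward([10.0, 20.0])

# Patch the clean layer-0 activation into the corrupted run.
model.add_hook("layer0.out", lambda value, name: clean_cache[name])
patched_out = model.forward([10.0, 20.0])
model.reset_hooks()
```

Because everything downstream of `layer0.out` depends only on that activation in this toy, patching it fully restores the clean output; in a real transformer the restoration is usually partial, and the size of the recovery is the quantity of interest.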
Information Flow Diagram
graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end
    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]
    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end
    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]
    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end
    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end
Capabilities
Causal Mediation and Attribution
- Direct Logit Attribution (DLA): Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
- Activation Patching: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
- Path Patching: Traces how information flows through specific connections between model components.
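Direct Logit Attribution relies on the residual stream being a sum of component outputs, so each component's contribution to a logit is its output projected onto the corresponding unembedding direction. The sketch below illustrates this with hand-picked vectors. The component names and numbers are made up, and the final LayerNorm is ignored for simplicity (real implementations typically have to account for it).

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def direct_logit_attribution(
    component_outputs: Dict[str, List[float]],
    unembed_direction: List[float],
) -> Dict[str, float]:
    """Project each component's residual-stream write onto one action's unembedding direction."""
    return {name: dot(out, unembed_direction) for name, out in component_outputs.items()}


# Hypothetical per-component writes to a 3-dimensional residual stream.
components = {
    "embed": [0.5, 0.0, 0.0],
    "L0H0": [1.0, -1.0, 0.0],
    "L0.mlp": [0.0, 2.0, 1.0],
}
w_action = [1.0, 0.5, -1.0]  # unembedding column for one action (made up)

attr = direct_logit_attribution(components, w_action)

# Sanity check: the attributions sum to the logit computed from the full residual stream.
residual = [sum(vals) for vals in zip(*components.values())]
total_logit = dot(residual, w_action)
```

The closing check is the useful invariant: if the per-component attributions do not add up to the actual logit, some contribution (a bias, a normalization scale, or a missed component) has been left out.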
Feature Discovery and Analysis
- Sparse Autoencoders (SAEs): Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
- Induction Scanning: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
- Automated Circuit Discovery (ACDC): Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
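As a rough illustration of the sparse decomposition step, the following pure-Python sketch runs the forward pass of a TopK-style sparse autoencoder: encode, keep only the `k` largest pre-activations, then decode. The weights are hand-picked toy values, the non-negativity clamp is an added assumption, and none of this mirrors the actual code in `sae_manager.py`.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(matrix: Matrix, vec: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def topk_sae_forward(
    x: List[float],
    w_enc: Matrix,        # shape [n_features][d_model]
    b_enc: List[float],   # shape [n_features]
    w_dec: Matrix,        # shape [d_model][n_features]
    b_dec: List[float],   # shape [d_model], also used to center the input
    k: int,
) -> Tuple[List[float], List[float]]:
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    # Keep the k largest pre-activations and zero out the rest.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    latents = [0.0] * len(pre)
    for i in keep:
        latents[i] = max(pre[i], 0.0)  # assumption: clamp to non-negative
    recon = [r + b for r, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, recon


# Toy setup: d_model = 2, expansion_factor = 2, so 4 features; k = 2.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

latents, recon = topk_sae_forward([3.0, -1.0], w_enc, b_enc, w_dec, b_dec, k=2)
```

Here the dictionary has one feature per signed axis, so two active features are enough to reconstruct the input exactly; with learned dictionaries the reconstruction is approximate and `k` trades sparsity against reconstruction error.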
Behavioral Steering & Safety Auditing
- Activation Steering: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
- Dynamic Rejection Steering (Directer): Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
- Deceptive Alignment Auditing: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms evaluated under watched vs. unwatched conditions) and traces the circuit of attention heads that activate it.
- Functional Attribution MAD: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
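The feedback loop behind dynamic rejection steering can be sketched as follows: apply steering with strength `alpha`, measure how much probability lands on disallowed actions, and shrink `alpha` until that mass falls under a threshold. This is a simplified illustration under stated assumptions: the steering effect is modeled as a direct shift of the logits rather than a residual-stream intervention, and the threshold, shrink factor, and function names are all hypothetical rather than the Directer's actual parameters.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rejection_steer(
    logits_fn: Callable[[float], List[float]],  # maps steering strength to action logits
    alpha: float,
    illegal: Sequence[int],
    max_illegal_mass: float = 0.1,
    shrink: float = 0.5,
    max_iters: int = 10,
) -> Tuple[float, List[float]]:
    """Scale alpha back until the probability mass on illegal actions is acceptable."""
    for _ in range(max_iters):
        probs = softmax(logits_fn(alpha))
        if sum(probs[i] for i in illegal) <= max_illegal_mass:
            return alpha, probs
        alpha *= shrink
    # Fall back to the unsteered policy if no acceptable strength was found.
    return 0.0, softmax(logits_fn(0.0))


# Toy model: steering pushes toward action 1 but also raises illegal action 2.
base = [1.0, 0.0, -2.0]
direction = [0.0, 1.0, 2.0]


def toy_logits(a: float) -> List[float]:
    return [b + a * d for b, d in zip(base, direction)]


alpha, probs = rejection_steer(toy_logits, alpha=4.0, illegal=[2])
```

Geometric shrinking keeps the loop short, at the cost of possibly undershooting the largest safe strength; a bisection search over `alpha` would be a natural refinement.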
Interactive Surgical Auditing & Peer Review
- Interactive Circuit Surgery: Provides real-time interactive ablation tools for nodes (heads, MLPs) and communication paths (edges). Severed pathways are applied to the running model through custom forward hooks.
- Live Behavioral Audits: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
- Neuronpedia Export: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.
Project Structure
DT-Circuits/
├── src/
│   ├── dashboard/
│   │   └── app.py                # Streamlit-based visualization UI
│   ├── data/
│   │   └── harvester.py          # PPO-based expert trajectory harvester
│   ├── interpretability/
│   │   ├── acdc.py               # Automated Circuit Discovery logic
│   │   ├── attribution.py        # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py    # Interactive node & path ablation engine
│   │   ├── evolution.py          # Training Dynamics Analysis
│   │   ├── induction_scan.py     # Induction head detection logic
│   │   ├── neuronpedia.py        # Neuronpedia publishing client
│   │   ├── nla.py                # Natural Language Autoencoder Explainer
│   │   ├── patching.py           # Causal activation patching tools
│   │   ├── path_patching.py      # Path-based causal intervention engine
│   │   ├── safety.py             # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py        # SAE deployment and anomaly detection
│   │   ├── steering.py           # Steering vector generation and injection
│   │   └── universality.py       # Cross-architecture feature mapping
│   ├── models/
│   │   └── hooked_dt.py          # TransformerLens-wrapped Decision Transformer
│   ├── config.py                 # Centralized hyperparameter management
│   └── utils/
├── tests/                        # Unit tests for all modules
├── config.yaml                   # External hyperparameter storage
├── requirements.txt
└── docs/
Configuration
Hyperparameters are managed through a two-part system that supports both ease of use and research reproducibility:
- config.yaml: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
- src/config.py: Defines the underlying structure using Python dataclasses. It automatically loads overrides from config.yaml at runtime.
Key Configuration Sections
| Section | Description | Key Parameters |
|---|---|---|
| model | Architecture settings for the Decision Transformer | n_layers, d_model, n_heads, max_length |
| data | Settings for expert trajectory collection | env_id, num_episodes (for DT training) |
| train | DT training hyperparameters | lr, epochs, seed |
| sae | Sparse Autoencoder training hyperparameters | expansion_factor, k, num_episodes (SAE specific) |
Example: Independent Data Control
You can control the amount of data used for general training vs. interpretability separately:
data:
  num_episodes: 1000 # Episodes for training the DT teacher
sae:
  num_episodes: 500 # Episodes for extracting SAE activations
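One way the dataclass-plus-override pattern could look is sketched below. It is an assumption-laden illustration rather than the contents of `src/config.py`: the class names, default values, and the `apply_overrides` helper are hypothetical, and the overrides are written as a plain dictionary standing in for the parsed contents of config.yaml.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class DataConfig:
    env_id: str = "MiniGrid-Empty-8x8-v0"  # placeholder default, not from the project
    num_episodes: int = 100                # placeholder default


@dataclass
class SAEConfig:
    expansion_factor: int = 4  # placeholder default
    k: int = 16                # placeholder default
    num_episodes: int = 100    # placeholder default


@dataclass
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Recursively copy override values onto a dataclass, rejecting unknown keys."""
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# Stand-in for the parsed YAML example above.
overrides = {"data": {"num_episodes": 1000}, "sae": {"num_episodes": 500}}
cfg = apply_overrides(Config(), overrides)
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled parameter in the YAML file then fails loudly instead of silently falling back to a default, which matters for reproducibility.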
Execution Modes: Installation and Usage
There are two primary ways to run and interact with the DT-Circuits framework depending on your research needs:
Way 1: Interactive Cloud Demo (Hugging Face Spaces)
For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:
- Demo Link: DT-Explorer on Hugging Face Spaces
Demo Constraints:
- CPU-Bound Resources: Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
- Sliced Dataset: Trajectory datasets are dynamically sliced down to a lightweight demo set under a 10 MB limit (defined in deploy.sh) to keep the storage and memory footprint small.
- Read-Only / Ephemeral Container: Uses pre-baked static weights (mini_dt.pt) and pre-trained SAE checkpoints. Training new models or writing persistent state is disabled.
Way 2: Clone and Run Locally (Full Pipeline)
For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.
Local Environment Setup
First, clone the repository, set up a virtual environment, and install dependencies:
git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Option 2.1: Simple Workflows via Makefile
The workspace includes a standardized Makefile to orchestrate common research pipelines with single commands:
make setup # Set up local environment & install requirements
make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard # Run the Streamlit visualization dashboard locally
Option 2.2: Granular Control via Bash & Python
For research flexibility, execute each step of the pipeline manually using granular terminal scripts:
- Trajectories & Model Training: Harvest teacher trajectories and train the target Decision Transformer (HookedDT):
  python scripts/train_dt.py
- TopK Sparse Autoencoder (SAE) Training: Train sparse autoencoders on target activation layers:
  python scripts/train_sae.py
- Interactive Analysis: Launch the Streamlit visualization engine locally to run audits with custom weights:
  streamlit run src/dashboard/app.py
Documentation
Detailed technical documentation for specific modules is kept in the docs/ directory.
Foundational Research & References
This framework implements and builds upon the following foundational methodologies:
- Decision Transformers: Chen et al., 2021 - Reinforcement learning as sequence modeling.
- Transformer Circuits: Elhage et al., 2021 - Mathematical foundations of mechanistic interpretability.
- ACDC (Automated Circuit Discovery): Conmy et al., 2023 - Algorithmic discovery of subgraphs.
- Sparse Autoencoders (SAEs): Bricken et al., 2023 (monosemantic features) & Gao et al., 2024 (TopK SAEs).
- Activation Steering: Turner et al., 2023 - Control via residual stream vector additions.
- Path Patching: Goldowsky-Dill et al., 2023 - Inter-component causal mediation.
Citation
@software{dt_circuits2026,
author = {Sadhumitha S.},
title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
year = {2026},
url = {https://github.com/sadhumitha-s/DT-Circuits}
}
License
Apache 2.0