| title: DT-Explorer | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| pinned: false | |
| # DT-Circuits: Mechanistic Interpretability for Decision Transformers | |
| DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents. | |
| **Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer) | |
| --- | |
| ## Table of Contents | |
| - [Core Objectives](#core-objectives) | |
| - [Technical Overview](#technical-overview) | |
| - [Capabilities](#capabilities) | |
| - [Project Structure](#project-structure) | |
| - [Configuration](#configuration) | |
| - [Execution Modes: Installation and Usage](#execution-modes-installation-and-usage) | |
| - [Documentation](#documentation) | |
| - [Foundational Research & References](#foundational-research--references) | |
| - [Citation](#citation) | |
| - [License](#license) | |
| --- | |
| ## Core Objectives | |
| 1. **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits. | |
| 2. **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors. | |
| 3. **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream. | |
| 4. **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations. | |
| --- | |
| ## Technical Overview | |
| The framework centers on `HookedDT`, a Decision Transformer implementation that supports activation hooking and cache management. | |
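A minimal sketch of what activation hooking and caching make possible, written against a toy NumPy network rather than `HookedDT` itself. The class `ToyHookedModel`, the hook-point names `hidden` and `logits`, and the `forward(x, hooks, cache)` signature are assumptions made for illustration and are not the framework's API. The example caches activations from a clean run and then patches one of them into a corrupted run:

```python
import numpy as np


class ToyHookedModel:
    """Two-layer toy network with named hook points (hypothetical, not HookedDT)."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(4, 8))
        self.w2 = rng.normal(size=(8, 3))

    def forward(self, x, hooks=None, cache=None):
        hooks = hooks or {}

        def hook_point(name, value):
            # A hook may replace the activation before it flows onward.
            if name in hooks:
                value = hooks[name](value)
            # The cache records whatever value actually flowed onward.
            if cache is not None:
                cache[name] = value.copy()
            return value

        hidden = hook_point("hidden", np.tanh(x @ self.w1))
        return hook_point("logits", hidden @ self.w2)


model = ToyHookedModel()
rng = np.random.default_rng(1)
clean_x = rng.normal(size=(1, 4))
corrupt_x = rng.normal(size=(1, 4))

# 1. Clean run: store every hooked activation.
clean_cache = {}
clean_logits = model.forward(clean_x, cache=clean_cache)

# 2. Corrupted run without intervention, for comparison.
corrupt_logits = model.forward(corrupt_x)

# 3. Patched run: corrupted input, but "hidden" is overwritten with the clean value.
patched_logits = model.forward(
    corrupt_x, hooks={"hidden": lambda value: clean_cache["hidden"]}
)
```

Because `hidden` is the only path to the logits in this toy model, patching it restores the clean output entirely. In a real transformer, patching a single head or layer typically restores only part of the behavior, and that fraction is the quantity of interest.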
| ### Information Flow Diagram | |
| ```mermaid | |
| graph TD | |
| subgraph Input_Sequence | |
| S[State Tokens] | |
| A[Action Tokens] | |
| RTG[Reward-to-Go Tokens] | |
| end | |
| Input_Sequence --> Embed[Embedding Layers] | |
| Embed --> Hooks[Activation Hooks] | |
| subgraph Transformer_Block | |
| Hooks --> Attn[Multi-Head Attention] | |
| Attn --> MLP[MLP Layers] | |
| MLP --> Res[Residual Stream] | |
| end | |
| Res --> DLA[Direct Logit Attribution] | |
| Res --> SAE[Sparse Autoencoder] | |
| Res --> Output[Action Logits] | |
| subgraph Interpretability_&_Safety | |
| DLA -.-> Analysis | |
| DLA -.-> MAD[Functional Attribution MAD] | |
| SAE -.-> Features | |
| SAE -.-> Auditor[Deceptive Alignment Auditor] | |
| Intervention[Activation Patching] -.-> Hooks | |
| Output & S --> Directer[Dynamic Rejection Steering] | |
| Directer -.-> |Feedback Adjust Alpha| Hooks | |
| end | |
| subgraph Interactive_Surgeon_Dashboard | |
| Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks | |
| Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub] | |
| Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit] | |
| Output -.-> Surgeon | |
| end | |
| ``` | |
| --- | |
| ## Capabilities | |
| ### Causal Mediation and Attribution | |
| * **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output. | |
| * **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior. | |
| * **Path Patching**: Traces how information flows through specific connections between model components. | |
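To make Direct Logit Attribution concrete, here is a small NumPy sketch. It assumes that the residual stream at the final position is the plain sum of each component's output and that logits are a linear read-out through an unembedding matrix `W_U`. It ignores any final layer normalization, and the component names and random values are placeholders rather than anything taken from the framework:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_actions = 16, 4

# Hypothetical per-component writes into the residual stream at the last position.
component_names = ["embed", "L0.attn", "L0.mlp", "L1.attn", "L1.mlp"]
component_outputs = {name: rng.normal(size=d_model) for name in component_names}

# Hypothetical unembedding matrix mapping the residual stream to action logits.
W_U = rng.normal(size=(d_model, n_actions))

# The residual stream is the sum of all component outputs.
residual = sum(component_outputs.values())
logits = residual @ W_U

# DLA: project each component's output through W_U on its own.
dla = {name: out @ W_U for name, out in component_outputs.items()}

# Rank components by how much they push the chosen action's logit.
action = int(np.argmax(logits))
ranked = sorted(dla, key=lambda name: dla[name][action], reverse=True)
for name in ranked:
    print(f"{name:8s} contributes {dla[name][action]:+.3f} to action {action}")
```

Since the read-out is linear, the per-component contributions sum exactly to the full logits. This is what lets DLA assign credit without running any intervention.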
| ### Feature Discovery and Analysis | |
| * **Sparse Autoencoders (SAEs)**: Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity. | |
| * **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition. | |
| * **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task. | |
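As an illustration of the sparse decomposition step, the following is a hedged sketch of a TopK sparse autoencoder forward pass in NumPy. The function name `topk_sae_forward`, the weight shapes, and the ReLU applied to the surviving activations are assumptions. The repository's own SAE code may differ, and training is omitted entirely:

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest entries in each row (order within the k is arbitrary).
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(pre, idx, axis=-1)
    latents = np.zeros_like(pre)
    # Everything outside the top k stays zero; survivors are clipped at zero.
    np.put_along_axis(latents, idx, np.maximum(vals, 0.0), axis=-1)
    recon = latents @ W_dec + b_dec
    return latents, recon


rng = np.random.default_rng(0)
d_model, expansion_factor, k = 16, 4, 8  # illustrative values only
n_latents = d_model * expansion_factor

W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))
W_dec = rng.normal(scale=0.1, size=(n_latents, d_model))
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

x = rng.normal(size=(5, d_model))  # stand-in for residual stream activations
latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print("active latents per row:", np.count_nonzero(latents, axis=-1))
```

The hard limit of `k` active latents per input replaces the L1 penalty used in earlier SAE formulations, so sparsity is set directly rather than tuned through a loss coefficient.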
| ### Behavioral Steering & Safety Auditing | |
| * **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights. | |
| * **Dynamic Rejection Steering (Directer)**: Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions. | |
| * **Deceptive Alignment Auditing**: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms compared under watched and unwatched conditions) and traces the circuit of attention heads that activate it. | |
| * **Functional Attribution MAD**: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits. | |
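The feedback loop behind Dynamic Rejection Steering can be sketched roughly as follows. This is an illustrative guess at the control logic, not the implementation in `safety.py`: the function `steer_with_rejection`, the halving schedule, and the `max_illegal_prob` threshold are all hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def steer_with_rejection(residual, steer_vec, W_U, illegal_mask,
                         alpha=4.0, max_illegal_prob=0.3,
                         shrink=0.5, max_iters=10):
    """Shrink the steering coefficient until illegal actions are unlikely enough."""
    for _ in range(max_iters):
        probs = softmax((residual + alpha * steer_vec) @ W_U)
        if probs[illegal_mask].sum() <= max_illegal_prob:
            return alpha, probs
        alpha *= shrink  # steering pushed too far toward illegal actions
    # No acceptable coefficient found: fall back to the unsteered distribution.
    return 0.0, softmax(residual @ W_U)


# Toy setup: dimension 0 feeds action 0, dimension 1 feeds action 2 (illegal).
W_U = np.array([[1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]])
residual = np.array([2.0, 0.0])
steer_vec = np.array([0.0, 1.0])
illegal_mask = np.array([False, False, True])

alpha, probs = steer_with_rejection(residual, steer_vec, W_U, illegal_mask)
print(f"accepted alpha={alpha}, illegal mass={probs[illegal_mask].sum():.3f}")
```

In this toy case the initial coefficient of 4.0 is rejected twice and the loop settles on 1.0, where the illegal action's probability falls below the threshold.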
| ### Interactive Surgical Auditing & Peer Review | |
| * **Interactive Circuit Surgery**: Provides real-time, interactive ablation of nodes (attention heads, MLPs) and communication paths (edges). Ablations are applied dynamically through custom forward hooks. | |
| * **Live Behavioral Audits**: Steps the agent through a live Gymnasium (MiniGrid) environment so that behavioral changes under the currently selected surgical configuration are visible immediately. | |
| * **Neuronpedia Export**: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review. | |
| --- | |
| ## Project Structure | |
| ```text | |
| DT-Circuits/ | |
| ├── src/ | |
| │   ├── dashboard/ | |
| │   │   └── app.py # Streamlit-based visualization UI | |
| │   ├── data/ | |
| │   │   └── harvester.py # PPO-based expert trajectory harvester | |
| │   ├── interpretability/ | |
| │   │   ├── acdc.py # Automated Circuit Discovery logic | |
| │   │   ├── attribution.py # Direct Logit Attribution (DLA) | |
| │   │   ├── circuit_surgeon.py # Interactive node & path ablation engine | |
| │   │   ├── evolution.py # Training Dynamics Analysis | |
| │   │   ├── induction_scan.py # Induction head detection logic | |
| │   │   ├── neuronpedia.py # Neuronpedia publishing client | |
| │   │   ├── nla.py # Natural Language Autoencoder Explainer | |
| │   │   ├── patching.py # Causal activation patching tools | |
| │   │   ├── path_patching.py # Path-based causal intervention engine | |
| │   │   ├── safety.py # Safety auditing, directer, and deceptive alignment tools | |
| │   │   ├── sae_manager.py # SAE deployment and anomaly detection | |
| │   │   ├── steering.py # Steering vector generation and injection | |
| │   │   └── universality.py # Cross-architecture feature mapping | |
| │   ├── models/ | |
| │   │   └── hooked_dt.py # TransformerLens-wrapped Decision Transformer | |
| │   ├── config.py # Centralized hyperparameter management | |
| │   └── utils/ | |
| ├── tests/ # Unit tests for all modules | |
| ├── config.yaml # External hyperparameter storage | |
| ├── requirements.txt | |
| └── docs/ | |
| ``` | |
| --- | |
| ## Configuration | |
| Hyperparameters are managed in two layers, for both ease of use and research reproducibility: | |
| 1. **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code. | |
| 2. **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime. | |
| ### Key Configuration Sections | |
| | Section | Description | Key Parameters | | |
| | :--- | :--- | :--- | | |
| | **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` | | |
| | **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) | | |
| | **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` | | |
| | **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) | | |
| **Example: Independent Data Control** | |
| You can control the amount of data used for general training vs. interpretability separately: | |
| ```yaml | |
| data: | |
| num_episodes: 1000 # Episodes for training the DT teacher | |
| sae: | |
| num_episodes: 500 # Episodes for extracting SAE activations | |
| ``` | |
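The layering of `config.yaml` over dataclass defaults could look roughly like the sketch below. The class names, default values, and the `apply_overrides` helper are hypothetical and are not copied from `src/config.py`. The override dictionary is written inline so the example has no dependencies; in practice it would come from parsing the YAML file, for example with `yaml.safe_load`.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class DataConfig:
    env_id: str = "MiniGrid-Empty-8x8-v0"  # placeholder default
    num_episodes: int = 1000


@dataclass
class SAEConfig:
    expansion_factor: int = 4  # placeholder default
    k: int = 16                # placeholder default
    num_episodes: int = 500


@dataclass
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg, overrides):
    """Recursively copy values from a nested dict onto a dataclass instance."""
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            raise KeyError(f"unknown config key: {key!r}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# Stand-in for the parsed contents of config.yaml.
overrides = {"data": {"num_episodes": 200}, "sae": {"num_episodes": 50}}
cfg = apply_overrides(Config(), overrides)
print(cfg)
```

Rejecting unknown keys means a typo in the YAML file fails loudly instead of silently leaving a default in place, which matters when results need to be reproducible.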
| --- | |
| ## Execution Modes: Installation and Usage | |
| There are two primary ways to run and interact with the **DT-Circuits** framework depending on your research needs: | |
| --- | |
| ### Way 1: Interactive Cloud Demo (Hugging Face Spaces) | |
| For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly: | |
| * **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer) | |
| > [!NOTE] | |
| > **Concise Demo Constraints:** | |
| > * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace. | |
| > * **Sliced Dataset:** Trajectory datasets are sliced down to a lightweight demo set under a **10MB limit** (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to fit storage and memory constraints. | |
| > * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled. | |
| --- | |
| ### Way 2: Clone and Run Locally (Full Pipeline) | |
| For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine. | |
| #### Local Environment Setup | |
| First, clone the repository, set up a virtual environment, and install dependencies: | |
| ```bash | |
| git clone https://github.com/sadhumitha-s/DT-Circuits | |
| cd DT-Circuits | |
| python -m venv venv | |
| source venv/bin/activate | |
| pip install -r requirements.txt | |
| ``` | |
| #### Option 2.1: Simple Workflows via Makefile | |
| The workspace includes a [Makefile](./Makefile) that runs common research pipelines with single commands: | |
| ```bash | |
| make setup # Set up local environment & install requirements | |
| make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training) | |
| make dashboard # Run the Streamlit visualization dashboard locally | |
| ``` | |
| #### Option 2.2: Granular Control via Bash & Python | |
| For research flexibility, execute each step of the pipeline manually using granular terminal scripts: | |
| 1. **Trajectories & Model Training** | |
| Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`): | |
| ```bash | |
| python scripts/train_dt.py | |
| ``` | |
| 2. **TopK Sparse Autoencoder (SAE) Training** | |
| Train sparse autoencoders on target activation layers: | |
| ```bash | |
| python scripts/train_sae.py | |
| ``` | |
| 3. **Interactive Analysis** | |
| Launch the Streamlit visualization engine locally to run audits with custom weights: | |
| ```bash | |
| streamlit run src/dashboard/app.py | |
| ``` | |
| --- | |
| ## Documentation | |
| Detailed technical documentation for specific modules: | |
| * [Circuit Discovery](./docs/circuit_discovery.md) | |
| * [Causal Intervention](./docs/activation_patching.md) | |
| * [SAEs and Steering](./docs/sae_steering.md) | |
| * [Safety Auditing & Steering](./docs/safety_auditing.md) | |
| --- | |
| ## Foundational Research & References | |
| This framework implements and builds upon the following foundational methodologies: | |
| * **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345): Reinforcement learning as sequence modeling. | |
| * **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html): Mathematical foundations of mechanistic interpretability. | |
| * **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997): Algorithmic discovery of subgraphs. | |
| * **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04093) (TopK SAEs). | |
| * **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248): Control via residual stream vector additions. | |
| * **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969): Inter-component causal mediation. | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @software{dt_circuits2026, | |
| author = {Sadhumitha S.}, | |
| title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers}, | |
| year = {2026}, | |
| url = {https://github.com/sadhumitha-s/DT-Circuits} | |
| } | |
| ``` | |
| --- | |
| ## License | |
| Apache 2.0 | |