---
title: DT-Explorer
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# DT-Circuits: Mechanistic Interpretability for Decision Transformers
[Hugging Face Space](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer) | [Python](https://www.python.org/downloads/) | [PyTorch](https://pytorch.org/) | [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) | [TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)
DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
---
## Table of Contents
- [Core Objectives](#core-objectives)
- [Technical Overview](#technical-overview)
- [Capabilities](#capabilities)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [Execution Modes: Installation and Usage](#execution-modes-installation-and-usage)
- [Documentation](#documentation)
- [Foundational Research & References](#foundational-research--references)
- [Citation](#citation)
- [License](#license)
---
## Core Objectives
1. **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
2. **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
3. **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
4. **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations.
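To make the three token types in the first objective concrete, here is a minimal sketch of how a trajectory is typically turned into a Decision Transformer input, following the undiscounted reward-to-go and (R, s, a) ordering described by Chen et al. The function names and the tuple-based token format are illustrative assumptions, not this repository's API.

```python
from typing import Any, List, Tuple


def returns_to_go(rewards: List[float]) -> List[float]:
    """Suffix sums of the rewards: rtg[t] = r[t] + r[t+1] + ... + r[T-1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg: List[float], states: List[Any], actions: List[Any]) -> List[Tuple[str, int, Any]]:
    """Flatten a trajectory into (kind, timestep, value) tokens in R, s, a order."""
    tokens = []
    for t, (g, s, a) in enumerate(zip(rtg, states, actions)):
        tokens += [("rtg", t, g), ("state", t, s), ("action", t, a)]
    return tokens


# A sparse-reward episode: the only reward arrives at the last step.
rewards = [0.0, 0.0, 1.0]
rtg = returns_to_go(rewards)
tokens = interleave(rtg, ["s0", "s1", "s2"], [2, 2, 5])
```

Each timestep contributes three tokens, so attribution results can be grouped by token kind as well as by position.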
---
## Technical Overview
The framework centers on `HookedDT`, a Decision Transformer implementation that exposes activation hooks and manages an activation cache.
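The idea of named hook points plus a cache can be illustrated with a toy model. The sketch below is not the `HookedDT` API: the class `ToyHookedModel`, the hook names, and the `run` signature are all hypothetical, and NumPy stands in for PyTorch to keep the example self-contained.

```python
import numpy as np


class ToyHookedModel:
    """Two-layer residual toy model with named hook points (not the HookedDT API)."""

    def __init__(self, d_model=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(d_model, d_model))
        self.W2 = rng.normal(size=(d_model, d_model))

    def run(self, x, hooks=None):
        """Forward pass. `hooks` maps a hook name to fn(activation) -> activation."""
        hooks = hooks or {}
        cache = {}

        def point(name, value):
            # Apply the user hook (if any), then record what flows onward.
            if name in hooks:
                value = hooks[name](value)
            cache[name] = value
            return value

        resid = point("resid_0", x)
        resid = point("resid_1", resid + np.tanh(resid @ self.W1))
        resid = point("resid_2", resid + np.tanh(resid @ self.W2))
        return resid, cache


model = ToyHookedModel()
clean = np.ones(4)
corrupted = np.zeros(4)

clean_out, clean_cache = model.run(clean)
corrupted_out, _ = model.run(corrupted)

# Activation patching: replay the corrupted input, but overwrite one
# intermediate activation with the value cached from the clean run.
patched_out, _ = model.run(
    corrupted, hooks={"resid_1": lambda _: clean_cache["resid_1"]}
)
```

Because everything downstream of `resid_1` depends only on that activation in this toy, patching it restores the clean output exactly. In a real transformer, patching a single head or position usually restores only part of the behavior, and that partial recovery is the quantity of interest.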
### Information Flow Diagram
```mermaid
graph TD
subgraph Input_Sequence
S[State Tokens]
A[Action Tokens]
RTG[Reward-to-Go Tokens]
end
Input_Sequence --> Embed[Embedding Layers]
Embed --> Hooks[Activation Hooks]
subgraph Transformer_Block
Hooks --> Attn[Multi-Head Attention]
Attn --> MLP[MLP Layers]
MLP --> Res[Residual Stream]
end
Res --> DLA[Direct Logit Attribution]
Res --> SAE[Sparse Autoencoder]
Res --> Output[Action Logits]
subgraph Interpretability_&_Safety
DLA -.-> Analysis
DLA -.-> MAD[Functional Attribution MAD]
SAE -.-> Features
SAE -.-> Auditor[Deceptive Alignment Auditor]
Intervention[Activation Patching] -.-> Hooks
Output & S --> Directer[Dynamic Rejection Steering]
Directer -.-> |Feedback Adjust Alpha| Hooks
end
subgraph Interactive_Surgeon_Dashboard
Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
Output -.-> Surgeon
end
```
---
## Capabilities
### Causal Mediation and Attribution
* **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
* **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
* **Path Patching**: Traces how information flows through specific connections between model components.
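Direct Logit Attribution relies on the residual stream being a sum of component outputs, so the logits decompose linearly across components. The sketch below shows that decomposition with random vectors. It assumes no final LayerNorm between the residual stream and the unembedding, and the component names and shapes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_actions = 8, 3

# Hypothetical per-component writes into the residual stream at the final position.
component_outputs = {
    "embed": rng.normal(size=d_model),
    "L0H0": rng.normal(size=d_model),
    "L0H1": rng.normal(size=d_model),
    "L0_mlp": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, n_actions))  # unembedding to action logits

# The final residual stream is the sum of everything written into it.
final_resid = sum(component_outputs.values())
logits = final_resid @ W_U

# Direct contribution of each component to every action logit.
dla = {name: out @ W_U for name, out in component_outputs.items()}

# Rank components by their direct effect on the chosen action.
chosen = int(np.argmax(logits))
ranking = sorted(dla, key=lambda name: dla[name][chosen], reverse=True)
```

The per-component contributions sum back to the full logits, which is a useful sanity check for any attribution code. When a final LayerNorm is present, a common approach is to freeze its scale from the original forward pass before projecting each component.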
### Feature Discovery and Analysis
* **Sparse Autoencoders (SAEs)**: Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
* **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
* **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
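For readers unfamiliar with TopK SAEs, the forward pass can be sketched in a few lines. This follows the general formulation in Gao et al. (encode, keep the `k` largest pre-activations, decode), but the exact bias handling, the ReLU on surviving features, and every name and shape here are assumptions rather than the code in `sae_manager.py`.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode, keep the k largest pre-activations per row, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc  # shape: (batch, n_features)
    # Column indices of the k largest entries in each row (unordered).
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # Survivors pass through a ReLU so that feature activations are non-negative.
    codes[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = codes @ W_dec + b_dec
    return codes, recon


rng = np.random.default_rng(0)
d_model, expansion_factor, k = 8, 4, 4
n_features = d_model * expansion_factor

W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=(5, d_model))  # stand-in for residual stream activations
codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
mse = float(np.mean((recon - x) ** 2))  # the quantity training would minimize
```

Sparsity is enforced structurally: at most `k` features are active per input, so no L1 penalty is needed to control it.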
### Behavioral Steering & Safety Auditing
* **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
* **Dynamic Rejection Steering (Directer)**: Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
* **Deceptive Alignment Auditing**: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms that behave differently when watched versus unwatched) and traces the circuit of attention heads that activates it.
* **Functional Attribution MAD**: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
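The feedback loop behind dynamic rejection steering can be sketched as follows: add a scaled steering vector to the residual stream, check how much probability lands on illegal actions, and shrink the scale until that mass is acceptable. This is a simplified illustration under assumed names (`steer_with_rejection`, `max_illegal_mass`, `shrink`), not the implementation in `safety.py`.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def steer_with_rejection(resid, steer_vec, W_U, illegal, alpha=4.0,
                         max_illegal_mass=0.05, shrink=0.5, max_iters=20):
    """Add alpha * steer_vec, shrinking alpha until illegal actions are unlikely enough."""
    for _ in range(max_iters):
        probs = softmax((resid + alpha * steer_vec) @ W_U)
        if probs[illegal].sum() <= max_illegal_mass:
            return alpha, probs
        alpha *= shrink
    # No tested alpha met the bound: fall back to the unsteered distribution.
    return 0.0, softmax(resid @ W_U)


# Toy setup: 3 actions, identity unembedding, action 2 is illegal.
W_U = np.eye(3)
resid = np.array([2.0, 0.0, -6.0])
steer_vec = np.array([0.0, 1.0, 2.0])  # boosts action 1, but action 2 even more
illegal = [2]

alpha, probs = steer_with_rejection(resid, steer_vec, W_U, illegal)
```

In this toy case the initial `alpha` of 4.0 puts roughly 11% of the probability mass on the illegal action, so it is halved once and the loop accepts `alpha = 2.0`.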
### Interactive Surgical Auditing & Peer Review
* **Interactive Circuit Surgery**: Provides real-time, interactive ablation of nodes (heads, MLPs) and communication paths (edges). Severed pathways are applied to the running model through custom forward hooks.
* **Live Behavioral Audits**: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
* **Neuronpedia Export**: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.
---
## Project Structure
```text
DT-Circuits/
├── src/
│   ├── dashboard/
│   │   └── app.py                 # Streamlit-based visualization UI
│   ├── data/
│   │   └── harvester.py           # PPO-based expert trajectory harvester
│   ├── interpretability/
│   │   ├── acdc.py                # Automated Circuit Discovery logic
│   │   ├── attribution.py         # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py     # Interactive node & path ablation engine
│   │   ├── evolution.py           # Training Dynamics Analysis
│   │   ├── induction_scan.py      # Induction head detection logic
│   │   ├── neuronpedia.py         # Neuronpedia publishing client
│   │   ├── nla.py                 # Natural Language Autoencoder Explainer
│   │   ├── patching.py            # Causal activation patching tools
│   │   ├── path_patching.py       # Path-based causal intervention engine
│   │   ├── safety.py              # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py         # SAE deployment and anomaly detection
│   │   ├── steering.py            # Steering vector generation and injection
│   │   └── universality.py        # Cross-architecture feature mapping
│   ├── models/
│   │   └── hooked_dt.py           # TransformerLens-wrapped Decision Transformer
│   ├── config.py                  # Centralized hyperparameter management
│   └── utils/
├── tests/                         # Unit tests for all modules
├── config.yaml                    # External hyperparameter storage
├── requirements.txt
└── docs/
```
---
## Configuration
Hyperparameters are managed through a two-part system that balances ease of use with research reproducibility:
1. **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
2. **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.
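The dataclass-plus-overrides pattern described above might look roughly like this. The class names, default values, and the `apply_overrides` helper are hypothetical and are not copied from `src/config.py`. The overrides are given as a plain dict to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class ModelConfig:
    n_layers: int = 2       # illustrative defaults, not the project's values
    d_model: int = 128
    n_heads: int = 4
    max_length: int = 30


@dataclass
class SAEConfig:
    expansion_factor: int = 4
    k: int = 16
    num_episodes: int = 500


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg, overrides):
    """Recursively copy values from a nested dict onto a dataclass instance."""
    known = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in known:
            raise KeyError(f"Unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# In practice this dict would come from parsing config.yaml,
# for example with PyYAML's yaml.safe_load.
overrides = {"model": {"n_layers": 4}, "sae": {"k": 32}}
cfg = apply_overrides(Config(), overrides)
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo in the YAML file fails loudly instead of silently leaving a default in place.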
### Key Configuration Sections
| Section | Description | Key Parameters |
| :--- | :--- | :--- |
| **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
| **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
| **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` |
| **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |
**Example: Independent Data Control**
You can control the amount of data used for general training vs. interpretability separately:
```yaml
data:
num_episodes: 1000 # Episodes for training the DT teacher
sae:
num_episodes: 500 # Episodes for extracting SAE activations
```
---
## Execution Modes: Installation and Usage
There are two primary ways to run and interact with the **DT-Circuits** framework, depending on your research needs:
---
### Option 1: Interactive Cloud Demo (Hugging Face Spaces)
For instant visual exploration, path intervention, and alignment auditing with no local setup, launch the web dashboard directly:
* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
> [!NOTE]
> **Demo Constraints:**
> * **CPU-Bound Resources:** Runs on a standard free-tier CPU instance (2 vCPUs, 16 GB RAM); expensive operations such as ACDC scans may run noticeably slower than on a local GPU workstation.
> * **Sliced Dataset:** Trajectory datasets are sliced down to a lightweight demo set under a **10 MB limit** (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to fit storage and memory constraints.
> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models and writing persistent state are disabled.
---
### Option 2: Clone and Run Locally (Full Pipeline)
For full end-to-end research, custom hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.
#### Local Environment Setup
First, clone the repository, set up a virtual environment, and install dependencies:
```bash
git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
#### Option 2.1: Simple Workflows via Makefile
The workspace includes a [Makefile](./Makefile) that runs common research pipelines with single commands:
```bash
make setup # Set up local environment & install requirements
make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard # Run the Streamlit visualization dashboard locally
```
#### Option 2.2: Granular Control via Bash & Python
For finer control, run each step of the pipeline manually with the individual scripts:
1. **Trajectories & Model Training**
Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
```bash
python scripts/train_dt.py
```
2. **TopK Sparse Autoencoder (SAE) Training**
Train sparse autoencoders on target activation layers:
```bash
python scripts/train_sae.py
```
3. **Interactive Analysis**
Launch the Streamlit visualization engine locally to run audits with custom weights:
```bash
streamlit run src/dashboard/app.py
```
---
## Documentation
Detailed technical documentation for specific modules:
* [Circuit Discovery](./docs/circuit_discovery.md)
* [Causal Intervention](./docs/activation_patching.md)
* [SAEs and Steering](./docs/sae_steering.md)
* [Safety Auditing & Steering](./docs/safety_auditing.md)
---
## Foundational Research & References
This framework implements and builds upon the following foundational methodologies:
* **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345): Reinforcement learning as sequence modeling.
* **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html): Mathematical foundations of mechanistic interpretability.
* **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997): Algorithmic discovery of subgraphs.
* **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
* **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248): Control via residual stream vector additions.
* **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969): Inter-component causal mediation.
---
## Citation
```bibtex
@software{dt_circuits2026,
author = {Sadhumitha S.},
title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
year = {2026},
url = {https://github.com/sadhumitha-s/DT-Circuits}
}
```
---
## License
Apache 2.0