---
title: DT-Explorer
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# DT-Circuits: Mechanistic Interpretability for Decision Transformers
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.x](https://img.shields.io/badge/PyTorch-2.x-red.svg)](https://pytorch.org/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Framework: TransformerLens](https://img.shields.io/badge/Framework-TransformerLens-orange.svg)](https://github.com/TransformerLensOrg/TransformerLens)
DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
---
## Table of Contents
- [Core Objectives](#core-objectives)
- [Technical Overview](#technical-overview)
- [Capabilities](#capabilities)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [Installation and Usage](#execution-modes-installation-and-usage)
- [Documentation](#documentation)
- [Foundational Research & References](#foundational-research--references)
- [Citation](#citation)
- [License](#license)
---
## Core Objectives
1. **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
2. **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
3. **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
4. **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations.
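
For context on the three token types named in objective 1, the sketch below shows how a Decision Transformer input sequence is typically assembled, following Chen et al. (2021): each timestep contributes a return-to-go, a state, and an action, in that order. The function names and the toy trajectory are illustrative assumptions, not code from this repository.

```python
from typing import List, Tuple


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted return-to-go: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the future sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg: List[float], states: List[str], actions: List[int]) -> List[Tuple[str, object]]:
    """Order tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    if not (len(rtg) == len(states) == len(actions)):
        raise ValueError("rtg, states and actions must have the same length")
    tokens: List[Tuple[str, object]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens += [("rtg", r), ("state", s), ("action", a)]
    return tokens


# Toy trajectory (hypothetical values): three steps, reward arrives late.
rewards = [0.0, 1.0, 2.0]
rtg = returns_to_go(rewards)
tokens = interleave(rtg, ["s0", "s1", "s2"], [2, 2, 5])
print(rtg)
print(tokens[:3])
```

Attribution methods in this framework ask how much each of these token positions contributes to the predicted action logits.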
---
## Technical Overview
The framework centers around `HookedDT`, a Decision Transformer implementation that allows for activation hooking and cache management.
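
To illustrate what activation hooking and cache management mean in practice, here is a minimal, dependency-free sketch of the hook-point pattern popularized by TransformerLens. It is not the `HookedDT` implementation: `HookPoint`, `TinyModel`, `run_with_hooks`, and `run_with_cache` are toy stand-ins written for this example, and the "model" is just two arithmetic steps.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
# A hook receives (activation, hook_name). Returning a vector replaces the
# activation; returning None leaves it unchanged.
Hook = Callable[[Vector, str], Optional[Vector]]


class HookPoint:
    """Identity pass-through that lets registered hooks read or replace a value."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hooks: List[Hook] = []

    def __call__(self, value: Vector) -> Vector:
        for hook in self.hooks:
            result = hook(value, self.name)
            if result is not None:
                value = result
        return value


class TinyModel:
    """Toy two-step model with a hook point after each step."""

    def __init__(self) -> None:
        self.hook_resid = HookPoint("resid")
        self.hook_out = HookPoint("out")

    def hook_points(self) -> List[HookPoint]:
        return [self.hook_resid, self.hook_out]

    def forward(self, x: Vector) -> Vector:
        resid = self.hook_resid([2.0 * v for v in x])
        return self.hook_out([v + 1.0 for v in resid])


def run_with_hooks(model: TinyModel, x: Vector, hooks: Dict[str, Hook]) -> Vector:
    """Run once with temporary hooks keyed by hook-point name, then remove them."""
    points = {p.name: p for p in model.hook_points()}
    for name, hook in hooks.items():
        points[name].hooks.append(hook)
    try:
        return model.forward(x)
    finally:
        for name, hook in hooks.items():
            points[name].hooks.remove(hook)


def run_with_cache(model: TinyModel, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
    """Run once while copying every hooked activation into a cache."""
    cache: Dict[str, Vector] = {}

    def save(value: Vector, name: str) -> None:
        cache[name] = list(value)

    output = run_with_hooks(model, x, {p.name: save for p in model.hook_points()})
    return output, cache


model = TinyModel()
output, cache = run_with_cache(model, [1.0, 2.0])
# Zero-ablate the "resid" activation for a single run.
ablated = run_with_hooks(model, [1.0, 2.0], {"resid": lambda value, name: [0.0] * len(value)})
print(output, cache, ablated)
```

The same two primitives, reading activations into a cache and replacing them during a forward pass, underlie the attribution, patching, and steering tools described in this README.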
### Information Flow Diagram
```mermaid
graph TD
subgraph Input_Sequence
S[State Tokens]
A[Action Tokens]
RTG[Reward-to-Go Tokens]
end
Input_Sequence --> Embed[Embedding Layers]
Embed --> Hooks[Activation Hooks]
subgraph Transformer_Block
Hooks --> Attn[Multi-Head Attention]
Attn --> MLP[MLP Layers]
MLP --> Res[Residual Stream]
end
Res --> DLA[Direct Logit Attribution]
Res --> SAE[Sparse Autoencoder]
Res --> Output[Action Logits]
subgraph Interpretability_&_Safety
DLA -.-> Analysis
DLA -.-> MAD[Functional Attribution MAD]
SAE -.-> Features
SAE -.-> Auditor[Deceptive Alignment Auditor]
Intervention[Activation Patching] -.-> Hooks
Output & S --> Directer[Dynamic Rejection Steering]
Directer -.-> |Feedback Adjust Alpha| Hooks
end
subgraph Interactive_Surgeon_Dashboard
Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
Output -.-> Surgeon
end
```
---
## Capabilities
### Causal Mediation and Attribution
* **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
* **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
* **Path Patching**: Traces how information flows through specific connections between model components.
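
The idea behind Direct Logit Attribution can be sketched in a few lines. Because the residual stream is a sum of component outputs, projecting each component onto the unembedding direction of one action splits that action's logit into per-component contributions. This toy example ignores the final layer norm (which is not linear), and the component names and numbers are made up for illustration; it is not the code in `attribution.py`.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def direct_logit_attribution(
    component_outputs: Dict[str, Vector], unembed_direction: Vector
) -> Dict[str, float]:
    """Contribution of each component's residual write to one action logit."""
    return {name: dot(out, unembed_direction) for name, out in component_outputs.items()}


# Hypothetical outputs written to a 3-dimensional residual stream.
components = {
    "embed": [0.5, 0.0, -0.5],
    "L0H0": [1.0, 0.0, 0.0],
    "L0H1": [0.0, -1.0, 0.0],
    "L0_mlp": [0.0, 0.5, 2.0],
}
# Hypothetical unembedding column for the action being explained.
w_u_action = [1.0, 2.0, 0.5]

attribution = direct_logit_attribution(components, w_u_action)

# Sanity check: contributions add up to the logit of the summed residual.
residual = [sum(values) for values in zip(*components.values())]
full_logit = dot(residual, w_u_action)
print(attribution, full_logit)
```

Here `L0_mlp` pushes the logit up and `L0H1` pushes it down; activation patching and path patching then test whether such correlational attributions hold up causally.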
### Feature Discovery and Analysis
* **Sparse Autoencoders (SAEs)**: Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
* **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
* **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
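
As a rough illustration of the sparse decomposition step, the sketch below runs the forward pass of a TopK sparse autoencoder in the form given by Gao et al. (2024): latents are `TopK(W_enc (x - b_dec))` and the reconstruction is `W_dec z + b_dec`. The weights are hand-picked toy values with an expansion factor of 2, and the function is an assumption-laden sketch rather than the code in `sae_manager.py`.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def topk_sae_forward(
    x: Vector, w_enc: Matrix, w_dec: Matrix, b_dec: Vector, k: int
) -> Tuple[Vector, Vector]:
    """Return (sparse latents, reconstruction) for one residual-stream vector."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = matvec(w_enc, centered)
    # Keep only the k largest pre-activations; zero out the rest.
    keep = set(sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k])
    latents = [pre_acts[i] if i in keep else 0.0 for i in range(len(pre_acts))]
    reconstruction = [r + b for r, b in zip(matvec(w_dec, latents), b_dec)]
    return latents, reconstruction


# Toy weights: d_model = 2, four latents (expansion factor 2).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

latents, reconstruction = topk_sae_forward([3.0, -1.0], w_enc, w_dec, b_dec, k=2)
print(latents, reconstruction)
```

With `k=2`, only two of the four latents are active for this input, yet the input is reconstructed exactly; real SAEs trade reconstruction error against this kind of sparsity.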
### Behavioral Steering & Safety Auditing
* **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
* **Dynamic Rejection Steering (Directer)**: Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
* **Deceptive Alignment Auditing**: Uses SAE feature decomposition to identify a "situational awareness switch" feature in deceptively aligned model organisms (agents that behave differently when watched versus unwatched), then traces the circuit of attention heads that activates it.
* **Functional Attribution MAD**: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
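
One way to picture the feedback loop behind Dynamic Rejection Steering is shown below. A steering vector is added to the residual stream with strength `alpha`; if the resulting action distribution puts too much probability on illegal actions, `alpha` is scaled back and the check is repeated. Everything here is an assumption made for illustration: the function name, the thresholds, the shrink factor, and the toy `logits_fn` that stands in for the rest of the model. It is not the implementation in `safety.py`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_with_rejection(
    resid: Vector,
    steering_vector: Vector,
    logits_fn: Callable[[Vector], Vector],
    illegal_actions: Sequence[int],
    alpha: float = 4.0,
    max_illegal_mass: float = 0.2,
    shrink: float = 0.5,
    max_steps: int = 10,
) -> Tuple[float, Vector]:
    """Shrink alpha until the steered policy keeps illegal mass under the limit."""
    for _ in range(max_steps):
        steered = [r + alpha * v for r, v in zip(resid, steering_vector)]
        probs = softmax(logits_fn(steered))
        illegal_mass = sum(probs[a] for a in illegal_actions)
        if illegal_mass <= max_illegal_mass:
            return alpha, probs
        alpha *= shrink  # feedback: scale the steering strength back
    # No acceptable strength found: fall back to the unsteered policy.
    return 0.0, softmax(logits_fn(list(resid)))


# Toy setup: three actions, the residual stream maps directly to logits,
# and action 2 is treated as illegal.
resid = [2.0, 0.0, -2.0]
steering_vector = [0.0, 0.5, 1.0]
alpha, probs = steer_with_rejection(
    resid, steering_vector, logits_fn=lambda r: list(r), illegal_actions=[2]
)
print(alpha, probs)
```

In this toy run the initial strength of 4.0 is rejected and the loop settles on 2.0, where the illegal action holds roughly 9% of the probability mass.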
### Interactive Surgical Auditing & Peer Review
* **Interactive Circuit Surgery**: Provides real-time interactive ablation tools for nodes (heads, MLPs) and communication paths (edges). Severed pathways take effect immediately through custom forward hooks.
* **Live Behavioral Audits**: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
* **Neuronpedia Export**: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.
---
## Project Structure
```text
DT-Circuits/
├── src/
│   ├── dashboard/
│   │   └── app.py               # Streamlit-based visualization UI
│   ├── data/
│   │   └── harvester.py         # PPO-based expert trajectory harvester
│   ├── interpretability/
│   │   ├── acdc.py              # Automated Circuit Discovery logic
│   │   ├── attribution.py       # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py   # Interactive node & path ablation engine
│   │   ├── evolution.py         # Training Dynamics Analysis
│   │   ├── induction_scan.py    # Induction head detection logic
│   │   ├── neuronpedia.py       # Neuronpedia publishing client
│   │   ├── nla.py               # Natural Language Autoencoder Explainer
│   │   ├── patching.py          # Causal activation patching tools
│   │   ├── path_patching.py     # Path-based causal intervention engine
│   │   ├── safety.py            # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py       # SAE deployment and anomaly detection
│   │   ├── steering.py          # Steering vector generation and injection
│   │   └── universality.py      # Cross-architecture feature mapping
│   ├── models/
│   │   └── hooked_dt.py         # TransformerLens-wrapped Decision Transformer
│   ├── config.py                # Centralized hyperparameter management
│   └── utils/
├── tests/                       # Unit tests for all modules
├── config.yaml                  # External hyperparameter storage
├── requirements.txt
└── docs/
```
---
## Configuration
Hyperparameters are managed in two layers, for both ease of use and research reproducibility:
1. **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
2. **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.
### Key Configuration Sections
| Section | Description | Key Parameters |
| :--- | :--- | :--- |
| **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
| **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
| **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` |
| **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |
**Example: Independent Data Control**
You can control the amount of data used for general training vs. interpretability separately:
```yaml
data:
num_episodes: 1000 # Episodes for training the DT teacher
sae:
num_episodes: 500 # Episodes for extracting SAE activations
```
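
The two-layer design described in this section (dataclass defaults in `src/config.py`, overridden by values from `config.yaml`) can be sketched as follows. The class names, default values, and `apply_overrides` helper are hypothetical, chosen only to mirror the parameter names in the table above. The `overrides` dict stands in for the parsed YAML file, which in practice could be obtained with PyYAML's `yaml.safe_load`.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class ModelConfig:
    n_layers: int = 2        # placeholder defaults, not the project's values
    d_model: int = 128
    n_heads: int = 4
    max_length: int = 20


@dataclass
class DataConfig:
    env_id: str = "MiniGrid-Empty-5x5-v0"  # placeholder environment id
    num_episodes: int = 1000


@dataclass
class SAEConfig:
    expansion_factor: int = 4
    k: int = 16
    num_episodes: int = 500


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Recursively copy values from a nested dict onto a dataclass instance."""
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)  # descend into a config section
        else:
            setattr(cfg, key, value)
    return cfg


# Stand-in for the dict produced by parsing config.yaml.
overrides = {"data": {"num_episodes": 200}, "sae": {"num_episodes": 50, "k": 8}}
cfg = apply_overrides(Config(), overrides)
print(cfg)
```

Rejecting unknown keys catches typos in the YAML file early, while any parameter left out of the file keeps its dataclass default.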
---
## Execution Modes: Installation and Usage
There are two primary ways to run and interact with the **DT-Circuits** framework depending on your research needs:
---
### Way 1: Interactive Cloud Demo (Hugging Face Spaces)
For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:
* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
> [!NOTE]
> **Demo Constraints:**
> * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
> * **Sliced Dataset:** Trajectory datasets are sliced down to a lightweight demo set under a **10 MB limit** (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to limit storage and memory footprint.
> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled.
---
### Way 2: Clone and Run Locally (Full Pipeline)
For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.
#### Local Environment Setup
First, clone the repository, set up a virtual environment, and install dependencies:
```bash
git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
#### Option 2.1: Simple Workflows via Makefile
The workspace includes a [Makefile](./Makefile) that runs common research pipelines with single commands:
```bash
make setup # Set up local environment & install requirements
make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard # Run the Streamlit visualization dashboard locally
```
#### Option 2.2: Granular Control via Bash & Python
For research flexibility, execute each step of the pipeline manually using granular terminal scripts:
1. **Trajectories & Model Training**
Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
```bash
python scripts/train_dt.py
```
2. **TopK Sparse Autoencoder (SAE) Training**
Train sparse autoencoders on target activation layers:
```bash
python scripts/train_sae.py
```
3. **Interactive Analysis**
Launch the Streamlit visualization engine locally to run audits with custom weights:
```bash
streamlit run src/dashboard/app.py
```
---
## Documentation
Detailed technical documentation for specific modules:
* [Circuit Discovery](./docs/circuit_discovery.md)
* [Causal Intervention](./docs/activation_patching.md)
* [SAEs and Steering](./docs/sae_steering.md)
* [Safety Auditing & Steering](./docs/safety_auditing.md)
---
## Foundational Research & References
This framework implements and builds upon the following foundational methodologies:
* **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345): Reinforcement learning as sequence modeling.
* **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html): Mathematical foundations of mechanistic interpretability.
* **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997): Algorithmic discovery of subgraphs.
* **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
* **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248): Control via residual stream vector additions.
* **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969): Inter-component causal mediation.
---
## Citation
```bibtex
@software{dt_circuits2026,
author = {Sadhumitha S.},
title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
year = {2026},
url = {https://github.com/sadhumitha-s/DT-Circuits}
}
```
---
## License
Apache 2.0