---
title: DT-Explorer
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# DT-Circuits: Mechanistic Interpretability for Decision Transformers

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.x](https://img.shields.io/badge/PyTorch-2.x-red.svg)](https://pytorch.org/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Framework: TransformerLens](https://img.shields.io/badge/Framework-TransformerLens-orange.svg)](https://github.com/TransformerLensOrg/TransformerLens)

DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

---

## Table of Contents

- [Core Objectives](#core-objectives)
- [Technical Overview](#technical-overview)
- [Capabilities](#capabilities)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [Installation and Usage](#execution-modes-installation-and-usage)
- [Documentation](#documentation)
- [Foundational Research & References](#foundational-research--references)
- [Citation](#citation)
- [License](#license)

---

## Core Objectives

1. **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
2. **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
3. **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
4. **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations.

---

## Technical Overview

The framework centers on `HookedDT`, a Decision Transformer implementation that supports activation hooking and cache management.

### Information Flow Diagram

```mermaid
graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end

    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]

    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end

    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]

    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end

    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end
```

---

## Capabilities

### Causal Mediation and Attribution

* **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
* **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
* **Path Patching**: Traces how information flows through specific connections between model components.

### Feature Discovery and Analysis

* **Sparse Autoencoders (SAEs)**: Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
* **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
* **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.

### Behavioral Steering & Safety Auditing

* **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
* **Dynamic Rejection Steering (Directer)**: Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
* **Deceptive Alignment Auditing**: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms watched vs unwatched) and traces the circuit of attention heads that activate it.
* **Functional Attribution MAD**: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.

### Interactive Surgical Auditing & Peer Review

* **Interactive Circuit Surgery**: Provides real-time interactive node (Heads, MLPs) and communication path (edges) ablation tools. Severed pathways dynamically update the underlying architecture using custom forward hooks.
* **Live Behavioral Audits**: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
* **Neuronpedia Export**: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.
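
To make the Dynamic Rejection Steering idea from the capabilities above more concrete, here is a minimal, standard-library-only sketch of the feedback loop. It is not the project's implementation: the function name `rejection_steer`, its parameters (`max_illegal_mass`, `decay`, `max_iters`), and the simplification that steering adds `alpha * steer_delta` directly to the action logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rejection_steer(base_logits, steer_delta, illegal, alpha=1.0,
                    max_illegal_mass=0.1, decay=0.5, max_iters=10):
    """Shrink the steering strength until illegal actions stay under budget.

    Hypothetical helper: assumes steering shifts the logits linearly,
    i.e. steered = base + alpha * steer_delta.
    """
    for _ in range(max_iters):
        steered = [b + alpha * d for b, d in zip(base_logits, steer_delta)]
        probs = softmax(steered)
        illegal_mass = sum(probs[i] for i in illegal)
        if illegal_mass <= max_illegal_mass:
            return alpha, probs
        alpha *= decay  # feedback step: scale the steering magnitude back
    # Give up on steering; note the unsteered policy may itself exceed the budget.
    return 0.0, softmax(base_logits)


# Toy example: action 2 is illegal and the steering vector pushes toward it.
alpha, probs = rejection_steer([3.0, 0.0, 0.0], [0.0, 0.0, 4.0], illegal={2})
print(alpha)  # alpha is halved three times, ending at 0.125
```

A multiplicative decay keeps the loop short and never flips the direction of the steering vector; a real system would re-run the hooked forward pass at each step rather than assume a linear effect on the logits.
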
---

## Project Structure

```text
DT-Circuits/
├── src/
│   ├── dashboard/
│   │   └── app.py               # Streamlit-based visualization UI
│   ├── data/
│   │   └── harvester.py         # PPO-based expert trajectory harvester
│   ├── interpretability/
│   │   ├── acdc.py              # Automated Circuit Discovery logic
│   │   ├── attribution.py       # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py   # Interactive node & path ablation engine
│   │   ├── evolution.py         # Training Dynamics Analysis
│   │   ├── induction_scan.py    # Induction head detection logic
│   │   ├── neuronpedia.py       # Neuronpedia publishing client
│   │   ├── nla.py               # Natural Language Autoencoder Explainer
│   │   ├── patching.py          # Causal activation patching tools
│   │   ├── path_patching.py     # Path-based causal intervention engine
│   │   ├── safety.py            # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py       # SAE deployment and anomaly detection
│   │   ├── steering.py          # Steering vector generation and injection
│   │   └── universality.py      # Cross-architecture feature mapping
│   ├── models/
│   │   └── hooked_dt.py         # TransformerLens-wrapped Decision Transformer
│   ├── config.py                # Centralized hyperparameter management
│   └── utils/
├── tests/                       # Unit tests for all modules
├── config.yaml                  # External hyperparameter storage
├── requirements.txt
└── docs/
```

---

## Configuration

Hyperparameters are managed through a two-part system that balances ease of use with research reproducibility:

1. **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
2. **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.
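
The dataclass-plus-override pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `src/config.py`: the class names, the default values, and the `apply_overrides` helper are hypothetical, and the override dict stands in for what a YAML loader would return after parsing `config.yaml`.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class ModelConfig:
    # Placeholder defaults, chosen only for illustration.
    n_layers: int = 2
    d_model: int = 128
    n_heads: int = 4
    max_length: int = 20


@dataclass
class SAEConfig:
    expansion_factor: int = 8
    k: int = 16
    num_episodes: int = 500


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg, overrides):
    """Recursively copy values from a nested dict onto a dataclass instance."""
    known = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in known:
            # Fail loudly on typos instead of silently ignoring them.
            raise KeyError(f"Unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# In practice this dict would come from parsing config.yaml (for example
# with PyYAML's yaml.safe_load); it is hard-coded here to stay dependency-free.
cfg = apply_overrides(Config(), {"model": {"d_model": 256}})
print(cfg.model.d_model, cfg.model.n_layers)
```

Keeping defaults in typed dataclasses while loading only overrides from YAML means a partial `config.yaml` still yields a complete configuration, and unknown keys are rejected rather than ignored.
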
### Key Configuration Sections

| Section | Description | Key Parameters |
| :--- | :--- | :--- |
| **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
| **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
| **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` |
| **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |

**Example: Independent Data Control**

You can control the amount of data used for general training and for interpretability separately:

```yaml
data:
  num_episodes: 1000  # Episodes for training the DT teacher
sae:
  num_episodes: 500   # Episodes for extracting SAE activations
```

---

## Execution Modes: Installation and Usage

There are two primary ways to run and interact with the **DT-Circuits** framework, depending on your research needs:

---

### Way 1: Interactive Cloud Demo (Hugging Face Spaces)

For instant visual exploration, path intervention, and alignment auditing without any local setup, launch the web dashboard directly:

* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

> [!NOTE]
> **Demo Constraints:**
> * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
> * **Sliced Dataset:** Trajectory datasets are dynamically sliced down to a lightweight demo set under a **10MB limit** (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to meet storage and memory constraints.
> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent state is disabled.
---

### Way 2: Clone and Run Locally (Full Pipeline)

For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

#### Local Environment Setup

First, clone the repository, set up a virtual environment, and install dependencies:

```bash
git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

#### Option 2.1: Simple Workflows via Makefile

The workspace includes a [Makefile](./Makefile) that runs common research pipelines with single commands:

```bash
make setup      # Set up local environment & install requirements
make train      # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard  # Run the Streamlit visualization dashboard locally
```

#### Option 2.2: Granular Control via Bash & Python

For research flexibility, run each step of the pipeline manually:

1. **Trajectories & Model Training**

   Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):

   ```bash
   python scripts/train_dt.py
   ```

2. **TopK Sparse Autoencoder (SAE) Training**

   Train sparse autoencoders on target activation layers:

   ```bash
   python scripts/train_sae.py
   ```

3. **Interactive Analysis**

   Launch the Streamlit visualization engine locally to run audits with custom weights:

   ```bash
   streamlit run src/dashboard/app.py
   ```

---

## Documentation

Detailed technical documentation for specific modules:

* [Circuit Discovery](./docs/circuit_discovery.md)
* [Causal Intervention](./docs/activation_patching.md)
* [SAEs and Steering](./docs/sae_steering.md)
* [Safety Auditing & Steering](./docs/safety_auditing.md)

---

## Foundational Research & References

This framework implements and builds upon the following foundational methodologies:

* **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345): Reinforcement learning as sequence modeling.
* **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html): Mathematical foundations of mechanistic interpretability.
* **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997): Algorithmic discovery of subgraphs.
* **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
* **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248): Control via residual stream vector additions.
* **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969): Inter-component causal mediation.

---

## Citation

```bibtex
@software{dt_circuits2026,
  author = {Sadhumitha S.},
  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
  year = {2026},
  url = {https://github.com/sadhumitha-s/DT-Circuits}
}
```

---

## License

Apache 2.0