---
title: DT-Explorer
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# DT-Circuits: Mechanistic Interpretability for Decision Transformers

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.x](https://img.shields.io/badge/PyTorch-2.x-red.svg)](https://pytorch.org/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Framework: TransformerLens](https://img.shields.io/badge/Framework-TransformerLens-orange.svg)](https://github.com/TransformerLensOrg/TransformerLens)

DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

---

## Table of Contents
- [Core Objectives](#core-objectives)
- [Technical Overview](#technical-overview)
- [Capabilities](#capabilities)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [Installation and Usage](#execution-modes-installation-and-usage)
- [Documentation](#documentation)
- [Foundational Research & References](#foundational-research--references)
- [Citation](#citation)
- [License](#license)

---

## Core Objectives

1.  **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
2.  **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
3.  **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
4.  **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations.

---

## Technical Overview

The framework centers around `HookedDT`, a Decision Transformer implementation that allows for activation hooking and cache management.
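
To make the hook-and-cache idea concrete, here is a minimal pure-Python sketch. `ToyHookedModel`, its hook-point names (`embed`, `block0`), and its arithmetic are hypothetical stand-ins for illustration; they are not the `HookedDT` API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Hook = Callable[[Vector], Optional[Vector]]


class ToyHookedModel:
    """Hypothetical two-stage model with named hook points (not the HookedDT API)."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Hook]] = {"embed": [], "block0": []}

    def _hook_point(self, name: str, value: Vector, cache: Dict[str, Vector]) -> Vector:
        # Run each registered hook; a hook may return a replacement activation.
        for hook in self.hooks[name]:
            result = hook(value)
            if result is not None:
                value = result
        cache[name] = list(value)  # store a copy of the post-hook activation
        return value

    def run_with_cache(self, tokens: Vector) -> Tuple[float, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}
        embed = self._hook_point("embed", [2.0 * t for t in tokens], cache)
        block0 = self._hook_point("block0", [e + 1.0 for e in embed], cache)
        return sum(block0), cache


model = ToyHookedModel()
clean_out, clean_cache = model.run_with_cache([1.0, 2.0])

# Zero-ablate the second position at the "embed" hook point.
model.hooks["embed"].append(lambda act: [act[0], 0.0])
ablated_out, ablated_cache = model.run_with_cache([1.0, 2.0])
```

Comparing `clean_out` with `ablated_out` shows the downstream effect of the intervention, while the two caches record the activations at every hook point for later analysis.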

### Information Flow Diagram

```mermaid
graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end

    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]
    
    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end

    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]

    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end

    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end
```

---

## Capabilities

### Causal Mediation and Attribution
*   **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
*   **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
*   **Path Patching**: Traces how information flows through specific connections between model components.
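
As a rough illustration of Direct Logit Attribution, the sketch below projects each component's write to the residual stream onto the unembedding direction of one action. The component names, vectors, and the `direct_logit_attribution` helper are hypothetical, and the sketch assumes no final normalization layer sits between the residual stream and the logits.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def direct_logit_attribution(
    component_outputs: Dict[str, Vector], unembed_column: Vector
) -> Dict[str, float]:
    """Project each component's residual-stream write onto one action's unembedding direction."""
    return {name: dot(out, unembed_column) for name, out in component_outputs.items()}


# Hypothetical per-component writes into a 3-dimensional residual stream.
component_outputs = {
    "embed": [0.5, 0.0, 0.25],
    "L0H0": [1.0, -0.5, 0.0],
    "L0H1": [0.0, 0.25, 0.0],
    "L0_mlp": [-0.5, 0.5, 1.0],
}
w_u_action = [1.0, 2.0, -1.0]  # unembedding column for the action of interest

attributions = direct_logit_attribution(component_outputs, w_u_action)

# The residual stream is the sum of all writes, so the contributions add up to the logit.
residual = [sum(vals) for vals in zip(*component_outputs.values())]
total_logit = dot(residual, w_u_action)
```

Because the projection is linear, the per-component attributions sum exactly to the action logit, which is what lets each head or MLP be credited separately.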

### Feature Discovery and Analysis
*   **Sparse Autoencoders (SAEs)**: Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
*   **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
*   **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
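
The TopK sparsity constraint used by this style of SAE can be sketched in a few lines of plain Python. The weights below are hypothetical, and the sketch simplifies Gao et al. (2024) by omitting the pre-encoder bias; `topk_sae_forward` is an illustrative helper, not a function from this repository.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def topk_sae_forward(
    x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix, k: int
) -> Tuple[Vector, Vector]:
    """Encode x, keep the k largest pre-activations, and decode from those latents only."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    # Latents outside the top k are zeroed; the ReLU guards against negative survivors.
    latents = [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]
    recon = [sum(latents[j] * w_dec[j][d] for j in range(len(latents))) for d in range(len(x))]
    return latents, recon


# d_model = 2 with an expansion factor of 2 gives 4 latent features (hypothetical weights).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = w_enc  # tied decoder, chosen only to keep the example small

latents, recon = topk_sae_forward([3.0, -1.0], w_enc, b_enc, w_dec, k=2)
```

With `k=2`, only two of the four latents are active for this input, and the decoder reconstructs the input from those two feature directions alone.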

### Behavioral Steering & Safety Auditing
*   **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
*   **Dynamic Rejection Steering (Directer)**: Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
*   **Deceptive Alignment Auditing**: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned model organisms (agents that behave differently when watched versus unwatched) and traces the circuit of attention heads that activates it.
*   **Functional Attribution MAD**: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
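
The feedback loop behind dynamic rejection steering can be approximated as follows. This is a sketch under strong assumptions: the steering vector's effect is modeled as a fixed shift on the action logits, and the threshold, shrink factor, and function names are hypothetical choices rather than the framework's actual parameters.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(value - m) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_with_rejection(
    base_logits: Sequence[float],
    steering_logit_shift: Sequence[float],
    illegal_actions: Sequence[int],
    alpha: float = 4.0,
    max_illegal_mass: float = 0.15,
    shrink: float = 0.5,
    max_rounds: int = 10,
) -> Tuple[float, List[float]]:
    """Shrink alpha until the steered policy puts at most max_illegal_mass on illegal actions."""
    for _ in range(max_rounds):
        logits = [b + alpha * s for b, s in zip(base_logits, steering_logit_shift)]
        probs = softmax(logits)
        if sum(probs[a] for a in illegal_actions) <= max_illegal_mass:
            return alpha, probs
        alpha *= shrink  # steering pushed too far toward illegal actions: back off
    return 0.0, softmax(base_logits)  # fall back to the unsteered policy


final_alpha, probs = steer_with_rejection(
    base_logits=[2.0, 0.0, -1.0],
    steering_logit_shift=[0.0, 0.5, 1.0],
    illegal_actions=[2],
)
```

In this toy case the initial strength of 4.0 is halved twice before the probability of the illegal action drops below the threshold, so steering continues at a reduced magnitude instead of being switched off.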

### Interactive Surgical Auditing & Peer Review
*   **Interactive Circuit Surgery**: Provides real-time, interactive tools for ablating nodes (attention heads, MLPs) and communication paths (edges). Severed pathways take effect immediately through custom forward hooks.
*   **Live Behavioral Audits**: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
*   **Neuronpedia Export**: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.

---

## Project Structure

```text
DT-Circuits/
├── src/
│   ├── dashboard/          
│   │   └── app.py          # Streamlit-based visualization UI
│   ├── data/               
│   │   └── harvester.py    # PPO-based expert trajectory harvester
│   ├── interpretability/   
│   │   ├── acdc.py         # Automated Circuit Discovery logic
│   │   ├── attribution.py  # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py # Interactive node & path ablation engine
│   │   ├── evolution.py    # Training Dynamics Analysis
│   │   ├── induction_scan.py # Induction head detection logic
│   │   ├── neuronpedia.py  # Neuronpedia publishing client
│   │   ├── nla.py          # Natural Language Autoencoder Explainer
│   │   ├── patching.py     # Causal activation patching tools
│   │   ├── path_patching.py # Path-based causal intervention engine
│   │   ├── safety.py       # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py  # SAE deployment and anomaly detection
│   │   ├── steering.py     # Steering vector generation and injection
│   │   └── universality.py # Cross-architecture feature mapping
│   ├── models/             
│   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
│   ├── config.py           # Centralized hyperparameter management
│   └── utils/              
├── tests/                  # Unit tests for all modules
├── config.yaml             # External hyperparameter storage
├── requirements.txt 
└── docs/                        
```

---

## Configuration

Hyperparameters are managed through a two-part system that balances ease of use with research reproducibility:

1.  **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
2.  **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.

### Key Configuration Sections

| Section | Description | Key Parameters |
| :--- | :--- | :--- |
| **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
| **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
| **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` |
| **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |

**Example: Independent Data Control**

You can control the amount of data used for DT training and for SAE activation extraction separately:
```yaml
data:
  num_episodes: 1000  # Expert episodes harvested for DT training

sae:
  num_episodes: 500   # Episodes for extracting SAE activations
```
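
One way the dataclass-plus-override pattern could work is sketched below. The class names, default values, and `apply_overrides` helper are hypothetical and do not mirror the contents of `src/config.py`; a plain dictionary stands in for the parsed `config.yaml`.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class DataConfig:
    env_id: str = "MiniGrid-Empty-8x8-v0"  # hypothetical default
    num_episodes: int = 100                # hypothetical default


@dataclass
class SAEConfig:
    expansion_factor: int = 4  # hypothetical default
    k: int = 8                 # hypothetical default
    num_episodes: int = 100    # hypothetical default


@dataclass
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(config: Config, overrides: Dict[str, Dict[str, Any]]) -> Config:
    """Copy known keys from a parsed YAML mapping onto the dataclass defaults."""
    for section_name, values in overrides.items():
        section = getattr(config, section_name)
        known = {f.name for f in fields(section)}
        for key, value in values.items():
            if key not in known:
                raise KeyError(f"Unknown option {section_name}.{key}")
            setattr(section, key, value)
    return config


# Stand-in for the mapping a YAML parser would return for the example above.
parsed_yaml = {"data": {"num_episodes": 1000}, "sae": {"num_episodes": 500}}
config = apply_overrides(Config(), parsed_yaml)
```

Rejecting unknown keys catches typos in the YAML file early, and any option left out of the file keeps its dataclass default.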

---

## Execution Modes: Installation and Usage

There are two primary ways to run and interact with the **DT-Circuits** framework depending on your research needs:

---

### Way 1: Interactive Cloud Demo (Hugging Face Spaces)

For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:

* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

> [!NOTE]
> **Demo Constraints:**
> * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
> * **Sliced Dataset:** Trajectory datasets are dynamically sliced down to a lightweight demo set under a **10MB limit** (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to fit storage and memory constraints.
> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled.

---

### Way 2: Clone and Run Locally (Full Pipeline)

For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

#### Local Environment Setup
First, clone the repository, set up a virtual environment, and install dependencies:
```bash
git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits

python -m venv venv
source venv/bin/activate  

pip install -r requirements.txt
```

#### Option 2.1: Simple Workflows via Makefile
The workspace includes a [Makefile](./Makefile) that orchestrates common research pipelines with single commands:

```bash
make setup      # Set up local environment & install requirements
make train      # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard  # Run the Streamlit visualization dashboard locally
```

#### Option 2.2: Granular Control via Bash & Python
For finer control, run each step of the pipeline manually:

1. **Trajectories & Model Training**
   Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
   ```bash
   python scripts/train_dt.py
   ```

2. **TopK Sparse Autoencoder (SAE) Training**
   Train sparse autoencoders on target activation layers:
   ```bash
   python scripts/train_sae.py
   ```

3. **Interactive Analysis**
   Launch the Streamlit visualization engine locally to run audits with custom weights:
   ```bash
   streamlit run src/dashboard/app.py
   ```

---

## Documentation

Detailed technical documentation for specific modules:
*   [Circuit Discovery](./docs/circuit_discovery.md)
*   [Causal Intervention](./docs/activation_patching.md)
*   [SAEs and Steering](./docs/sae_steering.md)
*   [Safety Auditing & Steering](./docs/safety_auditing.md)

---

## Foundational Research & References

This framework implements and builds upon the following foundational methodologies:

*   **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345) — Reinforcement learning as sequence modeling.
*   **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html) — Mathematical foundations of mechanistic interpretability.
*   **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997) — Algorithmic discovery of subgraphs.
*   **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
*   **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248) — Control via residual stream vector additions.
*   **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969) — Inter-component causal mediation.

---

## Citation

```bibtex
@software{dt_circuits2026,
  author = {Sadhumitha S.},
  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
  year = {2026},
  url = {https://github.com/sadhumitha-s/DT-Circuits}
}
```

---

## License
Apache 2.0