---
title: DT-Explorer
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# DT-Circuits: Mechanistic Interpretability for Decision Transformers

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.x](https://img.shields.io/badge/PyTorch-2.x-red.svg)](https://pytorch.org/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Framework: TransformerLens](https://img.shields.io/badge/Framework-TransformerLens-orange.svg)](https://github.com/TransformerLensOrg/TransformerLens)

DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

---

## Table of Contents
- [Core Objectives](#core-objectives)
- [Technical Overview](#technical-overview)
- [Capabilities](#capabilities)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [Execution Modes: Installation and Usage](#execution-modes-installation-and-usage)
- [Documentation](#documentation)
- [Foundational Research & References](#foundational-research--references)
- [Citation](#citation)
- [License](#license)

---

## Core Objectives

1.  **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
2.  **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
3.  **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
4.  **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations.
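
As a reference point for the three token types named in the first objective, the sketch below computes undiscounted reward-to-go values and interleaves them with states and actions. The (RTG, state, action) ordering follows Chen et al., 2021; whether `HookedDT` uses exactly this layout is an assumption, and the function names are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted reward-to-go: rtg[t] is the sum of rewards from step t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, int, Any]]:
    """Flatten per-step triples into (RTG, state, action) token order (assumed layout)."""
    tokens: List[Tuple[str, int, Any]] = []
    for t, (g, s, a) in enumerate(zip(rtg, states, actions)):
        tokens += [("rtg", t, g), ("state", t, s), ("action", t, a)]
    return tokens


# Toy sparse-reward episode: the only reward arrives at the final step.
rewards = [0.0, 0.0, 1.0]
rtg = returns_to_go(rewards)
tokens = interleave(rtg, ["s0", "s1", "s2"], [2, 2, 5])
```

Each timestep contributes three tokens, so attribution results are usually reported per token type as well as per position.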

---

## Technical Overview

The framework centers around `HookedDT`, a Decision Transformer implementation that allows for activation hooking and cache management.
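
To make "activation hooking and cache management" concrete, here is a minimal pure-Python sketch of the idea. `ToyHookedModel`, its layer names, and its methods are hypothetical stand-ins written for illustration; they are not the `HookedDT` API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
# A hook sees (activation, hook_name) and may return a replacement activation.
Hook = Callable[[Vector, str], Optional[Vector]]


class ToyHookedModel:
    """Two named toy layers with a hook point after each one."""

    def __init__(self) -> None:
        self.layers: Dict[str, Callable[[Vector], Vector]] = {
            "blocks.0": lambda x: [v + 1.0 for v in x],
            "blocks.1": lambda x: [v * 2.0 for v in x],
        }

    def run_with_hooks(self, x: Vector, hooks: Dict[str, Hook]) -> Vector:
        for name, layer in self.layers.items():
            x = layer(x)
            hook = hooks.get(name)
            if hook is not None:
                replaced = hook(x, name)
                if replaced is not None:  # a hook may overwrite the activation
                    x = replaced
        return x

    def run_with_cache(self, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}

        def save(act: Vector, name: str) -> None:
            cache[name] = list(act)  # store a copy; returning None leaves the run unchanged

        out = self.run_with_hooks(x, {name: save for name in self.layers})
        return out, cache


model = ToyHookedModel()
clean_out, clean_cache = model.run_with_cache([1.0, 2.0])

# Patch the clean "blocks.0" activation into a run on a different input.
patched_out = model.run_with_hooks(
    [10.0, 20.0], {"blocks.0": lambda act, name: clean_cache[name]}
)
```

Because everything downstream of `blocks.0` depends only on that activation in this toy model, the patched run reproduces the clean output. This is the basic mechanism that activation patching relies on.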

### Information Flow Diagram

```mermaid
graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end

    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]
    
    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end

    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]

    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end

    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end
```

---

## Capabilities

### Causal Mediation and Attribution
*   **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
*   **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
*   **Path Patching**: Traces how information flows through specific connections between model components.
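
The idea behind Direct Logit Attribution can be sketched in a few lines: when the final readout is linear, the logit for an action is the sum of each component's output projected through the unembedding matrix. The sketch below assumes no final LayerNorm (real implementations typically freeze or fold in its scale), and the component names and values are made up for illustration.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (rows = actions) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def direct_logit_attribution(
    component_outputs: Dict[str, Vector], W_U: Matrix, action: int
) -> Dict[str, float]:
    """Project each component's residual-stream write onto one action's logit."""
    return {name: matvec(W_U, out)[action] for name, out in component_outputs.items()}


# Toy residual stream (d_model = 2) written to by three components.
components = {
    "embed": [0.5, 0.0],
    "L0H0": [1.0, -1.0],
    "L0.mlp": [0.0, 2.0],
}
W_U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 actions x d_model

residual = [sum(vals) for vals in zip(*components.values())]
logits = matvec(W_U, residual)
dla = direct_logit_attribution(components, W_U, action=2)
```

A useful sanity check is that the per-component attributions sum to the logit itself, which holds here because the readout is linear.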

### Feature Discovery and Analysis
*   **Sparse Autoencoders (SAEs)**: Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
*   **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
*   **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
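
For readers unfamiliar with TopK SAEs, the following is a minimal sketch of a forward pass in the style of Gao et al., 2024: encode the centered activation, keep only the `k` largest latents, and decode. The weights are hand-picked toy values, the function name is hypothetical, and this is not the code in `sae_manager.py`.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def topk_sae_forward(
    x: Vector, W_enc: Matrix, W_dec: Matrix, b_dec: Vector, k: int
) -> Tuple[Vector, Vector]:
    """Return (sparse latents, reconstruction) for one residual-stream vector.

    W_enc and W_dec both have shape (n_latents, d_model); row i of W_dec is
    the decoder direction for latent i.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [sum(w * c for w, c in zip(row, centered)) for row in W_enc]
    # Keep the k largest pre-activations and zero out the rest.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    latents = [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]
    recon = list(b_dec)
    for z, direction in zip(latents, W_dec):
        if z != 0.0:
            for j, d in enumerate(direction):
                recon[j] += z * d
    return latents, recon


# d_model = 2 with expansion_factor = 2 gives 4 latents (toy, tied weights).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [3.0, -1.0]
latents_k1, recon_k1 = topk_sae_forward(x, W, W, [0.0, 0.0], k=1)
latents_k2, recon_k2 = topk_sae_forward(x, W, W, [0.0, 0.0], k=2)
```

With `k=1` only the dominant direction survives and the reconstruction is lossy; with `k=2` this toy input is reconstructed exactly. The `expansion_factor` and `k` settings in the `sae` config section control this trade-off between sparsity and reconstruction quality.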

### Behavioral Steering & Safety Auditing
*   **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
*   **Dynamic Rejection Steering (Directer)**: Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
*   **Deceptive Alignment Auditing**: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned model organisms (agents compared under watched vs. unwatched conditions) and traces the circuit of attention heads that activates it.
*   **Functional Attribution MAD**: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
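
One way to picture the feedback loop in Dynamic Rejection Steering is sketched below: add a steering vector scaled by `alpha`, check how much probability lands on illegal actions, and shrink `alpha` until the result is acceptable. The acceptance threshold, the halving schedule, and all names here are assumptions made for illustration; the actual criterion in `safety.py` may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_with_rejection(
    resid: Vector,
    steer_vec: Vector,
    readout: Callable[[Vector], Vector],
    illegal: Sequence[int],
    alpha: float = 4.0,
    max_illegal_prob: float = 0.05,  # assumed acceptance threshold
    shrink: float = 0.5,             # assumed backoff factor
    max_steps: int = 10,
) -> Tuple[Vector, float]:
    """Return (steered residual, accepted alpha), backing off while unsafe."""
    for _ in range(max_steps):
        steered = [r + alpha * v for r, v in zip(resid, steer_vec)]
        probs = softmax(readout(steered))
        if sum(probs[a] for a in illegal) <= max_illegal_prob:
            return steered, alpha
        alpha *= shrink  # steering pushed too much mass onto illegal actions
    return list(resid), 0.0  # give up and leave the residual unsteered


# Toy setup: identity readout over 3 actions, action 2 is illegal.
resid = [2.0, 0.0, -2.0]
steer_vec = [0.0, 1.0, 1.0]  # boosts action 1 but also leaks into action 2
steered, accepted_alpha = steer_with_rejection(
    resid, steer_vec, readout=lambda r: list(r), illegal=[2]
)
illegal_prob = softmax(steered)[2]
```

In this toy case the initial strength is rejected twice before a smaller `alpha` keeps the illegal-action probability under the threshold.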

### Interactive Surgical Auditing & Peer Review
*   **Interactive Circuit Surgery**: Provides real-time interactive node (Heads, MLPs) and communication path (edges) ablation tools. Severed pathways dynamically update the underlying architecture using custom forward hooks.
*   **Live Behavioral Audits**: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
*   **Neuronpedia Export**: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.

---

## Project Structure

```text
DT-Circuits/
├── src/
│   ├── dashboard/          
│   │   └── app.py          # Streamlit-based visualization UI
│   ├── data/               
│   │   └── harvester.py    # PPO-based expert trajectory harvester
│   ├── interpretability/   
│   │   ├── acdc.py         # Automated Circuit Discovery logic
│   │   ├── attribution.py  # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py # Interactive node & path ablation engine
│   │   ├── evolution.py    # Training Dynamics Analysis
│   │   ├── induction_scan.py # Induction head detection logic
│   │   ├── neuronpedia.py  # Neuronpedia publishing client
│   │   ├── nla.py          # Natural Language Autoencoder Explainer
│   │   ├── patching.py     # Causal activation patching tools
│   │   ├── path_patching.py # Path-based causal intervention engine
│   │   ├── safety.py       # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py  # SAE deployment and anomaly detection
│   │   ├── steering.py     # Steering vector generation and injection
│   │   └── universality.py # Cross-architecture feature mapping
│   ├── models/             
│   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
│   ├── config.py           # Centralized hyperparameter management
│   └── utils/              
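├── scripts/                # Training and deployment scripts (train_dt.py, train_sae.py, deploy.sh)
├── tests/                  # Unit tests for all modules
├── Makefile                # Shortcuts for setup, training, and the dashboard
├── config.yaml             # External hyperparameter storage
├── requirements.txt
└── docs/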
```

---

## Configuration

Hyperparameters are managed in two layers, balancing ease of use with research reproducibility:

1.  **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
2.  **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.
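
The two-layer pattern described above can be sketched as dataclass defaults plus a recursive override step. The field names mirror the configuration table, but the default values and the `apply_overrides` helper are hypothetical, and the `overrides` dict stands in for whatever a YAML loader would return from `config.yaml`.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class ModelConfig:
    n_layers: int = 2      # hypothetical defaults, for illustration only
    d_model: int = 128
    n_heads: int = 4
    max_length: int = 20


@dataclass
class SAEConfig:
    expansion_factor: int = 4
    k: int = 16
    num_episodes: int = 500


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Recursively copy override values onto a dataclass, rejecting unknown keys."""
    known = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in known:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# Stand-in for the parsed contents of config.yaml.
overrides = {"model": {"d_model": 64}, "sae": {"k": 8}}
cfg = apply_overrides(Config(), overrides)
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo in the YAML file fails loudly instead of silently falling back to a default.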

### Key Configuration Sections

| Section | Description | Key Parameters |
| :--- | :--- | :--- |
| **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
| **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
| **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` |
| **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |

**Example: Independent Data Control**

You can control the amount of data used for general training vs. interpretability separately:
```yaml
data:
  num_episodes: 1000  # Expert episodes harvested for DT training

sae:
  num_episodes: 500   # Episodes for extracting SAE activations
```

---

## Execution Modes: Installation and Usage

There are two primary ways to run and interact with the **DT-Circuits** framework depending on your research needs:

---

### Way 1: Interactive Cloud Demo (Hugging Face Spaces)

For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:

* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

> [!NOTE]
> **Demo Constraints:**
> * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
> * **Sliced Dataset:** Trajectory datasets are dynamically sliced down to a lightweight demo set under a **10MB limit** (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to fit storage and memory constraints.
> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled.

---

### Way 2: Clone and Run Locally (Full Pipeline)

For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

#### Local Environment Setup
First, clone the repository, set up a virtual environment, and install dependencies:
```bash
git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits

python -m venv venv
source venv/bin/activate  

pip install -r requirements.txt
```

#### Option 2.1: Simple Workflows via Makefile
The workspace includes a [Makefile](./Makefile) that runs common research pipelines with single commands:

```bash
make setup      # Set up local environment & install requirements
make train      # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard  # Run the Streamlit visualization dashboard locally
```

#### Option 2.2: Granular Control via Bash & Python
For more flexibility, run each step of the pipeline manually with its own script:

1. **Trajectories & Model Training**
   Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
   ```bash
   python scripts/train_dt.py
   ```

2. **TopK Sparse Autoencoder (SAE) Training**
   Train sparse autoencoders on target activation layers:
   ```bash
   python scripts/train_sae.py
   ```

3. **Interactive Analysis**
   Launch the Streamlit visualization engine locally to run audits with custom weights:
   ```bash
   streamlit run src/dashboard/app.py
   ```

---

## Documentation

Detailed technical documentation for specific modules:
*   [Circuit Discovery](./docs/circuit_discovery.md)
*   [Causal Intervention](./docs/activation_patching.md)
*   [SAEs and Steering](./docs/sae_steering.md)
*   [Safety Auditing & Steering](./docs/safety_auditing.md)

---

## Foundational Research & References

This framework implements and builds upon the following foundational methodologies:

*   **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345) — Reinforcement learning as sequence modeling.
*   **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html) — Mathematical foundations of mechanistic interpretability.
*   **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997) — Algorithmic discovery of subgraphs.
*   **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
*   **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248) — Control via residual stream vector additions.
*   **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969) — Inter-component causal mediation.

---

## Citation

```bibtex
@software{dt_circuits2026,
  author = {Sadhumitha S.},
  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
  year = {2026},
  url = {https://github.com/sadhumitha-s/DT-Circuits}
}
```

---

## License
Apache 2.0