
title: DT-Explorer
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

DT-Circuits: Mechanistic Interpretability for Decision Transformers

DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

Live Interactive Demo: DT-Explorer on Hugging Face Spaces

Core Objectives
Technical Overview
Capabilities
Project Structure
Installation and Usage
Documentation
Foundational Research & References
Citation
License

Core Objectives

Map Information Flow: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
Causal Verification: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
Feature Decomposition: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
Behavioral Control: Modify agent decisions at inference time by manipulating internal activations.
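
As a point of reference for the token types named above, the following is a minimal sketch of how a Decision Transformer input sequence is commonly laid out, following Chen et al., 2021: each timestep contributes a Reward-to-Go, a State, and an Action token. The helper names `returns_to_go` and `interleave` are hypothetical and are not taken from this repository.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Reward-to-Go at step t is the sum of rewards from t to the episode end."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def interleave(
    rtg: Sequence[float], states: Sequence[object], actions: Sequence[object]
) -> List[Tuple[str, object]]:
    """Lay out one trajectory as (RTG_1, s_1, a_1, RTG_2, s_2, a_2, ...)."""
    tokens: List[Tuple[str, object]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens += [("rtg", g), ("state", s), ("action", a)]
    return tokens


rewards = [0.0, 0.5, 1.0]
rtg = returns_to_go(rewards)
tokens = interleave(rtg, ["s0", "s1", "s2"], [2, 2, 5])
for kind, value in tokens:
    print(kind, value)
```

Attribution methods can then ask how much each token type at each timestep contributes to the predicted action.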

Technical Overview

The framework centers on HookedDT, a Decision Transformer implementation that exposes activation hooks and manages an activation cache.
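
To illustrate what hooking and caching mean in practice, here is a small pure-Python sketch. `ToyHookedModel` and its stage names are hypothetical stand-ins; the method name `run_with_cache` mirrors the TransformerLens convention, but the real HookedDT interface may differ.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Hook = Callable[[Vector], Vector]


class ToyHookedModel:
    """Toy model with two named hook points that can be observed or edited."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Hook]] = {}

    def _hook_point(
        self, name: str, value: Vector, cache: Optional[Dict[str, Vector]]
    ) -> Vector:
        # Apply any registered hooks, then record the (possibly edited) activation.
        for hook in self.hooks.get(name, []):
            value = hook(value)
        if cache is not None:
            cache[name] = list(value)
        return value

    def forward(
        self, x: Vector, cache: Optional[Dict[str, Vector]] = None
    ) -> Vector:
        h = self._hook_point("embed", [2.0 * v for v in x], cache)
        return self._hook_point("resid", [v + 1.0 for v in h], cache)

    def run_with_cache(self, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}
        return self.forward(x, cache), cache


model = ToyHookedModel()
out, cache = model.run_with_cache([1.0, 2.0])

# Zero-ablate the "embed" activation by registering a hook that overwrites it.
model.hooks["embed"] = [lambda act: [0.0 for _ in act]]
ablated, _ = model.run_with_cache([1.0, 2.0])
print(out, cache["embed"], ablated)
```

The same mechanism underlies both analysis (reading the cache) and intervention (editing activations through hooks).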

Information Flow Diagram

graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end

    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]
    
    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end

    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]

    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end

    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end

Capabilities

Causal Mediation and Attribution

Direct Logit Attribution (DLA): Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
Activation Patching: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
Path Patching: Traces how information flows through specific connections between model components.
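
The first two techniques can be sketched on a toy linear model in which every component writes a vector into the residual stream and the logits are a linear readout of their sum. All names below (`clean`, `corrupted`, `W_U`, the component labels) are illustrative assumptions, and the final layer norm that a real model would apply is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_actions = 8, 4
names = ["embed", "L0H0", "L0H1", "L0_mlp"]
W_U = rng.normal(size=(d_model, n_actions))  # unembedding to action logits


def make_run(rng):
    """Hypothetical per-component writes into the residual stream for one input."""
    return {name: rng.normal(size=d_model) for name in names}


clean, corrupted = make_run(rng), make_run(rng)


def logits_from(outputs):
    # The residual stream is the sum of all component outputs.
    return sum(outputs.values()) @ W_U


def direct_logit_attribution(outputs):
    # Each component's direct contribution is its own output read through W_U.
    return {name: vec @ W_U for name, vec in outputs.items()}


def patch(corrupted_run, clean_run, name):
    # Activation patching: swap one component's activation in from the clean run.
    patched = dict(corrupted_run)
    patched[name] = clean_run[name]
    return patched


dla = direct_logit_attribution(clean)
target = int(np.argmax(logits_from(clean)))
baseline = logits_from(corrupted)[target]
patch_effects = {
    name: logits_from(patch(corrupted, clean, name))[target] - baseline
    for name in names
}
for name in names:
    print(f"{name}: DLA={dla[name][target]:+.3f} patch={patch_effects[name]:+.3f}")
```

In this linear toy the per-component contributions sum exactly to the logits. In a real transformer, patching also captures indirect effects through downstream components, which DLA does not.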

Feature Discovery and Analysis

Sparse Autoencoders (SAEs): Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
Induction Scanning: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
Automated Circuit Discovery (ACDC): Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
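
For the SAE entry, a minimal NumPy sketch of a TopK sparse autoencoder forward pass in the style of Gao et al., 2024 is shown below. It assumes that `expansion_factor` sets the dictionary width as a multiple of `d_model` and that `k` is the number of active latents per token. The weights are random and the function name `topk_sae_forward` is hypothetical, so this is not the repository's implementation.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode, keep only the k largest pre-activations per row, then decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]  # indices of the top-k entries
    latents = np.zeros_like(pre)
    np.put_along_axis(latents, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    latents = np.maximum(latents, 0.0)  # ReLU so that active features are non-negative
    recon = latents @ W_dec + b_dec
    return latents, recon


rng = np.random.default_rng(0)
d_model, expansion_factor, k = 8, 4, 4
d_sae = d_model * expansion_factor

W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()  # tied initialization, a common starting point
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)

x = rng.normal(size=(5, d_model))  # stand-in for residual stream activations
latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print("active latents per row:", np.count_nonzero(latents, axis=-1))
print("reconstruction MSE:", float(np.mean((recon - x) ** 2)))
```

Training would minimize the reconstruction error; the hard TopK constraint replaces the L1 sparsity penalty used in earlier SAE variants.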

Behavioral Steering & Safety Auditing

Activation Steering: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
Dynamic Rejection Steering (Directer): Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
Deceptive Alignment Auditing: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms compared under watched versus unwatched conditions) and traces the circuit of attention heads that activate it.
Functional Attribution MAD: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
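
The feedback loop described for Dynamic Rejection Steering might look roughly like the sketch below. It assumes the steering vector's effect can be expressed as an additive shift in logit space, which holds exactly only for a linear unembedding without a final layer norm. The function name, thresholds, and decay schedule are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rejection_steer(
    base_logits: Sequence[float],
    steer_logits: Sequence[float],
    illegal: Sequence[int],
    alpha: float = 1.0,
    max_illegal_mass: float = 0.05,
    decay: float = 0.5,
    max_iters: int = 20,
) -> Tuple[float, List[float]]:
    """Shrink alpha until illegal actions receive at most max_illegal_mass."""
    for _ in range(max_iters):
        steered = [b + alpha * s for b, s in zip(base_logits, steer_logits)]
        probs = softmax(steered)
        if sum(probs[i] for i in illegal) <= max_illegal_mass:
            return alpha, probs
        alpha *= decay  # feedback step: scale the steering magnitude back
    # Give up on steering and fall back to the unsteered policy.
    return 0.0, softmax(base_logits)


base = [2.0, 1.0, -4.0]   # unsteered action logits
steer = [0.0, 0.0, 6.0]   # steering pushes toward action 2
alpha, probs = rejection_steer(base, steer, illegal=[2])
print(alpha, [round(p, 3) for p in probs])
```

Here the full-strength steer would put roughly 42% of the probability mass on the illegal action, so the loop halves `alpha` once before accepting.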

Interactive Surgical Auditing & Peer Review

Interactive Circuit Surgery: Provides real-time, interactive ablation tools for nodes (heads, MLPs) and communication paths (edges). Severed pathways are applied to the running model dynamically through custom forward hooks.
Live Behavioral Audits: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
Neuronpedia Export: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.

Project Structure

DT-Circuits/
├── src/
│   ├── dashboard/          
│   │   └── app.py          # Streamlit-based visualization UI
│   ├── data/               
│   │   └── harvester.py    # PPO-based expert trajectory harvester
│   ├── interpretability/   
│   │   ├── acdc.py         # Automated Circuit Discovery logic
│   │   ├── attribution.py  # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py # Interactive node & path ablation engine
│   │   ├── evolution.py    # Training Dynamics Analysis
│   │   ├── induction_scan.py # Induction head detection logic
│   │   ├── neuronpedia.py  # Neuronpedia publishing client
│   │   ├── nla.py          # Natural Language Autoencoder Explainer
│   │   ├── patching.py     # Causal activation patching tools
│   │   ├── path_patching.py # Path-based causal intervention engine
│   │   ├── safety.py       # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py  # SAE deployment and anomaly detection
│   │   ├── steering.py     # Steering vector generation and injection
│   │   └── universality.py # Cross-architecture feature mapping
│   ├── models/             
│   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
│   ├── config.py           # Centralized hyperparameter management
│   └── utils/              
├── tests/                  # Unit tests for all modules
├── config.yaml             # External hyperparameter storage
├── requirements.txt 
└── docs/

Configuration

Hyperparameters are managed through a two-part system that balances ease of use with research reproducibility:

config.yaml: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
src/config.py: Defines the underlying structure using Python dataclasses. It automatically loads overrides from config.yaml at runtime.

Key Configuration Sections

| Section | Description | Key Parameters |
| --- | --- | --- |
| `model` | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
| `data` | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
| `train` | DT training hyperparameters | `lr`, `epochs`, `seed` |
| `sae` | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |

Example: Independent Data Control. You can set the amount of data used for DT training and for interpretability (SAE activation extraction) separately:

data:
  num_episodes: 1000  # Episodes for training the DT teacher

sae:
  num_episodes: 500   # Episodes for extracting SAE activations
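
One way dataclass defaults could be combined with overrides like these is sketched below, using only the standard library. The class names, default values, and the `apply_overrides` helper are assumptions made for illustration; the actual `src/config.py` may be organized differently. The `overrides` dict stands in for what a YAML parser such as `yaml.safe_load` would return.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class DataConfig:
    env_id: str = "MiniGrid-Empty-8x8-v0"  # hypothetical default
    num_episodes: int = 100


@dataclass
class SAEConfig:
    expansion_factor: int = 4
    k: int = 16
    num_episodes: int = 100


@dataclass
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Recursively copy override values onto a dataclass, rejecting unknown keys."""
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# Parsed form of the YAML example above.
overrides = {"data": {"num_episodes": 1000}, "sae": {"num_episodes": 500}}
cfg = apply_overrides(Config(), overrides)
print(cfg.data.num_episodes, cfg.sae.num_episodes, cfg.sae.k)
```

Rejecting unknown keys catches typos in the YAML file early instead of silently ignoring them.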

Execution Modes: Installation and Usage

There are two primary ways to run and interact with the DT-Circuits framework depending on your research needs:

Way 1: Interactive Cloud Demo (Hugging Face Spaces)

For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:

Demo Link: DT-Explorer on Hugging Face Spaces

Demo Constraints:

CPU-Bound Resources: Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.

Sliced Dataset: Trajectory datasets are sliced down to a lightweight demo set under a 10 MB limit (defined in deploy.sh) to meet storage and memory constraints.

Read-Only / Ephemeral Container: Uses pre-baked static weights (mini_dt.pt) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled.

Way 2: Clone and Run Locally (Full Pipeline)

For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

Local Environment Setup

First, clone the repository, set up a virtual environment, and install dependencies:

git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits

python -m venv venv
source venv/bin/activate  

pip install -r requirements.txt

Option 2.1: Simple Workflows via Makefile

The workspace includes a standardized Makefile to orchestrate common research pipelines with single commands:

make setup      # Set up local environment & install requirements
make train      # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard  # Run the Streamlit visualization dashboard locally

Option 2.2: Granular Control via Bash & Python

For research flexibility, execute each step of the pipeline manually using granular terminal scripts:

Trajectories & Model Training: Harvest teacher trajectories and train the target Decision Transformer (HookedDT):
```
python scripts/train_dt.py
```
TopK Sparse Autoencoder (SAE) Training: Train sparse autoencoders on target activation layers:
```
python scripts/train_sae.py
```
Interactive Analysis: Launch the Streamlit visualization engine locally to run audits with custom weights:
```
streamlit run src/dashboard/app.py
```

Documentation

Detailed technical documentation for specific modules:

Foundational Research & References

This framework implements and builds upon the following foundational methodologies:

Decision Transformers: Chen et al., 2021 — Reinforcement learning as sequence modeling.
Transformer Circuits: Elhage et al., 2021 — Mathematical foundations of mechanistic interpretability.
ACDC (Automated Circuit Discovery): Conmy et al., 2023 — Algorithmic discovery of subgraphs.
Sparse Autoencoders (SAEs): Bricken et al., 2023 (monosemantic features) & Gao et al., 2024 (TopK SAEs).
Activation Steering: Turner et al., 2023 — Control via residual stream vector additions.
Path Patching: Goldowsky-Dill et al., 2023 — Inter-component causal mediation.

Citation

@software{dt_circuits2026,
  author = {Sadhumitha S.},
  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
  year = {2026},
  url = {https://github.com/sadhumitha-s/DT-Circuits}
}

License

Apache 2.0