title: DT-Explorer
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

DT-Circuits: Mechanistic Interpretability for Decision Transformers

Hugging Face Spaces | Python 3.9+ | PyTorch 2.x | License: Apache 2.0 | Framework: TransformerLens

DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

Live Interactive Demo: DT-Explorer on Hugging Face Spaces



Core Objectives

  1. Map Information Flow: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
  2. Causal Verification: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
  3. Feature Decomposition: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
  4. Behavioral Control: Modify agent decisions at inference time by manipulating internal activations.

Technical Overview

The framework centers on HookedDT, a TransformerLens-wrapped Decision Transformer implementation that exposes activation hooks and caches intermediate activations for later analysis.
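To make "activation hooking and cache management" concrete, here is a minimal pure-Python sketch of the idea. `ToyHookedModel`, its hook-point names, and the `run_with_cache` method are illustrative assumptions written in the TransformerLens style; they are not the actual HookedDT API, and the two "layers" are arbitrary arithmetic chosen so the numbers are easy to follow.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
# A hook receives an activation and may return a replacement (or None to leave it unchanged).
HookFn = Callable[[Vector], Optional[Vector]]

HOOK_POINTS = ("embed", "blocks.0.out", "blocks.1.out")  # hypothetical names


class ToyHookedModel:
    """Toy stand-in for a hooked model; NOT the real HookedDT."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[HookFn]] = {}

    def add_hook(self, name: str, fn: HookFn) -> None:
        self.hooks.setdefault(name, []).append(fn)

    def reset_hooks(self) -> None:
        self.hooks = {}

    def _hook_point(self, name: str, value: Vector) -> Vector:
        # Run every hook registered at this point, in registration order.
        for fn in self.hooks.get(name, []):
            result = fn(value)
            if result is not None:
                value = result
        return value

    def forward(self, x: Vector) -> Vector:
        resid = self._hook_point("embed", list(x))
        # "Layer 0" writes 2 * resid into the residual stream.
        layer0 = self._hook_point("blocks.0.out", [2.0 * v for v in resid])
        resid = [r + o for r, o in zip(resid, layer0)]
        # "Layer 1" writes 0.5 * resid into the residual stream.
        layer1 = self._hook_point("blocks.1.out", [0.5 * v for v in resid])
        return [r + o for r, o in zip(resid, layer1)]

    def run_with_cache(self, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}

        def make_saver(name: str) -> HookFn:
            def save(value: Vector) -> None:
                cache[name] = list(value)  # observe only, never modify
            return save

        previous = {name: list(fns) for name, fns in self.hooks.items()}
        for name in HOOK_POINTS:
            self.add_hook(name, make_saver(name))
        try:
            output = self.forward(x)
        finally:
            self.hooks = previous  # remove the temporary caching hooks
        return output, cache


model = ToyHookedModel()
clean_out, cache = model.run_with_cache([1.0, 2.0])

# Zero-ablate layer 0 by replacing its output at the hook point.
model.add_hook("blocks.0.out", lambda v: [0.0] * len(v))
ablated_out = model.forward([1.0, 2.0])
model.reset_hooks()
```

The same two primitives, observing an activation and replacing it, are what attribution, patching, and steering tools build on.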

Information Flow Diagram

graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end

    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]
    
    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end

    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]

    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end

    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end

Capabilities

Causal Mediation and Attribution

  • Direct Logit Attribution (DLA): Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
  • Activation Patching: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
  • Path Patching: Traces how information flows through specific connections between model components.
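Direct Logit Attribution relies on the fact that the final residual stream is a sum of component outputs, so a logit that is linear in that stream splits into one term per component. The sketch below shows that decomposition in pure Python. The component names, vectors, and unembedding direction are made-up illustrative values, and the final layer norm that a real model applies before unembedding is ignored as a simplifying assumption.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def direct_logit_attribution(
    component_outputs: Dict[str, List[float]],
    unembed_direction: List[float],
) -> Dict[str, float]:
    """Project each component's residual-stream write onto one action's unembedding direction."""
    return {
        name: dot(output, unembed_direction)
        for name, output in component_outputs.items()
    }


# Hypothetical per-component writes to a 3-dimensional residual stream.
component_outputs = {
    "embed": [0.5, 0.0, 1.0],
    "L0H0": [1.0, -1.0, 0.0],
    "L0H1": [0.0, 2.0, 0.0],
    "mlp0": [0.5, 0.0, -0.5],
}
unembed_direction = [1.0, 0.5, -1.0]  # direction for a single action logit

attributions = direct_logit_attribution(component_outputs, unembed_direction)

# The final residual stream is the sum of all component writes.
final_resid = [sum(vals) for vals in zip(*component_outputs.values())]
total_logit = dot(final_resid, unembed_direction)
```

Because the map is linear, the per-component attributions sum to the full logit, which is a useful sanity check when wiring this up against a real model.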

Feature Discovery and Analysis

  • Sparse Autoencoders (SAEs): Decompose the residual stream into a set of sparse features, helping to resolve polysemanticity.
  • Induction Scanning: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
  • Automated Circuit Discovery (ACDC): Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
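As a rough illustration of the sparse decomposition step, the following pure-Python sketch implements the forward pass of a TopK sparse autoencoder: encode, keep only the `k` largest pre-activations, and decode. The weights are tiny hand-picked values, the ReLU applied to the surviving activations is an assumption, and none of this is taken from the repository's `sae_manager.py`.

```python
from typing import List, Tuple


def topk_sae_forward(
    x: List[float],
    W_enc: List[List[float]],  # shape: n_features x d_model
    b_enc: List[float],        # shape: n_features
    W_dec: List[List[float]],  # shape: n_features x d_model
    b_dec: List[float],        # shape: d_model
    k: int,
) -> Tuple[List[float], List[float]]:
    """Return (sparse feature activations, reconstruction of x)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(W_enc, b_enc)
    ]
    # Indices of the k largest pre-activations; everything else is zeroed.
    top = set(sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k])
    feats = [max(pre_acts[i], 0.0) if i in top else 0.0 for i in range(len(pre_acts))]
    recon = [
        b_dec[j] + sum(feats[i] * W_dec[i][j] for i in range(len(feats)))
        for j in range(len(x))
    ]
    return feats, recon


# d_model = 2 with expansion_factor = 2 gives 4 dictionary features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
zeros4, zeros2 = [0.0] * 4, [0.0] * 2
x = [3.0, 1.0]

feats_k1, recon_k1 = topk_sae_forward(x, W, zeros4, W, zeros2, k=1)
feats_k2, recon_k2 = topk_sae_forward(x, W, zeros4, W, zeros2, k=2)
```

With `k=1` only the strongest feature survives and the reconstruction loses the second coordinate; with `k=2` the input is recovered exactly. This is the trade-off that `expansion_factor` and `k` control in the `sae` config section.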

Behavioral Steering & Safety Auditing

  • Activation Steering: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
  • Dynamic Rejection Steering (Directer): Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
  • Deceptive Alignment Auditing: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms compared under watched versus unwatched conditions) and traces the circuit of attention heads that activate it.
  • Functional Attribution MAD: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
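The feedback loop behind Dynamic Rejection Steering can be sketched as follows: apply the steering at strength `alpha`, check how much probability mass lands on disallowed actions, and shrink `alpha` until the result is acceptable. This is a minimal sketch under stated assumptions, not the logic in `safety.py`: the function name, the threshold, and the halving schedule are hypothetical, and the effect of the steering vector is modelled as a fixed additive shift on the logits, whereas a real implementation would re-run the forward pass with the vector injected into the residual stream.

```python
import math
from typing import List, Set, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_with_rejection(
    base_logits: List[float],
    steering_delta: List[float],  # assumed logit shift at alpha = 1
    unsafe_actions: Set[int],
    alpha: float = 1.0,
    max_unsafe_prob: float = 0.5,  # hypothetical threshold
    decay: float = 0.5,            # hypothetical back-off factor
    max_iters: int = 10,
) -> Tuple[float, List[float]]:
    """Return the accepted steering strength and the resulting action distribution."""
    for _ in range(max_iters):
        steered = [b + alpha * d for b, d in zip(base_logits, steering_delta)]
        probs = softmax(steered)
        unsafe_mass = sum(probs[a] for a in unsafe_actions)
        if unsafe_mass <= max_unsafe_prob:
            return alpha, probs
        alpha *= decay  # scale the steering back and try again
    # Give up on steering entirely and fall back to the unsteered policy.
    return 0.0, softmax(base_logits)


base = [0.0, 0.0, 0.0]

# Steering toward action 2, which is marked unsafe: alpha gets scaled back.
alpha_unsafe, probs_unsafe = steer_with_rejection(base, [0.0, 0.0, 4.0], {2})

# Steering toward action 0, which is safe: alpha is accepted unchanged.
alpha_safe, probs_safe = steer_with_rejection(base, [4.0, 0.0, 0.0], {2})
```

Note that the fallback returns the unsteered distribution, which is only as safe as the base policy itself.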

Interactive Surgical Auditing & Peer Review

  • Interactive Circuit Surgery: Provides real-time tools for ablating nodes (attention heads, MLPs) and communication paths (edges). Severed pathways take effect immediately on the running model through custom forward hooks.
  • Live Behavioral Audits: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
  • Neuronpedia Export: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.
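The difference between ablating a node and severing an edge is easy to miss, so here is a toy pure-Python sketch of both on a small computational graph. The graph, the gains, and the use of zero ablation are all illustrative assumptions; the actual surgeon engine operates on the model through forward hooks rather than on an explicit graph like this one.

```python
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]


def run_circuit(
    input_value: float,
    nodes: List[str],            # must be in topological order
    edges: List[Edge],           # (source, destination) pairs
    gains: Dict[str, float],     # each node scales the sum of its inputs
    ablated_nodes: FrozenSet[str] = frozenset(),
    severed_edges: FrozenSet[Edge] = frozenset(),
) -> float:
    """Evaluate the toy circuit and return the value at the "output" node."""
    values = {"input": input_value}
    for node in nodes:
        if node in ablated_nodes:
            values[node] = 0.0  # zero ablation: the node writes nothing
            continue
        incoming = sum(
            values[src]
            for src, dst in edges
            if dst == node and (src, dst) not in severed_edges
        )
        values[node] = gains.get(node, 1.0) * incoming
    return values["output"]


nodes = ["L0H0", "L0H1", "mlp0", "output"]
edges = [
    ("input", "L0H0"), ("input", "L0H1"),
    ("L0H0", "mlp0"), ("L0H1", "mlp0"),
    ("L0H0", "output"), ("mlp0", "output"),
]
gains = {"L0H0": 2.0, "L0H1": 3.0, "mlp0": 0.5}

baseline = run_circuit(1.0, nodes, edges, gains)
# Node ablation removes L0H0 everywhere, including its contribution via mlp0.
node_ablated = run_circuit(1.0, nodes, edges, gains, ablated_nodes=frozenset({"L0H0"}))
# Edge ablation removes only the direct L0H0 -> output path; L0H0 still feeds mlp0.
edge_severed = run_circuit(1.0, nodes, edges, gains, severed_edges=frozenset({("L0H0", "output")}))
```

Severing the single edge leaves the indirect route through `mlp0` intact, which is why edge-level ablation gives a finer picture of a circuit than node-level ablation alone.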

Project Structure

DT-Circuits/
├── src/
│   ├── dashboard/
│   │   └── app.py          # Streamlit-based visualization UI
│   ├── data/
│   │   └── harvester.py    # PPO-based expert trajectory harvester
│   ├── interpretability/
│   │   ├── acdc.py         # Automated Circuit Discovery logic
│   │   ├── attribution.py  # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py # Interactive node & path ablation engine
│   │   ├── evolution.py    # Training Dynamics Analysis
│   │   ├── induction_scan.py # Induction head detection logic
│   │   ├── neuronpedia.py  # Neuronpedia publishing client
│   │   ├── nla.py          # Natural Language Autoencoder Explainer
│   │   ├── patching.py     # Causal activation patching tools
│   │   ├── path_patching.py # Path-based causal intervention engine
│   │   ├── safety.py       # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py  # SAE deployment and anomaly detection
│   │   ├── steering.py     # Steering vector generation and injection
│   │   └── universality.py # Cross-architecture feature mapping
│   ├── models/
│   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
│   ├── config.py           # Centralized hyperparameter management
│   └── utils/
├── tests/                  # Unit tests for all modules
├── config.yaml             # External hyperparameter storage
├── requirements.txt
└── docs/

Configuration

Hyperparameters are managed through two cooperating files, which keeps everyday use simple while preserving research reproducibility:

  1. config.yaml: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
  2. src/config.py: Defines the underlying structure using Python dataclasses. It automatically loads overrides from config.yaml at runtime.

Key Configuration Sections

| Section | Description | Key Parameters |
|---------|-------------|----------------|
| model | Architecture settings for the Decision Transformer | n_layers, d_model, n_heads, max_length |
| data | Settings for expert trajectory collection | env_id, num_episodes (for DT training) |
| train | DT training hyperparameters | lr, epochs, seed |
| sae | Sparse Autoencoder training hyperparameters | expansion_factor, k, num_episodes (SAE specific) |

Example: Independent Data Control

You can control the amount of data used for general training vs. interpretability separately:

data:
  num_episodes: 1000  # Teacher episodes harvested for training the DT

sae:
  num_episodes: 500   # Episodes for extracting SAE activations
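One plausible way for dataclass defaults to pick up overrides from a parsed YAML file is sketched below. This is an illustration of the pattern, not the contents of `src/config.py`: the class names, the `apply_overrides` helper, and every default value (including the `env_id`) are hypothetical. The overrides are given as a plain dict so the example has no third-party dependency; in practice that dict would typically come from PyYAML's `yaml.safe_load`.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class DataConfig:
    env_id: str = "MiniGrid-Empty-8x8-v0"  # hypothetical default
    num_episodes: int = 1000


@dataclass
class SAEConfig:
    expansion_factor: int = 8  # hypothetical default
    k: int = 32                # hypothetical default
    num_episodes: int = 500


@dataclass
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Recursively copy values from a nested dict onto a dataclass instance."""
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            # Fail loudly on typos instead of silently ignoring them.
            raise KeyError(f"Unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# Stand-in for the dict a YAML parser would return for config.yaml.
overrides = {"data": {"num_episodes": 200}, "sae": {"num_episodes": 50}}
cfg = apply_overrides(Config(), overrides)
```

Keys that are absent from the YAML keep their dataclass defaults, so `config.yaml` only needs to list the values that differ from the defaults.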

Execution Modes: Installation and Usage

There are two primary ways to run and interact with the DT-Circuits framework depending on your research needs:


Way 1: Interactive Cloud Demo (Hugging Face Spaces)

For instant visual exploration, path intervention, and alignment auditing without any local setup, open the DT-Explorer dashboard on Hugging Face Spaces.

Demo Constraints:

  • CPU-Bound Resources: Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations such as ACDC scans may show higher latency than on a local GPU workspace.
  • Sliced Dataset: Trajectory datasets are dynamically sliced down to a lightweight demo set under a 10 MB limit (defined in deploy.sh) to fit storage and memory constraints.
  • Read-Only / Ephemeral Container: Uses pre-baked static weights (mini_dt.pt) and pre-trained SAE checkpoints. Training new models and persisting state are disabled.

Way 2: Clone and Run Locally (Full Pipeline)

For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

Local Environment Setup

First, clone the repository, set up a virtual environment, and install dependencies:

git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits

python -m venv venv
source venv/bin/activate  

pip install -r requirements.txt

Option 2.1: Simple Workflows via Makefile

The workspace includes a standardized Makefile to orchestrate common research pipelines with single commands:

make setup      # Set up local environment & install requirements
make train      # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard  # Run the Streamlit visualization dashboard locally

Option 2.2: Granular Control via Bash & Python

For finer control, run each step of the pipeline manually with its own script:

  1. Trajectories & Model Training: Harvest teacher trajectories and train the target Decision Transformer (HookedDT):

    python scripts/train_dt.py
    
  2. TopK Sparse Autoencoder (SAE) Training: Train sparse autoencoders on target activation layers:

    python scripts/train_sae.py
    
  3. Interactive Analysis: Launch the Streamlit visualization engine locally to run audits with custom weights:

    streamlit run src/dashboard/app.py
    

Documentation

Detailed technical documentation for specific modules:


Foundational Research & References

This framework implements and builds upon the following foundational methodologies:


Citation

@software{dt_circuits2026,
  author = {Sadhumitha S.},
  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
  year = {2026},
  url = {https://github.com/sadhumitha-s/DT-Circuits}
}

License

Apache 2.0