	---
	title: DT-Explorer
	emoji: 🔍
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	pinned: false
	---

	# DT-Circuits: Mechanistic Interpretability for Decision Transformers

	[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
	[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
	[![PyTorch 2.x](https://img.shields.io/badge/PyTorch-2.x-red.svg)](https://pytorch.org/)
	[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Framework: TransformerLens](https://img.shields.io/badge/Framework-TransformerLens-orange.svg)](https://github.com/TransformerLensOrg/TransformerLens)

	DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

	Live Interactive Demo: [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

	---

	## Table of Contents
	- [Core Objectives](#core-objectives)
	- [Technical Overview](#technical-overview)
	- [Capabilities](#capabilities)
	- [Project Structure](#project-structure)
	- [Configuration](#configuration)
	- [Execution Modes: Installation and Usage](#execution-modes-installation-and-usage)
	- [Documentation](#documentation)
	- [Foundational Research & References](#foundational-research--references)
	- [Citation](#citation)
	- [License](#license)

	---

	## Core Objectives

	1. Map Information Flow: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
	2. Causal Verification: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
	3. Feature Decomposition: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
	4. Behavioral Control: Modify agent decisions at inference time by manipulating internal activations.

	---

	## Technical Overview

	The framework centers around `HookedDT`, a Decision Transformer implementation that allows for activation hooking and cache management.
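
The hook-and-cache pattern can be illustrated with a minimal, dependency-free sketch. `ToyHookedModel` and its two hook points are hypothetical stand-ins, not `HookedDT`'s actual API; the method names only loosely follow TransformerLens conventions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
# A hook receives (activation, hook_name) and may return a replacement activation.
Hook = Callable[[Vector, str], Optional[Vector]]


class ToyHookedModel:
    """Hypothetical two-stage model whose intermediate activations can be hooked."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[Hook]] = {}

    def add_hook(self, name: str, fn: Hook) -> None:
        self.hooks.setdefault(name, []).append(fn)

    def reset_hooks(self) -> None:
        self.hooks.clear()

    def _hook_point(self, name: str, value: Vector) -> Vector:
        # Run every hook registered at this point; a non-None result replaces the activation.
        for fn in self.hooks.get(name, []):
            out = fn(value, name)
            if out is not None:
                value = out
        return value

    def forward(self, x: Vector) -> Vector:
        h = self._hook_point("embed", [2.0 * v for v in x])
        h = self._hook_point("block0", [v + 1.0 for v in h])
        return h

    def run_with_cache(self, x: Vector) -> Tuple[Vector, Dict[str, Vector]]:
        cache: Dict[str, Vector] = {}

        def save(value: Vector, name: str) -> None:
            cache[name] = list(value)  # copy so later edits do not alter the cache

        for name in ("embed", "block0"):
            self.add_hook(name, save)
        try:
            out = self.forward(x)
        finally:
            self.reset_hooks()  # note: this sketch clears all hooks, not only its own
        return out, cache


model = ToyHookedModel()
clean_out, clean_cache = model.run_with_cache([1.0, 2.0])

# Reuse the cache: overwrite "embed" in a second run with the value from the first run.
model.add_hook("embed", lambda value, name: clean_cache[name])
patched_out = model.forward([5.0, 5.0])
model.reset_hooks()
```

Because the only hook point upstream of `block0` is overwritten, the patched run reproduces the first run's output even though its input differs.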

	### Information Flow Diagram

	```mermaid
	graph TD
	subgraph Input_Sequence
	S[State Tokens]
	A[Action Tokens]
	RTG[Reward-to-Go Tokens]
	end

	Input_Sequence --> Embed[Embedding Layers]
	Embed --> Hooks[Activation Hooks]

	subgraph Transformer_Block
	Hooks --> Attn[Multi-Head Attention]
	Attn --> MLP[MLP Layers]
	MLP --> Res[Residual Stream]
	end

	Res --> DLA[Direct Logit Attribution]
	Res --> SAE[Sparse Autoencoder]
	Res --> Output[Action Logits]

	subgraph Interpretability_&_Safety
	DLA -.-> Analysis
	DLA -.-> MAD[Functional Attribution MAD]
	SAE -.-> Features
	SAE -.-> Auditor[Deceptive Alignment Auditor]
	Intervention[Activation Patching] -.-> Hooks

	Output & S --> Directer[Dynamic Rejection Steering]
	Directer -.-> |Feedback Adjust Alpha| Hooks
	end

	subgraph Interactive_Surgeon_Dashboard
	Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
	Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
	Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
	Output -.-> Surgeon
	end
	```
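
The input sequence at the top of the diagram can be sketched in a few lines. The sketch follows the layout described by Chen et al. (2021), where each timestep contributes a reward-to-go, a state, and an action token in that order; whether `HookedDT` uses exactly this ordering is an assumption, and the state and action labels are placeholders.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Reward-to-go at step t is the sum of rewards from t to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[str], actions: Sequence[str]
) -> List[Tuple[str, object]]:
    """Flatten per-timestep triples into one (token_type, value) sequence."""
    tokens: List[Tuple[str, object]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens


# Sparse-reward episode: the only reward arrives on the final step.
rewards = [0.0, 0.0, 1.0]
rtg = returns_to_go(rewards)
sequence = interleave(rtg, ["s0", "s1", "s2"], ["left", "forward", "toggle"])
```

With a single terminal reward, every timestep carries the same reward-to-go, so the conditioning signal stays constant until the goal is reached.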

	---

	## Capabilities

	### Causal Mediation and Attribution
	* Direct Logit Attribution (DLA): Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
	* Activation Patching: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
	* Path Patching: Traces how information flows through specific connections between model components.
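
As a rough illustration of Direct Logit Attribution, the sketch below treats the final residual stream as a sum of component outputs and projects each one onto the unembedding direction of a single action. The component names and numbers are invented, and the final layer norm is ignored, so this is a simplification rather than the method in `attribution.py`.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def direct_logit_attribution(
    component_outputs: Dict[str, Vector], unembed_row: Vector
) -> Dict[str, float]:
    """Project each component's residual-stream write onto one action's unembedding row."""
    return {name: dot(out, unembed_row) for name, out in component_outputs.items()}


# Hypothetical per-component writes into a 3-dimensional residual stream.
components = {
    "embed": [0.5, 0.0, 0.0],
    "L0H0": [1.0, 0.0, -1.0],
    "L0H1": [0.0, 2.0, 0.0],
    "mlp0": [0.0, 0.0, 1.0],
}
unembed_forward = [1.0, 0.5, -2.0]  # unembedding direction for one action

attribution = direct_logit_attribution(components, unembed_forward)

# Sanity check: the per-component contributions sum to the full logit.
residual = [sum(vals) for vals in zip(*components.values())]
total_logit = dot(residual, unembed_forward)
```

The decomposition is exact here only because the map from residual stream to logit is linear; with a layer norm before the unembedding, the scale factor would have to be frozen or approximated.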

	### Feature Discovery and Analysis
	* Sparse Autoencoders (SAEs): Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
	* Induction Scanning: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
	* Automated Circuit Discovery (ACDC): Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
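
The sparse decomposition can be sketched with a TopK activation in the style of Gao et al. (2024): compute encoder pre-activations, keep the `k` largest, and reconstruct from the matching decoder directions. Biases are omitted and the tiny hand-written weights are illustrative assumptions, not trained parameters from this project.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def topk_encode(x: Vector, w_enc: Matrix, k: int) -> Vector:
    """Keep the k largest encoder pre-activations and zero the rest."""
    pre = [sum(w * v for w, v in zip(row, x)) for row in w_enc]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]


def decode(z: Vector, w_dec: Matrix) -> Vector:
    """Reconstruct the input as a weighted sum of decoder rows (one row per feature)."""
    d_model = len(w_dec[0])
    return [sum(z[f] * w_dec[f][j] for f in range(len(z))) for j in range(d_model)]


# d_model = 2 with an expansion factor of 2 gives 4 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_dec = w_enc  # tied weights, purely for the illustration

x = [3.0, -1.0]
z = topk_encode(x, w_enc, k=2)
x_hat = decode(z, w_dec)
```

Only two of the four features fire for this input, and the reconstruction is exact because the two active decoder directions happen to span the input.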

	### Behavioral Steering & Safety Auditing
	* Activation Steering: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
	* Dynamic Rejection Steering (Directer): Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
	* Deceptive Alignment Auditing: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms that behave differently when watched versus unwatched) and traces the circuit of attention heads that activate it.
	* Functional Attribution MAD: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
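
The feedback loop behind Dynamic Rejection Steering can be sketched as follows. The sketch assumes the steering effect is a fixed additive shift on the action logits, whereas a real run would add the vector to the residual stream and recompute the forward pass; the threshold, decay factor, and function names are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_with_rejection(
    base_logits: Sequence[float],
    steer_delta: Sequence[float],
    illegal: Set[int],
    alpha: float = 1.0,
    max_illegal_mass: float = 0.2,
    decay: float = 0.5,
    max_iters: int = 10,
) -> Tuple[List[float], float]:
    """Shrink the steering scale until the illegal-action probability mass is acceptable."""
    for _ in range(max_iters):
        logits = [b + alpha * d for b, d in zip(base_logits, steer_delta)]
        probs = softmax(logits)
        if sum(probs[i] for i in illegal) <= max_illegal_mass:
            return logits, alpha
        alpha *= decay  # steering pushed too far toward illegal actions: back off
    return list(base_logits), 0.0  # give up on steering entirely


# Steering favors action 2, which is illegal in the current state.
steered_logits, final_alpha = steer_with_rejection(
    base_logits=[2.0, 0.0, 0.0],
    steer_delta=[0.0, 0.0, 4.0],
    illegal={2},
)
```

In this example the scale is halved three times before the illegal action's probability drops below the threshold, so the steering is attenuated rather than discarded.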

	### Interactive Surgical Auditing & Peer Review
	* Interactive Circuit Surgery: Provides real-time interactive ablation tools for nodes (heads, MLPs) and communication paths (edges). Ablations are applied to the forward pass at run time through custom forward hooks.
	* Live Behavioral Audits: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
	* Neuronpedia Export: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.
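
The difference between node and edge ablation can be shown on a toy two-component model. The components `A` and `B`, the zero-ablation choice, and the `run` function are assumptions made for the illustration; they do not describe the internals of `circuit_surgeon.py`.

```python
from typing import FrozenSet, List, Tuple

Vector = List[float]


def add(u: Vector, v: Vector) -> Vector:
    return [a + b for a, b in zip(u, v)]


def head_a(resid: Vector) -> Vector:
    return [2.0 * v for v in resid]


def head_b(resid: Vector) -> Vector:
    return [v + 1.0 for v in resid]


def run(
    x: Vector,
    ablated_nodes: FrozenSet[str] = frozenset(),
    ablated_edges: FrozenSet[Tuple[str, str]] = frozenset(),
) -> Vector:
    """Toy residual model: output = embed + A(embed) + B(embed + A(embed))."""
    zeros = [0.0] * len(x)
    embed = list(x)
    # Node ablation: A's output is zeroed everywhere it is used.
    out_a = zeros if "A" in ablated_nodes else head_a(embed)
    # Edge ablation: only B's view of A is removed; A still writes to the output.
    b_in = embed if ("A", "B") in ablated_edges else add(embed, out_a)
    out_b = zeros if "B" in ablated_nodes else head_b(b_in)
    return add(add(embed, out_a), out_b)


baseline = run([1.0])
node_ablated = run([1.0], ablated_nodes=frozenset({"A"}))
edge_ablated = run([1.0], ablated_edges=frozenset({("A", "B")}))
```

The edge-ablated output sits between the baseline and the node-ablated output, which is the signal that separates A's direct effect from its effect routed through B.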

	---

	## Project Structure

	```text
	DT-Circuits/
	├── src/
	│   ├── dashboard/
	│   │   └── app.py                # Streamlit-based visualization UI
	│   ├── data/
	│   │   └── harvester.py          # PPO-based expert trajectory harvester
	│   ├── interpretability/
	│   │   ├── acdc.py               # Automated Circuit Discovery logic
	│   │   ├── attribution.py        # Direct Logit Attribution (DLA)
	│   │   ├── circuit_surgeon.py    # Interactive node & path ablation engine
	│   │   ├── evolution.py          # Training Dynamics Analysis
	│   │   ├── induction_scan.py     # Induction head detection logic
	│   │   ├── neuronpedia.py        # Neuronpedia publishing client
	│   │   ├── nla.py                # Natural Language Autoencoder Explainer
	│   │   ├── patching.py           # Causal activation patching tools
	│   │   ├── path_patching.py      # Path-based causal intervention engine
	│   │   ├── safety.py             # Safety auditing, directer, and deceptive alignment tools
	│   │   ├── sae_manager.py        # SAE deployment and anomaly detection
	│   │   ├── steering.py           # Steering vector generation and injection
	│   │   └── universality.py       # Cross-architecture feature mapping
	│   ├── models/
	│   │   └── hooked_dt.py          # TransformerLens-wrapped Decision Transformer
	│   ├── config.py                 # Centralized hyperparameter management
	│   └── utils/
	├── tests/                        # Unit tests for all modules
	├── config.yaml                   # External hyperparameter storage
	├── requirements.txt
	└── docs/

	---

	## Configuration

	Hyperparameters are managed through a two-part system that supports both ease of use and research reproducibility:

	1. `config.yaml`: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
	2. `src/config.py`: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.

	### Key Configuration Sections

	| Section | Description | Key Parameters |
	| :--- | :--- | :--- |
	| `model` | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
	| `data` | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
	| `train` | DT training hyperparameters | `lr`, `epochs`, `seed` |
	| `sae` | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |

	Example: Independent Data Control
	You can control the amount of data used for general training vs. interpretability separately:
	```yaml
	data:
	  num_episodes: 1000 # Episodes for training the DT teacher

	sae:
	  num_episodes: 500 # Episodes for extracting SAE activations
	```
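
One way the dataclass defaults and YAML overrides could fit together is sketched below. The class names, default values, and `apply_overrides` helper are hypothetical and do not reproduce `src/config.py`; a plain dict stands in for the parsed contents of `config.yaml` so the example needs no YAML library.

```python
from dataclasses import dataclass, field, fields, is_dataclass, replace
from typing import Any, Dict


@dataclass(frozen=True)
class DataConfig:
    env_id: str = "MiniGrid-Empty-5x5-v0"  # placeholder default, not taken from the repo
    num_episodes: int = 1000


@dataclass(frozen=True)
class SAEConfig:
    expansion_factor: int = 4  # placeholder default
    k: int = 8                 # placeholder default
    num_episodes: int = 500


@dataclass(frozen=True)
class Config:
    data: DataConfig = field(default_factory=DataConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Return a copy of cfg with nested overrides applied; unknown keys raise KeyError."""
    known = {f.name for f in fields(cfg)}
    changes: Dict[str, Any] = {}
    for key, value in overrides.items():
        if key not in known:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            value = apply_overrides(current, value)  # recurse into nested sections
        changes[key] = value
    return replace(cfg, **changes)


# Stand-in for the dict a YAML parser would return for config.yaml.
parsed_yaml = {"data": {"num_episodes": 200}, "sae": {"k": 16}}
cfg = apply_overrides(Config(), parsed_yaml)
```

Rejecting unknown keys catches typos in the YAML file early, and frozen dataclasses keep the resolved configuration immutable for the rest of a run.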

	---

	## Execution Modes: Installation and Usage

	There are two primary ways to run and interact with the DT-Circuits framework depending on your research needs:

	---

	### Way 1: Interactive Cloud Demo (Hugging Face Spaces)

	For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:

	* Demo Link: [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)

	> [!NOTE]
	> Demo Constraints:
	> * CPU-Bound Resources: Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
	> * Sliced Dataset: Trajectory datasets are dynamically sliced down to a lightweight demo set under a 10 MB limit (defined in [deploy.sh](./scripts/deploy.sh#L19-L33)) to keep the storage and memory footprint small.
	> * Read-Only / Ephemeral Container: Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent state is disabled.

	---

	### Way 2: Clone and Run Locally (Full Pipeline)

	For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

	#### Local Environment Setup
	First, clone the repository, set up a virtual environment, and install dependencies:
	```bash
	git clone https://github.com/sadhumitha-s/DT-Circuits
	cd DT-Circuits

	python -m venv venv
	source venv/bin/activate

	pip install -r requirements.txt
	```

	#### Option 2.1: Simple Workflows via Makefile
	The workspace includes a standardized [Makefile](./Makefile) to orchestrate common research pipelines with single commands:

	```bash
	make setup # Set up local environment & install requirements
	make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
	make dashboard # Run the Streamlit visualization dashboard locally
	```

	#### Option 2.2: Granular Control via Bash & Python
	For research flexibility, execute each step of the pipeline manually using granular terminal scripts:

	1. Trajectories & Model Training
	Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
	```bash
	python scripts/train_dt.py
	```

	2. TopK Sparse Autoencoder (SAE) Training
	Train sparse autoencoders on target activation layers:
	```bash
	python scripts/train_sae.py
	```

	3. Interactive Analysis
	Launch the Streamlit visualization engine locally to run audits with custom weights:
	```bash
	streamlit run src/dashboard/app.py
	```

	---

	## Documentation

	Detailed technical documentation for specific modules:
	* [Circuit Discovery](./docs/circuit_discovery.md)
	* [Causal Intervention](./docs/activation_patching.md)
	* [SAEs and Steering](./docs/sae_steering.md)
	* [Safety Auditing & Steering](./docs/safety_auditing.md)

	---

	## Foundational Research & References

	This framework implements and builds upon the following foundational methodologies:

	* Decision Transformers: [Chen et al., 2021](https://arxiv.org/abs/2106.01345) — Reinforcement learning as sequence modeling.
	* Transformer Circuits: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html) — Mathematical foundations of mechanistic interpretability.
	* ACDC (Automated Circuit Discovery): [Conmy et al., 2023](https://arxiv.org/abs/2304.14997) — Algorithmic discovery of subgraphs.
	* Sparse Autoencoders (SAEs): [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
	* Activation Steering: [Turner et al., 2023](https://arxiv.org/abs/2308.10248) — Control via residual stream vector additions.
	* Path Patching: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969) — Inter-component causal mediation.

	---

	## Citation

	```bibtex
	@software{dt_circuits2026,
	author = {Sadhumitha S.},
	title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
	year = {2026},
	url = {https://github.com/sadhumitha-s/DT-Circuits}
	}
	```

	---

	## License
	Apache 2.0