revise readme
README.md
CHANGED
|
@@ -1,16 +1,21 @@
|
|
| 1 |
# DT-Circuits: Mechanistic Interpretability for Decision Transformers
|
| 2 |
|
| 3 |
DT-Circuits is a framework for mechanistic interpretability of Decision Transformers (DT). Using TransformerLens, it enables mapping neural circuits, decomposing activations with Sparse Autoencoders (SAEs), and performing causal interventions on agent decision-making.
|
| 4 |
|
| 5 |
-
The goal is to understand how Reward-to-Go, State, and Action tokens are processed within the residual stream, moving beyond
|
| 6 |
|
| 7 |
## Table of Contents
|
| 8 |
- [Core Capabilities](#core-capabilities)
|
| 9 |
- [Technical Architecture](#technical-architecture)
|
| 10 |
-
- [Getting Started](#getting-started)
|
| 11 |
-
- [Project Documentation](#project-documentation)
|
| 12 |
-
- [Testing](#testing)
|
| 13 |
- [Project Structure](#project-structure)
|
| 14 |
|
| 15 |
## Project Documentation
|
| 16 |
Detailed explanations of the mechanistic interpretability techniques used in this project:
|
|
@@ -18,7 +23,7 @@ Detailed explanations of the mechanistic interpretability techniques used in thi
|
|
| 18 |
- [Activation Patching](./docs/activation_patching.md)
|
| 19 |
- [SAEs & Steering](./docs/sae_steering.md)
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
## Core Capabilities
|
| 24 |
|
|
@@ -35,11 +40,13 @@ Detailed explanations of the mechanistic interpretability techniques used in thi
|
|
| 35 |
- **SAE Integration**: Tools to train and deploy SAEs on the residual stream to find monosemantic latents.
|
| 36 |
- **Anomaly Detection**: Uses SAE reconstruction error to detect out-of-distribution (OOD) states.
|
| 37 |
|
| 38 |
-
### 4. Path-Causal
|
| 39 |
- **ACDC (Automated Circuit Discovery)**: Prunes the DT into a minimal sufficient subgraph for specific behaviors.
|
| 40 |
- **Path Patching**: High-fidelity causal tracing between specific internal nodes (e.g., Goal Token → Induction Head → Action Logit).
|
| 41 |
- **Evolutionary Scan**: Analyzes how decision-making circuits form and stabilize across training checkpoints.
|
| 42 |
|
| 43 |
## Technical Architecture
|
| 44 |
|
| 45 |
The platform consists of:
|
|
@@ -48,6 +55,42 @@ The platform consists of:
|
|
| 48 |
- **Interpretability Layer**: Modules for attribution, patching, SAE management, and steering.
|
| 49 |
- **Visualization Layer**: Streamlit dashboard for real-time monitoring and intervention.
|
| 50 |
|
| 51 |
## Getting Started
|
| 52 |
|
| 53 |
### Prerequisites
|
|
@@ -74,44 +117,8 @@ pip install -r requirements.txt
|
|
| 74 |
streamlit run src/dashboard/app.py
|
| 75 |
```
|
| 76 |
|
| 77 |
-
## Testing
|
| 78 |
|
| 79 |
```bash
|
| 80 |
PYTHONPATH=. pytest tests/
|
| 81 |
```
|
| 82 |
-
|
| 83 |
-
## Project Structure
|
| 84 |
-
|
| 85 |
-
```text
|
| 86 |
-
DT-Circuits/
|
| 87 |
-
├── scripts/ # Training and harvesting entry points
|
| 88 |
-
│   ├── train_dt.py # Decision Transformer training pipeline
|
| 89 |
-
│   └── train_sae.py # Sparse Autoencoder (SAE) training script
|
| 90 |
-
├── src/
|
| 91 |
-
│   ├── dashboard/
|
| 92 |
-
│   │   └── app.py # Streamlit-based visualization UI
|
| 93 |
-
│   ├── data/
|
| 94 |
-
│   │   └── harvester.py # PPO-based expert trajectory harvester
|
| 95 |
-
│   ├── interpretability/
|
| 96 |
-
│   │   ├── acdc.py # Automated Circuit Discovery logic
|
| 97 |
-
│   │   ├── attribution.py # Direct Logit Attribution (DLA)
|
| 98 |
-
│   │   ├── evolution.py # Developmental/Evolutionary MI scan
|
| 99 |
-
│   │   ├── induction_scan.py # Induction head detection logic
|
| 100 |
-
│   │   ├── patching.py # Causal activation patching tools
|
| 101 |
-
│   │   ├── path_patching.py # Path-based causal intervention engine
|
| 102 |
-
│   │   ├── sae_manager.py # SAE deployment and anomaly detection
|
| 103 |
-
│   │   └── steering.py # Steering vector generation and injection
|
| 104 |
-
│   ├── models/
|
| 105 |
-
│   │   └── hooked_dt.py # TransformerLens-wrapped Decision Transformer
|
| 106 |
-
│   └── utils/
|
| 107 |
-
├── tests/ # Unit and integration test suite
|
| 108 |
-
│   ├── test_components.py
|
| 109 |
-
│   ├── test_path_causal_microscope.py # Phase 4 Path-Causal tests
|
| 110 |
-
│   └── test_sae_and_steering.py
|
| 111 |
-
├── config.yaml # Experiment and environment configuration
|
| 112 |
-
└── requirements.txt # Environment dependencies
|
| 113 |
-
---
|
| 114 |
-
|
| 115 |
-
## License
|
| 116 |
-
|
| 117 |
-
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
|
|
|
|
| 1 |
# DT-Circuits: Mechanistic Interpretability for Decision Transformers
|
| 2 |
|
| 3 |
+

|
| 4 |
+

|
| 5 |
+
|
| 6 |
DT-Circuits is a framework for mechanistic interpretability of Decision Transformers (DT). Using TransformerLens, it enables mapping neural circuits, decomposing activations with Sparse Autoencoders (SAEs), and performing causal interventions on agent decision-making.
|
| 7 |
|
| 8 |
+
The goal is to understand how Reward-to-Go, State, and Action tokens are processed within the residual stream, moving beyond black-box behavioral evaluation.
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
|
| 12 |
## Table of Contents
|
| 13 |
- [Core Capabilities](#core-capabilities)
|
| 14 |
- [Technical Architecture](#technical-architecture)
|
| 15 |
- [Project Structure](#project-structure)
|
| 16 |
+
- [Getting Started](#getting-started)
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
|
| 20 |
## Project Documentation
|
| 21 |
Detailed explanations of the mechanistic interpretability techniques used in this project:
|
|
|
|
| 23 |
- [Activation Patching](./docs/activation_patching.md)
|
| 24 |
- [SAEs & Steering](./docs/sae_steering.md)
|
| 25 |
|
| 26 |
+
---
|
| 27 |
|
| 28 |
## Core Capabilities
|
| 29 |
|
|
|
|
| 40 |
- **SAE Integration**: Tools to train and deploy SAEs on the residual stream to find monosemantic latents.
|
| 41 |
- **Anomaly Detection**: Uses SAE reconstruction error to detect out-of-distribution (OOD) states.
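
The anomaly-detection idea above can be illustrated with a minimal sketch. It is not the project's `sae_manager.py` implementation: the weights, the calibration activations, the `mean + 3 * std` threshold rule, and all function names here are assumptions chosen for illustration, and plain Python lists stand in for residual-stream tensors.

```python
import statistics

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_reconstruct(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative latents with a ReLU, then decode back."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    latents = [max(0.0, p) for p in pre]
    return [a + b for a, b in zip(matvec(w_dec, latents), b_dec)]

def reconstruction_error(x, recon):
    """Mean squared error between an activation and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)

# Hypothetical toy SAE: 3-dim activations, 2 latents. It can only
# represent activity in the first two dimensions.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
B_DEC = [0.0, 0.0, 0.0]

def score(x):
    """Anomaly score: how badly the SAE reconstructs x."""
    return reconstruction_error(x, sae_reconstruct(x, W_ENC, B_ENC, W_DEC, B_DEC))

# Made-up in-distribution activations used to calibrate a threshold.
CALIBRATION = [
    [1.0, 0.5, 0.01],
    [0.2, 2.0, -0.02],
    [0.7, 0.7, 0.0],
    [1.5, 0.1, 0.015],
]
calib_scores = [score(x) for x in CALIBRATION]
THRESHOLD = statistics.mean(calib_scores) + 3.0 * statistics.pstdev(calib_scores)

def is_anomalous(x):
    """Flag activations whose reconstruction error exceeds the threshold."""
    return score(x) > THRESHOLD

in_dist = [0.9, 0.4, 0.005]      # lies almost entirely in the SAE's subspace
out_of_dist = [0.1, 0.1, 4.0]    # large component the SAE cannot represent
print(is_anomalous(in_dist), is_anomalous(out_of_dist))
```

The threshold rule is only one plausible choice; a percentile of calibration errors would work in the same place.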
|
| 42 |
|
| 43 |
+
### 4. Path-Level Causal Analysis
|
| 44 |
- **ACDC (Automated Circuit Discovery)**: Prunes the DT into a minimal sufficient subgraph for specific behaviors.
|
| 45 |
- **Path Patching**: High-fidelity causal tracing between specific internal nodes (e.g., Goal Token → Induction Head → Action Logit).
|
| 46 |
- **Evolutionary Scan**: Analyzes how decision-making circuits form and stabilize across training checkpoints.
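
As a rough intuition for the causal tools in this section, the sketch below shows plain activation patching on a toy model: cache activations from a clean run, overwrite one component at a time in a corrupted run, and measure how much of the clean output is recovered. This is the simpler building block that path patching and ACDC refine, not either of those methods, and it does not use the project's API. The component names `head_a` and `head_b`, the toy weights, and the recovery metric are all hypothetical.

```python
def run_model(x, patch=None):
    """Toy two-component model.

    `patch` optionally maps a component name to an activation value
    that overwrites what the component would have computed.
    """
    acts = {
        "head_a": 2.0 * x[0],  # reads only input 0 (think: return-to-go)
        "head_b": 0.5 * x[1],  # reads only input 1 (think: state)
    }
    if patch:
        acts.update(patch)
    logit = acts["head_a"] - acts["head_b"]
    return logit, acts

def patching_effects(clean_x, corrupted_x):
    """Fraction of the clean-vs-corrupted logit gap recovered per component."""
    clean_logit, clean_acts = run_model(clean_x)
    corrupted_logit, _ = run_model(corrupted_x)
    gap = clean_logit - corrupted_logit
    effects = {}
    for name, clean_value in clean_acts.items():
        # Re-run on the corrupted input with one clean activation restored.
        patched_logit, _ = run_model(corrupted_x, patch={name: clean_value})
        effects[name] = (patched_logit - corrupted_logit) / gap
    return effects

# The two inputs differ only in input 0, so only head_a should matter.
effects = patching_effects(clean_x=[1.0, 1.0], corrupted_x=[0.0, 1.0])
print(effects)
```

A recovery near 1 marks a component that carries the behavior-relevant signal; a recovery near 0 marks one that does not.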
|
| 47 |
|
| 48 |
+
---
|
| 49 |
+
|
| 50 |
## Technical Architecture
|
| 51 |
|
| 52 |
The platform consists of:
|
|
|
|
| 55 |
- **Interpretability Layer**: Modules for attribution, patching, SAE management, and steering.
|
| 56 |
- **Visualization Layer**: Streamlit dashboard for real-time monitoring and intervention.
|
| 57 |
|
| 58 |
+
---
|
| 59 |
+
|
| 60 |
+
## Project Structure
|
| 61 |
+
|
| 62 |
+
```text
|
| 63 |
+
DT-Circuits/
|
| 64 |
+
├── scripts/ # Training and harvesting entry points
|
| 65 |
+
│   ├── train_dt.py # Decision Transformer training pipeline
|
| 66 |
+
│   └── train_sae.py # Sparse Autoencoder (SAE) training script
|
| 67 |
+
├── src/
|
| 68 |
+
│   ├── dashboard/
|
| 69 |
+
│   │   └── app.py # Streamlit-based visualization UI
|
| 70 |
+
│   ├── data/
|
| 71 |
+
│   │   └── harvester.py # PPO-based expert trajectory harvester
|
| 72 |
+
│   ├── interpretability/
|
| 73 |
+
│   │   ├── acdc.py # Automated Circuit Discovery logic
|
| 74 |
+
│   │   ├── attribution.py # Direct Logit Attribution (DLA)
|
| 75 |
+
│   │   ├── evolution.py # Training Dynamics Analysis
|
| 76 |
+
│   │   ├── induction_scan.py # Induction head detection logic
|
| 77 |
+
│   │   ├── patching.py # Causal activation patching tools
|
| 78 |
+
│   │   ├── path_patching.py # Path-based causal intervention engine
|
| 79 |
+
│   │   ├── sae_manager.py # SAE deployment and anomaly detection
|
| 80 |
+
│   │   └── steering.py # Steering vector generation and injection
|
| 81 |
+
│   ├── models/
|
| 82 |
+
│   │   └── hooked_dt.py # TransformerLens-wrapped Decision Transformer
|
| 83 |
+
│   └── utils/
|
| 84 |
+
├── tests/
|
| 85 |
+
│   ├── test_components.py
|
| 86 |
+
│   ├── test_path_causal_microscope.py
|
| 87 |
+
│   └── test_sae_and_steering.py
|
| 88 |
+
├── config.yaml
|
| 89 |
+
└── requirements.txt
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
## Getting Started
|
| 95 |
|
| 96 |
### Prerequisites
|
|
|
|
| 117 |
streamlit run src/dashboard/app.py
|
| 118 |
```
|
| 119 |
|
| 120 |
+
### Testing
|
| 121 |
|
| 122 |
```bash
|
| 123 |
PYTHONPATH=. pytest tests/
|
| 124 |
```