sadhumitha-s committed
Commit 663f50c · unverified · 1 Parent(s): f03abd9

revise readme

Files changed (1)
  1. README.md +50 -43
README.md CHANGED
@@ -1,16 +1,21 @@
1
  # DT-Circuits: Mechanistic Interpretability for Decision Transformers
2
 
 
 
 
3
  DT-Circuits is a framework for mechanistic interpretability of Decision Transformers (DT). Using TransformerLens, it enables mapping neural circuits, decomposing activations with Sparse Autoencoders (SAEs), and performing causal interventions on agent decision-making.
4
 
5
- The goal is to understand how Reward-to-Go, State, and Action tokens are processed within the residual stream, moving beyond basic behavioral observation.
 
 
6
 
7
  ## Table of Contents
8
  - [Core Capabilities](#core-capabilities)
9
  - [Technical Architecture](#technical-architecture)
10
- - [Getting Started](#getting-started)
11
- - [Project Documentation](#project-documentation)
12
- - [Testing](#testing)
13
  - [Project Structure](#project-structure)
 
 
 
14
 
15
  ## Project Documentation
16
  Detailed explanations of the mechanistic interpretability techniques used in this project:
@@ -18,7 +23,7 @@ Detailed explanations of the mechanistic interpretability techniques used in thi
18
  - [Activation Patching](./docs/activation_patching.md)
19
  - [SAEs & Steering](./docs/sae_steering.md)
20
 
21
-
22
 
23
  ## Core Capabilities
24
 
@@ -35,11 +40,13 @@ Detailed explanations of the mechanistic interpretability techniques used in thi
35
  - **SAE Integration**: Tools to train and deploy SAEs on the residual stream to find monosemantic latents.
36
  - **Anomaly Detection**: Uses SAE reconstruction error to detect out-of-distribution (OOD) states.
37
 
38
- ### 4. Path-Causal Microscope
39
  - **ACDC (Automated Circuit Discovery)**: Prunes the DT into a minimal sufficient subgraph for specific behaviors.
40
  - **Path Patching**: High-fidelity causal tracing between specific internal nodes (e.g., Goal Token → Induction Head → Action Logit).
41
  - **Evolutionary Scan**: Analyzes how decision-making circuits form and stabilize across training checkpoints.
42
 
 
 
43
  ## Technical Architecture
44
 
45
  The platform consists of:
@@ -48,6 +55,42 @@ The platform consists of:
48
  - **Interpretability Layer**: Modules for attribution, patching, SAE management, and steering.
49
  - **Visualization Layer**: Streamlit dashboard for real-time monitoring and intervention.
50
 
51
  ## Getting Started
52
 
53
  ### Prerequisites
@@ -74,44 +117,8 @@ pip install -r requirements.txt
74
  streamlit run src/dashboard/app.py
75
  ```
76
 
77
- ## Testing
78
 
79
  ```bash
80
  PYTHONPATH=. pytest tests/
81
  ```
82
-
83
- ## Project Structure
84
-
85
- ```text
86
- DT-Circuits/
87
- ├── scripts/ # Training and harvesting entry points
88
- │   ├── train_dt.py # Decision Transformer training pipeline
89
- │   └── train_sae.py # Sparse Autoencoder (SAE) training script
90
- ├── src/
91
- │   ├── dashboard/
92
- │   │   └── app.py # Streamlit-based visualization UI
93
- │   ├── data/
94
- │   │   └── harvester.py # PPO-based expert trajectory harvester
95
- │   ├── interpretability/
96
- │   │   ├── acdc.py # Automated Circuit Discovery logic
97
- │   │   ├── attribution.py # Direct Logit Attribution (DLA)
98
- │   │   ├── evolution.py # Developmental/Evolutionary MI scan
99
- │   │   ├── induction_scan.py # Induction head detection logic
100
- │   │   ├── patching.py # Causal activation patching tools
101
- │   │   ├── path_patching.py # Path-based causal intervention engine
102
- │   │   ├── sae_manager.py # SAE deployment and anomaly detection
103
- │   │   └── steering.py # Steering vector generation and injection
104
- │   ├── models/
105
- │   │   └── hooked_dt.py # TransformerLens-wrapped Decision Transformer
106
- │   └── utils/
107
- ├── tests/ # Unit and integration test suite
108
- │   ├── test_components.py
109
- │   ├── test_path_causal_microscope.py # Phase 4 Path-Causal tests
110
- │   └── test_sae_and_steering.py
111
- ├── config.yaml # Experiment and environment configuration
112
- └── requirements.txt # Environment dependencies
113
- ---
114
-
115
- ## License
116
-
117
- This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
 
1
  # DT-Circuits: Mechanistic Interpretability for Decision Transformers
2
 
3
+ ![Python](https://img.shields.io/badge/python-3.9+-blue)
4
+ ![PyTorch](https://img.shields.io/badge/PyTorch-2.x-red)
5
+
6
  DT-Circuits is a framework for mechanistic interpretability of Decision Transformers (DT). Using TransformerLens, it enables mapping neural circuits, decomposing activations with Sparse Autoencoders (SAEs), and performing causal interventions on agent decision-making.
7
 
8
+ The goal is to understand how Reward-to-Go, State, and Action tokens are processed within the residual stream, moving beyond black-box behavioral evaluation.
9
+
10
+ ---
11
 
12
  ## Table of Contents
13
  - [Core Capabilities](#core-capabilities)
14
  - [Technical Architecture](#technical-architecture)
 
 
 
15
  - [Project Structure](#project-structure)
16
+ - [Getting Started](#getting-started)
17
+
18
+ ---
19
 
20
  ## Project Documentation
21
  Detailed explanations of the mechanistic interpretability techniques used in this project:
 
23
  - [Activation Patching](./docs/activation_patching.md)
24
  - [SAEs & Steering](./docs/sae_steering.md)
25
 
26
+ ---
27
 
28
  ## Core Capabilities
29
 
 
40
  - **SAE Integration**: Tools to train and deploy SAEs on the residual stream to find monosemantic latents.
41
  - **Anomaly Detection**: Uses SAE reconstruction error to detect out-of-distribution (OOD) states.
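
As a rough illustration of the anomaly-detection bullet above, the sketch below scores a state by how poorly a sparse autoencoder reconstructs its activation vector. It is a minimal pure-Python sketch and not the code in `sae_manager.py`: the names `TinySAE`, `fit_threshold`, and `is_ood`, the hand-set identity weights, and the quantile-based threshold are all assumptions made for illustration.

```python
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinySAE:
    """Hypothetical one-layer sparse autoencoder with a ReLU encoder."""

    def __init__(self, w_enc, b_enc, w_dec, b_dec):
        self.w_enc, self.b_enc = w_enc, b_enc
        self.w_dec, self.b_dec = w_dec, b_dec

    def encode(self, x: Sequence[float]) -> List[float]:
        # Latents are non-negative because of the ReLU.
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, latents: Sequence[float]) -> List[float]:
        return [r + b for r, b in zip(matvec(self.w_dec, latents), self.b_dec)]

    def reconstruction_error(self, x: Sequence[float]) -> float:
        # Squared L2 distance between the input and its reconstruction.
        x_hat = self.decode(self.encode(x))
        return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def fit_threshold(errors: Sequence[float], quantile: float = 0.99) -> float:
    """Pick a threshold as an empirical quantile of in-distribution errors."""
    ordered = sorted(errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_ood(sae: TinySAE, x: Sequence[float], threshold: float) -> bool:
    """Flag an activation whose reconstruction error exceeds the threshold."""
    return sae.reconstruction_error(x) > threshold


# Hand-set weights (an assumption): with identity matrices the SAE
# reconstructs non-negative inputs exactly and clips negative ones.
identity = [[1.0, 0.0], [0.0, 1.0]]
sae = TinySAE(identity, [0.0, 0.0], identity, [0.0, 0.0])

in_distribution = [[1.0, 2.0], [0.5, 0.1], [3.0, 0.0]]
threshold = fit_threshold([sae.reconstruction_error(x) for x in in_distribution])

print(is_ood(sae, [1.0, 1.0], threshold))   # reconstructed exactly
print(is_ood(sae, [-2.0, 1.0], threshold))  # negative component is clipped
```

With a trained SAE the threshold would be fitted on activations collected from in-distribution trajectories; the quantile is a tuning choice, not something the README specifies.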
42
 
43
+ ### 4. Path-Level Causal Analysis
44
  - **ACDC (Automated Circuit Discovery)**: Prunes the DT into a minimal sufficient subgraph for specific behaviors.
45
  - **Path Patching**: High-fidelity causal tracing between specific internal nodes (e.g., Goal Token → Induction Head → Action Logit).
46
  - **Evolutionary Scan**: Analyzes how decision-making circuits form and stabilize across training checkpoints.
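
The causal-tracing idea behind the bullets above can be sketched with plain activation patching, the simpler building block that path patching refines by restricting the intervention to a specific route. The toy model, `run_toy_model`, and `patching_effects` below are hypothetical and are not taken from `patching.py` or `path_patching.py`; they only show the clean run, corrupted run, and patched run pattern.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def run_toy_model(
    x: Sequence[float], patch: Optional[Dict[int, float]] = None
) -> Tuple[float, List[float]]:
    """Toy two-unit hidden layer with a summing readout.

    `patch` maps a hidden-unit index to a value that overwrites it,
    standing in for a hook on an internal activation.
    """
    hidden = [x[0] + x[1], x[0] - x[1]]
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = hidden[0] + hidden[1]
    return output, hidden


def patching_effects(
    clean_x: Sequence[float], corrupted_x: Sequence[float]
) -> Dict[int, float]:
    """Fraction of the clean-vs-corrupted output gap restored per hidden unit."""
    clean_out, clean_hidden = run_toy_model(clean_x)
    corrupted_out, _ = run_toy_model(corrupted_x)
    gap = clean_out - corrupted_out
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")

    effects = {}
    for index, clean_value in enumerate(clean_hidden):
        # Re-run the corrupted input with one activation copied from the clean run.
        patched_out, _ = run_toy_model(corrupted_x, patch={index: clean_value})
        effects[index] = (patched_out - corrupted_out) / gap
    return effects


effects = patching_effects(clean_x=[2.0, 1.0], corrupted_x=[0.0, 0.0])
print(effects)
```

A unit whose patch restores most of the gap is a candidate member of the circuit for that behavior. In a real Decision Transformer the overwrite would be applied through model hooks rather than a dictionary argument.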
47
 
48
+ ---
49
+
50
  ## Technical Architecture
51
 
52
  The platform consists of:
 
55
  - **Interpretability Layer**: Modules for attribution, patching, SAE management, and steering.
56
  - **Visualization Layer**: Streamlit dashboard for real-time monitoring and intervention.
57
 
58
+ ---
59
+
60
+ ## Project Structure
61
+
62
+ ```text
63
+ DT-Circuits/
64
+ ├── scripts/ # Training and harvesting entry points
65
+ │   ├── train_dt.py # Decision Transformer training pipeline
66
+ │   └── train_sae.py # Sparse Autoencoder (SAE) training script
67
+ ├── src/
68
+ │   ├── dashboard/
69
+ │   │   └── app.py # Streamlit-based visualization UI
70
+ │   ├── data/
71
+ │   │   └── harvester.py # PPO-based expert trajectory harvester
72
+ │   ├── interpretability/
73
+ │   │   ├── acdc.py # Automated Circuit Discovery logic
74
+ │   │   ├── attribution.py # Direct Logit Attribution (DLA)
75
+ │   │   ├── evolution.py # Training Dynamics Analysis
76
+ │   │   ├── induction_scan.py # Induction head detection logic
77
+ │   │   ├── patching.py # Causal activation patching tools
78
+ │   │   ├── path_patching.py # Path-based causal intervention engine
79
+ │   │   ├── sae_manager.py # SAE deployment and anomaly detection
80
+ │   │   └── steering.py # Steering vector generation and injection
81
+ │   ├── models/
82
+ │   │   └── hooked_dt.py # TransformerLens-wrapped Decision Transformer
83
+ │   └── utils/
84
+ ├── tests/
85
+ │   ├── test_components.py
86
+ │   ├── test_path_causal_microscope.py
87
+ │   └── test_sae_and_steering.py
88
+ ├── config.yaml
89
+ └── requirements.txt
90
+ ```
91
+
92
+ ---
93
+
94
  ## Getting Started
95
 
96
  ### Prerequisites
 
117
  streamlit run src/dashboard/app.py
118
  ```
119
 
120
+ ### Testing
121
 
122
  ```bash
123
  PYTHONPATH=. pytest tests/
124
  ```