Commit · 848238a
Parent(s): ef707cc
docs: update readme references and add modular trajectory harvester
- .gitignore +1 -0
- README.md +60 -21
- src/data/__init__.py +0 -0
- src/data/harvester.py +64 -0
.gitignore
CHANGED
@@ -47,6 +47,7 @@ wandb/
 .pytest_cache/
 .coverage
 htmlcov/
+tests/artifacts/
 
 # Streamlit
 .streamlit/
README.md
CHANGED
@@ -8,6 +8,8 @@
 
 DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
 
+**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
+
 ---
 
 ## Table of Contents
@@ -17,6 +19,7 @@ DT-Circuits is a research framework for mechanistic interpretability of Decision
 - [Project Structure](#project-structure)
 - [Installation and Usage](#installation-and-usage)
 - [Documentation](#documentation)
+- [Foundational Research & References](#foundational-research--references)
 - [Citation](#citation)
 - [License](#license)
 
@@ -167,49 +170,72 @@ sae:
 
 ---
 
-## Installation and Usage
+## Execution Modes: Installation and Usage
+
+There are two primary ways to run and interact with the **DT-Circuits** framework depending on your research needs:
+
+---
+
+### Way 1: Interactive Cloud Demo (Hugging Face Spaces)
+
+For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:
+
+* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
+
+> [!NOTE]
+> **Concise Demo Constraints:**
+> * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
+> * **Slices Dataset:** Trajectory datasets are dynamically sliced down to a lightweight demo set under a **10MB limit** (defined in [deploy.sh](file:///Users/sadhumitha/Documents/projects/DT-Circuits/scripts/deploy.sh#L19-L33)) for storage and memory footprint constraints.
+> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled.
+
+---
+
+### Way 2: Clone and Run Locally (Full Pipeline)
+
+For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.
 
-### Setup
+#### Local Environment Setup
+First, clone the repository, set up a virtual environment, and install dependencies:
 ```bash
+git clone https://github.com/sadhumitha-s/DT-Circuits
+cd DT-Circuits
+
 python -m venv venv
 source venv/bin/activate
+
 pip install -r requirements.txt
 ```
 
-###
-
+#### Option 2.1: Simple Workflows via Makefile
+The workspace includes a standardized [Makefile](file:///Users/sadhumitha/Documents/projects/DT-Circuits/Makefile) to orchestrate common research pipelines with single commands:
 
-
-
-
-
-
+```bash
+make setup # Set up local environment & install requirements
+make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
+make dashboard # Run the Streamlit visualization dashboard locally
+```
 
-###
+#### Option 2.2: Granular Control via Bash & Python
+For research flexibility, execute each step of the pipeline manually using granular terminal scripts:
 
-1. **
+1. **Trajectories & Model Training**
+Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
 ```bash
 python scripts/train_dt.py
 ```
 
-2. **SAE Training**
+2. **TopK Sparse Autoencoder (SAE) Training**
+Train sparse autoencoders on target activation layers:
 ```bash
 python scripts/train_sae.py
 ```
 
-3. **
+3. **Interactive Analysis**
+Launch the Streamlit visualization engine locally to run audits with custom weights:
 ```bash
 streamlit run src/dashboard/app.py
 ```
 
-### Alternative: Makefile
-Common tasks can also be executed via `make`:
-```bash
-make setup # Install dependencies
-make train # Run full training pipeline (DT + SAE)
-make dashboard # Launch DT-Explorer
-```
-
 ---
 
 ## Documentation
@@ -222,6 +248,19 @@ Detailed technical documentation for specific modules:
 
 ---
 
+## Foundational Research & References
+
+This framework implements and builds upon the following foundational methodologies:
+
+* **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345) - Reinforcement learning as sequence modeling.
+* **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html) - Mathematical foundations of mechanistic interpretability.
+* **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997) - Algorithmic discovery of subgraphs.
+* **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04096) (TopK SAEs).
+* **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248) - Control via residual stream vector additions.
+* **Path Patching**: [Goldowsky-Dill et al., 2023](https://arxiv.org/abs/2304.05969) - Inter-component causal mediation.
+
+---
+
 ## Citation
 
 ```bibtex
src/data/__init__.py
ADDED
File without changes
src/data/harvester.py
ADDED
@@ -0,0 +1,64 @@
+import os
+import gymnasium as gym
+import torch
+import numpy as np
+from minigrid.wrappers import FlatObsWrapper
+from stable_baselines3 import PPO
+from tqdm import tqdm
+
+class PPOHarvester:
+    """
+    Utility to run a 'Teacher' PPO agent to collect high-quality state-action-reward triplets.
+    """
+    def __init__(self, env_id="MiniGrid-Empty-8x8-v0", model_path=None):
+        self.env_id = env_id
+        self.env = FlatObsWrapper(gym.make(env_id, render_mode="rgb_array"))
+        if model_path and os.path.exists(model_path):
+            self.model = PPO.load(model_path, env=self.env)
+        else:
+            print(f"No model found at {model_path}. Training a new one for collection...")
+            self.model = PPO("MlpPolicy", self.env, verbose=1)
+            self.model.learn(total_timesteps=20000)
+            if model_path:
+                self.model.save(model_path)
+
+    def collect_trajectories(self, num_episodes=100):
+        trajectories = []
+        for i in tqdm(range(num_episodes), desc="Collecting trajectories"):
+            obs, _ = self.env.reset(seed=42 + i)
+            done = False
+            truncated = False
+            episode = {
+                "observations": [],
+                "actions": [],
+                "rewards": [],
+                "dones": []
+            }
+            while not (done or truncated):
+                action, _states = self.model.predict(obs, deterministic=False)
+                next_obs, reward, done, truncated, info = self.env.step(action)
+
+                episode["observations"].append(obs)
+                episode["actions"].append(action)
+                episode["rewards"].append(reward)
+                episode["dones"].append(done)
+
+                obs = next_obs
+
+            # Convert to numpy arrays
+            for key in episode:
+                episode[key] = np.array(episode[key])
+
+            trajectories.append(episode)
+
+        return trajectories
+
+    def save_trajectories(self, trajectories, file_path):
+        os.makedirs(os.path.dirname(file_path), exist_ok=True)
+        torch.save(trajectories, file_path)
+        print(f"Saved {len(trajectories)} trajectories to {file_path}")
+
+if __name__ == "__main__":
+    harvester = PPOHarvester(model_path="ppo_minigrid_teacher.zip")
+    trajs = harvester.collect_trajectories(num_episodes=50)
+    harvester.save_trajectories(trajs, "data/trajectories.pt")
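Note on downstream use: a Decision Transformer is conditioned on returns-to-go rather than on raw per-step rewards (Chen et al., 2021), while the episodes produced by `collect_trajectories` store only `rewards`. The following is a minimal sketch of that conversion, assuming the episode dict layout shown in `harvester.py`. The helper name `returns_to_go`, its `gamma` parameter, and the sample episode are hypothetical and not part of this commit; the repository's training script may compute this differently.

```python
import numpy as np


def returns_to_go(rewards, gamma=1.0):
    """Return R_t = sum over k >= t of gamma**(k - t) * r_k, for every step t.

    Hypothetical helper for illustration; not part of the commit.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    rtg = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already computed for t + 1.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Made-up episode shaped like one entry of the harvested trajectory list.
episode = {"rewards": np.array([0.0, 0.0, 1.0])}
episode["returns_to_go"] = returns_to_go(episode["rewards"])
print(episode["returns_to_go"])
```

With `gamma=1.0` this yields the undiscounted return-to-go used in the original Decision Transformer setup; a smaller `gamma` is exposed here only as an optional variation.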