sadhumitha-s/DT-Explorer

sadhumitha-s committed 6 days ago

Commit b7ddfc6 (1 parent: 14d2c06)

refactor: implement centralized configuration, upgrade SAE training to multi-layer TopK, and optimize dashboard attribution UX

Files changed (11)

.gitignore +2 -0
Makefile +24 -0
README.md +147 -76
config.yaml +1 -0
requirements.txt +1 -0
scripts/train_dt.py +15 -11
scripts/train_sae.py +102 -24
src/config.py +73 -0
src/dashboard/app.py +19 -19
src/models/hooked_dt.py +3 -3
tests/test_mechanistic.py +63 -0

.gitignore CHANGED

@@ -56,3 +56,5 @@ static/
 /PRD.md
 artifacts/
+scratch/
+*.log

Makefile ADDED

	@@ -0,0 +1,24 @@

+.PHONY: setup train dashboard test clean
+# Setup environment
+setup:
+	pip install -r requirements.txt
+# Run full pipeline: Harvesting -> DT Training -> SAE Training
+train:
+	python3 scripts/train_dt.py
+	python3 scripts/train_sae.py
+# Launch the explorer dashboard
+dashboard:
+	streamlit run src/dashboard/app.py
+# Run unit tests
+test:
+	PYTHONPATH=. pytest tests/
+# Remove artifacts and cached files
+clean:
+	rm -rf data/*.pt models/*.pt artifacts/saes/*.pt
+	find . -type d -name "__pycache__" -exec rm -rf {} +
+	find . -type d -name ".pytest_cache" -exec rm -rf {} +

README.md CHANGED

@@ -1,66 +1,86 @@
 # DT-Circuits: Mechanistic Interpretability for Decision Transformers
-![Python](https://img.shields.io/badge/python-3.9+-blue)
-![PyTorch](https://img.shields.io/badge/PyTorch-2.x-red)
 DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
 ---
-## Motivation
-Mechanistic interpretability has primarily focused on language models, while reinforcement learning agents remain comparatively underexplored.
-Decision Transformers provide a uniquely analyzable architecture because trajectories, rewards, and actions are represented in a unified autoregressive sequence.
-DT-Circuits aims to make RL agents inspectable at the circuit level rather than only through behavioral evaluation.
----
 ## Table of Contents
-- [Features](#features)
-- [Technical Architecture](#technical-architecture)
 - [Project Structure](#project-structure)
-- [Getting Started](#getting-started)
----
-## Documentation
-- [Circuit Discovery](./docs/circuit_discovery.md)
-- [Activation Patching](./docs/activation_patching.md)
-- [SAEs & Steering](./docs/sae_steering.md)
----
-## Features
-### 1. Neural Mapping
-- **Hooked-DT**: Access any internal activation or weight during the agent's run.
-- **Logit Attribution**: See which attention heads or MLP layers drive specific actions.
-- **Induction Scan**: Find heads that recognize temporal patterns and past states.
-### 2. Testing Causality
-- **Activation Patching**: Swap internal states to see what actually changes the agent's move.
-- **Behavior Steering**: Add vectors to activations to push the agent toward specific goals without retraining.
-### 3. Concept Discovery
-- **TopK SAEs**: Decompose complex activations into a few active "concepts" for easier reading.
-- **Auto-Labeling (NLA)**: Use an LLM to automatically describe what each discovered neuron feature does.
-- **Cross-Model Probes**: Check if different agents (like DQNs) learn the same internal concepts as the DT.
-### 4. Circuit Analysis
-- **ACDC**: Automatically strip the model down to the minimal circuit needed for a task.
-- **Path Patching**: Trace how a signal flows from a specific input token to the final action.
-- **Evolutionary Scan**: Watch how decision-making circuits form during training.
----
-## Technical Architecture
-- **Data**: Collects expert paths using a PPO harvester.
-- **Model**: Custom Decision Transformer compatible with TransformerLens.
-- **Tools**: Dedicated modules for attribution, patching, SAEs, and steering.
-- **Dashboard**: Streamlit UI for real-time model analysis.
 ---
@@ -68,10 +88,7 @@ DT-Circuits aims to make RL agents inspectable at the circuit level rather than
 ```text
 DT-Circuits/
-├── scripts/
-│   ├── train_dt.py         # Decision Transformer training pipeline
-│   └── train_sae.py        # Sparse Autoencoder (SAE) training script
-├── src/
 │   ├── dashboard/
 │   │   └── app.py          # Streamlit-based visualization UI
 │   ├── data/
@@ -89,44 +106,58 @@ DT-Circuits/
 │   │   └── universality.py # Cross-architecture feature mapping
 │   ├── models/
 │   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
 │   └── utils/
 ├── tests/                  # Unit tests for all modules
-├── config.yaml
-└── requirements.txt
 ```
----
-## Getting Started
-### Prerequisites
-- Python 3.9+
-- PyTorch 2.x
-- TransformerLens
-- SAE-Lens
-### Quick Start
-Follow these steps to initialize the environment and verify the installation.
-1. **Environment Setup**
-   ```bash
-   python -m venv venv
-   source venv/bin/activate
-   pip install -r requirements.txt
-   ```
-2. **Verification**
-   Run the component tests to ensure all dependencies and hooks are correctly configured.
-   ```bash
-   PYTHONPATH=. pytest tests/test_components.py
-   ```
-3. **Dashboard Execution**
-   Launch the `DT-Explorer` dashboard. The dashboard will initialize with a random model if no trained weights are detected.
-   ```bash
-   streamlit run src/dashboard/app.py
-   ```
 ### Workflow
@@ -135,7 +166,47 @@ Follow these steps to initialize the environment and verify the installation.
    python scripts/train_dt.py
    ```
-2. **Interpretability Analysis**
    ```bash
    streamlit run src/dashboard/app.py
    ```

 # DT-Circuits: Mechanistic Interpretability for Decision Transformers
+[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
+[![PyTorch 2.x](https://img.shields.io/badge/PyTorch-2.x-red.svg)](https://pytorch.org/)
+[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Framework: TransformerLens](https://img.shields.io/badge/Framework-TransformerLens-orange.svg)](https://github.com/TransformerLensOrg/TransformerLens)
 DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
 ---
 ## Table of Contents
+- [Core Objectives](#core-objectives)
+- [Technical Overview](#technical-overview)
+- [Capabilities](#capabilities)
 - [Project Structure](#project-structure)
+- [Installation and Usage](#installation-and-usage)
+- [Documentation](#documentation)
+- [Citation](#citation)
+- [License](#license)
+---
+## Core Objectives
+1.  **Map Information Flow**: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
+2.  **Causal Verification**: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
+3.  **Feature Decomposition**: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
+4.  **Behavioral Control**: Modify agent decisions at inference time by manipulating internal activations.
+---
+## Technical Overview
+The framework centers around `HookedDT`, a Decision Transformer implementation that allows for activation hooking and cache management.
+### Information Flow Diagram
+```mermaid
+graph TD
+    subgraph Input_Sequence
+        S[State Tokens]
+        A[Action Tokens]
+        RTG[Reward-to-Go Tokens]
+    end
+    Input_Sequence --> Embed[Embedding Layers]
+    Embed --> Hooks[Activation Hooks]
+    subgraph Transformer_Block
+        Hooks --> Attn[Multi-Head Attention]
+        Attn --> MLP[MLP Layers]
+        MLP --> Res[Residual Stream]
+    end
+    Res --> DLA[Direct Logit Attribution]
+    Res --> SAE[Sparse Autoencoder]
+    Res --> Output[Action Logits]
+    subgraph Interpretability_Modules
+        DLA -.-> Analysis
+        SAE -.-> Features
+        Intervention[Activation Patching] -.-> Hooks
+    end
+```
+---
+## Capabilities
+### Causal Mediation and Attribution
+*   **Direct Logit Attribution (DLA)**: Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
+*   **Activation Patching**: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
+*   **Path Patching**: Traces how information flows through specific connections between model components.
+### Feature Discovery and Analysis
+*   **Sparse Autoencoders (SAEs)**: Decompose the residual stream into a set of sparse features, helping to resolve polysemanticity.
+*   **Induction Scanning**: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
+*   **Automated Circuit Discovery (ACDC)**: Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
+### Behavioral Steering
+*   **Activation Steering**: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
+*   **Safety Auditing**: Monitors SAE reconstruction error and feature activation to detect anomalous or out-of-distribution internal states.
 ---
 ```text
 DT-Circuits/
+├── src/
 │   ├── dashboard/
 │   │   └── app.py          # Streamlit-based visualization UI
 │   ├── data/
 │   │   └── universality.py # Cross-architecture feature mapping
 │   ├── models/
 │   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
+│   ├── config.py           # Centralized hyperparameter management
 │   └── utils/
 ├── tests/                  # Unit tests for all modules
+├── config.yaml             # External hyperparameter storage
+├── requirements.txt
+└── docs/
 ```
+---
+## Configuration
+Hyperparameters are managed in two layers, for ease of use and research reproducibility:
+1.  **`config.yaml`**: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
+2.  **`src/config.py`**: Defines the underlying structure using Python dataclasses. It automatically loads overrides from `config.yaml` at runtime.
+### Key Configuration Sections
+| Section | Description | Key Parameters |
+| :--- | :--- | :--- |
+| **`model`** | Architecture settings for the Decision Transformer | `n_layers`, `d_model`, `n_heads`, `max_length` |
+| **`data`** | Settings for expert trajectory collection | `env_id`, `num_episodes` (for DT training) |
+| **`train`** | DT training hyperparameters | `lr`, `epochs`, `seed` |
+| **`sae`** | Sparse Autoencoder training hyperparameters | `expansion_factor`, `k`, `num_episodes` (SAE specific) |
+**Example: Independent Data Control**
+You can control the amount of data used for general training vs. interpretability separately:
+```yaml
+data:
+  num_episodes: 1000  # Episodes collected from the PPO teacher to train the DT
+sae:
+  num_episodes: 500   # Episodes for extracting SAE activations
+```
+---
+## Installation and Usage
+### Setup
+```bash
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+### Dashboard Execution
+Launch the `DT-Explorer` dashboard. The dashboard will initialize with a random model if no trained weights are detected.
+```bash
+streamlit run src/dashboard/app.py
+```
 ### Workflow
    python scripts/train_dt.py
    ```
+2. **SAE Training**
+   ```bash
+   python scripts/train_sae.py
+   ```
+3. **Interpretability Analysis**
    ```bash
    streamlit run src/dashboard/app.py
    ```
+### Alternative: Makefile
+Common tasks can also be executed via `make`:
+```bash
+make setup      # Install dependencies
+make train      # Run full training pipeline (DT + SAE)
+make dashboard  # Launch DT-Explorer
+```
+---
+## Documentation
+Detailed technical documentation for specific modules:
+*   [Circuit Discovery](./docs/circuit_discovery.md)
+*   [Causal Intervention](./docs/activation_patching.md)
+*   [SAEs and Steering](./docs/sae_steering.md)
+---
+## Citation
+```bibtex
+@software{dt_circuits2026,
+  author = {Sadhumitha S.},
+  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
+  year = {2026},
+  url = {https://github.com/sadhumitha-s/DT-Circuits}
+}
+```
+---
+## License
+Apache 2.0

config.yaml CHANGED

@@ -16,3 +16,4 @@ interpretability:
 sae:
   expansion_factor: 8
   l1_coeff: 0.0005
+  num_episodes: 100

requirements.txt CHANGED

@@ -14,3 +14,4 @@ pytest
 stable-baselines3
 shimmy
 seaborn
+torchvision

scripts/train_dt.py CHANGED

@@ -14,39 +14,43 @@ if root_path not in sys.path:
 from src.models.hooked_dt import HookedDT
 from src.data.harvester import PPOHarvester
 def train():
     """Main training loop for Decision Transformer."""
     # Step 1: Collect data from expert PPO teacher
-    harvester = PPOHarvester(model_path="models/ppo_teacher.zip")
-    trajectories = harvester.collect_trajectories(num_episodes=100)
     # Save trajectories for the dashboard to use later
     harvester.save_trajectories(trajectories, "data/trajectories.pt")
     state_dim = trajectories[0]["observations"].shape[1]
-    action_dim = 7 # MiniGrid standard actions
     model = HookedDT.from_config(
         state_dim=state_dim,
         action_dim=action_dim,
-        n_layers=2,
-        n_heads=4,
-        d_model=128
     )
-    optimizer = optim.AdamW(model.parameters(), lr=1e-4)
     criterion = nn.CrossEntropyLoss()
     # Step 2: Train the DT
     model.train()
-    for epoch in range(10):
         total_loss = 0
         for traj in tqdm(trajectories, desc=f"Epoch {epoch}"):
-            states = torch.from_numpy(traj["observations"]).float().unsqueeze(0)
-            actions = torch.from_numpy(traj["actions"]).long()
             actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=action_dim).float().unsqueeze(0)
-            returns = torch.from_numpy(traj["rewards"]).float().unsqueeze(0).unsqueeze(-1)
             # Predict actions based on State tokens
             action_preds = model(states, actions_one_hot, returns)

 from src.models.hooked_dt import HookedDT
 from src.data.harvester import PPOHarvester
+from src.config import cfg
 def train():
     """Main training loop for Decision Transformer."""
     # Step 1: Collect data from expert PPO teacher
+    harvester = PPOHarvester(env_id=cfg.data.env_id, model_path="models/ppo_teacher.zip")
+    trajectories = harvester.collect_trajectories(num_episodes=cfg.data.num_episodes)
     # Save trajectories for the dashboard to use later
     harvester.save_trajectories(trajectories, "data/trajectories.pt")
     state_dim = trajectories[0]["observations"].shape[1]
+    action_dim = cfg.model.action_dim
     model = HookedDT.from_config(
         state_dim=state_dim,
         action_dim=action_dim,
+        n_layers=cfg.model.n_layers,
+        n_heads=cfg.model.n_heads,
+        d_model=cfg.model.d_model,
+        max_length=cfg.model.max_length
     )
+    optimizer = optim.AdamW(model.parameters(), lr=cfg.train.lr)
     criterion = nn.CrossEntropyLoss()
     # Step 2: Train the DT
     model.train()
+    for epoch in range(cfg.train.epochs):
         total_loss = 0
         for traj in tqdm(trajectories, desc=f"Epoch {epoch}"):
+            # Truncate to match model max_length
+            max_len = model.max_length
+            states = torch.from_numpy(traj["observations"]).float().unsqueeze(0)[:, -max_len:]
+            actions = torch.from_numpy(traj["actions"]).long()[-max_len:]
             actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=action_dim).float().unsqueeze(0)
+            returns = torch.from_numpy(traj["rewards"]).float().unsqueeze(0).unsqueeze(-1)[:, -max_len:]
             # Predict actions based on State tokens
             action_preds = model(states, actions_one_hot, returns)
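
The training loop above passes the raw per-step rewards to the model as `returns`. Decision Transformers are usually conditioned on the return-to-go, that is, the sum of rewards from the current step to the end of the episode. Whether `HookedDT` expects one or the other is not visible in this diff, so the following is only a sketch of how a return-to-go sequence could be derived and then windowed like the `[-max_len:]` slices in the script. `returns_to_go` and `last_window` are hypothetical helpers, not part of the repository.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return R_t = r_t + gamma * R_{t+1} for every timestep t."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def last_window(seq: list, max_len: int) -> list:
    """Keep only the final max_len items (max_len is assumed to be positive)."""
    return seq[-max_len:]


rewards = [0.0, 0.0, 1.0, 0.0, 0.5]
rtg = returns_to_go(rewards)
print(rtg)                  # [1.5, 1.5, 1.5, 0.5, 0.5]
print(last_window(rtg, 3))  # [1.5, 0.5, 0.5]
```

Computing the return-to-go before truncating matters: slicing the rewards first would drop the reward mass that lies outside the window.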

scripts/train_sae.py CHANGED

@@ -1,34 +1,112 @@
 import torch
-from sae_lens import SAEConfig, SAE
 from src.models.hooked_dt import HookedDT
 def train_sae():
-    # Load DT
-    state_dim = 2739
-    action_dim = 7
-    model = HookedDT.from_config(state_dim, action_dim)
-    # model.load_state_dict(torch.load("models/mini_dt.pt"))
-    # Configure SAE
-    cfg = SAEConfig(
-        d_in=128, # d_model
-        d_sae=128 * 8, # Expansion factor
-        hook_point="blocks.0.hook_resid_post",
-        hook_point_layer=0,
-        architecture="standard",
-        activation_fn="relu",
-        expansion_factor=8,
-        l1_coefficient=5e-4,
-        lr=3e-4,
-        train_batch_size=4096,
-        context_size=30, # Sequence length
     )
-    sae = SAE(cfg)
-    # Training logic would go here, using activations from the DT
-    print("SAE Configured for DT-Explorer.")
-    print(f"Hooking into: {cfg.hook_point}")
 if __name__ == "__main__":
     train_sae()

+import sys
+from pathlib import Path
 import torch
+from sae_lens import TopKSAEConfig, TopKSAE
+# Add project root to path for absolute imports
+root_path = str(Path(__file__).resolve().parent.parent)
+if root_path not in sys.path:
+    sys.path.append(root_path)
+import random
+import numpy as np
 from src.models.hooked_dt import HookedDT
+from src.interpretability.sae_manager import SAEManager
+from src.config import cfg
+def set_seed(seed: int = 42):
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
 def train_sae():
+    # 0. Set seed for reproducibility
+    set_seed(cfg.train.seed)
+    # 1. Load Trajectories to get dimensions
+    traj_path = "data/trajectories.pt"
+    if not Path(traj_path).exists():
+        print(f"Error: {traj_path} not found. Please run scripts/train_dt.py first.")
+        return
+    trajectories = torch.load(traj_path, weights_only=False)
+    print(f"Loaded {len(trajectories)} trajectories.")
+    # 2. Initialize Model
+    state_dim = trajectories[0]["observations"].shape[1]
+    action_dim = cfg.model.action_dim
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    model = HookedDT.from_config(
+        state_dim=state_dim,
+        action_dim=action_dim,
+        n_layers=cfg.model.n_layers,
+        n_heads=cfg.model.n_heads,
+        d_model=cfg.model.d_model,
+        max_length=cfg.model.max_length
     )
+    model.to(device)
+    # Check for trained DT checkpoint
+    checkpoint_path = "models/mini_dt.pt"
+    if Path(checkpoint_path).exists():
+        model.load_state_dict(torch.load(checkpoint_path, map_location=device))
+        print(f"Loaded DT weights from {checkpoint_path}")
+    else:
+        print(f"Warning: {checkpoint_path} not found. Training SAE on random weights.")
+    # 3. & 4. Train SAEs for ALL layers
+    manager = SAEManager(model, sae_dir="artifacts/saes")
+    for layer in range(model.cfg.n_layers):
+        hook_point = f"blocks.{layer}.hook_resid_post"
+        all_activations = []
+        print(f"\n--- Processing Layer {layer} ({hook_point}) ---")
+        # Extract Activations
+        model.eval()
+        print("Extracting activations...")
+        # Number of trajectories from config
+        num_trajs_to_use = min(len(trajectories), cfg.sae.num_episodes)
+        # Truncate to the model's context window, as scripts/train_dt.py does
+        max_len = model.max_length
+        with torch.no_grad():
+            for traj in trajectories[:num_trajs_to_use]:
+                states = torch.from_numpy(traj["observations"]).float().to(device).unsqueeze(0)[:, -max_len:]
+                actions = torch.from_numpy(traj["actions"]).long().to(device)[-max_len:]
+                actions_one_hot = torch.nn.functional.one_hot(actions, num_classes=action_dim).float().unsqueeze(0)
+                returns = torch.from_numpy(traj["rewards"]).float().to(device).unsqueeze(0).unsqueeze(-1)[:, -max_len:]
+                _, cache = model(states, actions_one_hot, returns, return_cache=True)
+                all_activations.append(cache[hook_point].squeeze(0).cpu())
+        activations = torch.cat(all_activations, dim=0)
+        print(f"Collected {activations.shape[0]} activation vectors.")
+        # Setup and Train
+        print("Starting TopK SAE training...")
+        manager.setup_sae(
+            hook_point=hook_point,
+            d_model=cfg.model.d_model,
+            architecture="topk",
+            k=cfg.sae.k
+        )
+        manager.train_on_trajectories(
+            hook_point=hook_point,
+            activations=activations,
+            epochs=cfg.sae.epochs,
+            batch_size=cfg.sae.batch_size
+        )
+    # Save all SAEs once training is complete for all layers
+    manager.save_all_saes()
+    print(f"\nSAE Training Complete for all {model.cfg.n_layers} layers. Results saved to artifacts/saes/")
 if __name__ == "__main__":
     train_sae()
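
"TopK" in this script refers to sparse autoencoders that enforce sparsity by keeping only the `k` largest latent pre-activations for each input and zeroing the rest, rather than relying on an L1 penalty alone. The sketch below illustrates that selection step in plain Python. It is not the SAE-Lens or `SAEManager` implementation, and `topk_activation` is a hypothetical name.

```python
from typing import List


def topk_activation(pre_acts: List[float], k: int) -> List[float]:
    """Keep the k largest entries of pre_acts and set the others to zero."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    # Indices of the k largest values; ties are broken by position.
    keep = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k]
    keep_set = set(keep)
    return [v if i in keep_set else 0.0 for i, v in enumerate(pre_acts)]


latents = [0.1, 2.0, -0.3, 1.5, 0.7]
print(topk_activation(latents, k=2))  # [0.0, 2.0, 0.0, 1.5, 0.0]
```

With the defaults in `src/config.py` (`d_model=128`, `expansion_factor=8`, `k=32`), and assuming the latent width is `d_model * expansion_factor` as in the previous version of this script, at most 32 of 1024 latents would be active per activation vector.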

src/config.py ADDED

	@@ -0,0 +1,73 @@

+from dataclasses import dataclass, field
+from typing import Any, Dict, Optional
+import yaml
+from pathlib import Path
+@dataclass
+class ModelConfig:
+    n_layers: int = 2
+    n_heads: int = 4
+    d_model: int = 128
+    max_length: int = 30
+    state_dim: Optional[int] = None
+    action_dim: int = 7
+@dataclass
+class DataConfig:
+    env_id: str = "MiniGrid-Empty-8x8-v0"
+    num_episodes: int = 1000
+    collection_method: str = "PPO-Teacher"
+@dataclass
+class TrainConfig:
+    lr: float = 1e-4
+    epochs: int = 10
+    seed: int = 42
+@dataclass
+class SAEConfig:
+    expansion_factor: int = 8
+    k: int = 32
+    l1_coeff: float = 0.0005
+    lr: float = 3e-4
+    epochs: int = 5
+    batch_size: int = 1024
+    num_episodes: int = 100
+@dataclass
+class Config:
+    model: ModelConfig = field(default_factory=ModelConfig)
+    data: DataConfig = field(default_factory=DataConfig)
+    train: TrainConfig = field(default_factory=TrainConfig)
+    sae: SAEConfig = field(default_factory=SAEConfig)
+    @classmethod
+    def load_from_yaml(cls, yaml_path: str = "config.yaml") -> "Config":
+        """Loads configuration from a YAML file, overriding defaults."""
+        path = Path(yaml_path)
+        if not path.exists():
+            return cls()
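+        with open(path, "r") as f:
+            # An empty YAML file parses to None; fall back to an empty mapping
+            data = yaml.safe_load(f) or {}
+        # Helper to safely update dataclass from dict
+        def update_dataclass(dc_obj, dc_dict):
+            # A section with no children (e.g. a bare "sae:") also parses to None
+            for key, value in (dc_dict or {}).items():
+                if hasattr(dc_obj, key):
+                    setattr(dc_obj, key, value)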
+        config = cls()
+        if "model" in data:
+            update_dataclass(config.model, data["model"])
+        if "data" in data:
+            update_dataclass(config.data, data["data"])
+        if "train" in data:
+            update_dataclass(config.train, data["train"])
+        if "sae" in data:
+            update_dataclass(config.sae, data["sae"])
+        return config
+# Global config instance for easy access
+cfg = Config.load_from_yaml()
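
The override logic above copies YAML keys onto dataclass defaults and skips any key the dataclass does not define. The sketch below reproduces that pattern on a plain dictionary (standing in for the result of `yaml.safe_load`) so that it runs without PyYAML. `SAESection`, `SketchConfig`, `apply_overrides`, and `build_config` are hypothetical names used only for illustration, and treating a `None` file or section as empty is a choice made in this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SAESection:
    expansion_factor: int = 8
    k: int = 32
    num_episodes: int = 100


@dataclass
class SketchConfig:
    sae: SAESection = field(default_factory=SAESection)


def apply_overrides(section: Any, overrides: Optional[Dict[str, Any]]) -> None:
    """Copy known keys from overrides onto a dataclass instance in place."""
    for key, value in (overrides or {}).items():
        if hasattr(section, key):  # unknown keys are skipped, not rejected
            setattr(section, key, value)


def build_config(raw: Optional[Dict[str, Any]]) -> SketchConfig:
    """Build a config from a parsed-YAML-like mapping (None means empty file)."""
    config = SketchConfig()
    apply_overrides(config.sae, (raw or {}).get("sae"))
    return config


example = build_config({"sae": {"num_episodes": 500, "typo_key": 1}})
print(example.sae.num_episodes, example.sae.k)  # 500 32
print(build_config(None).sae.num_episodes)      # 100
```

One consequence of skipping unknown keys is that a misspelled key in `config.yaml` falls back to the default without any warning.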

src/dashboard/app.py CHANGED

@@ -70,25 +70,25 @@ with tab1:
     st.header("Direct Logit Attribution (DLA)")
     st.write("Visualizing which heads contribute most to the predicted action.")
-    if st.button("Run Attribution"):
-        states = torch.from_numpy(traj["observations"]).float().unsqueeze(0)
-        actions = torch.nn.functional.one_hot(torch.from_numpy(traj["actions"]).long(), num_classes=7).float().unsqueeze(0)
-        returns = torch.from_numpy(traj["rewards"]).float().unsqueeze(0).unsqueeze(-1)
-        preds, cache = model(states, actions, returns, return_cache=True)
-        target_action = preds[0, -1].argmax().item()
-        engine = LogitAttributionEngine(model)
-        # Use token index -2 to target the state token which predicts the action
-        dla_results = engine.calculate_dla(cache, target_logit_index=target_action, token_index=-2)
-        fig, ax = plt.subplots()
-        im = ax.imshow(dla_results.detach().cpu().numpy(), cmap="RdBu_r", aspect='auto')
-        plt.colorbar(im)
-        ax.set_xlabel("Head")
-        ax.set_ylabel("Layer")
-        st.pyplot(fig)
-        st.write(f"Analyzing Attribution for Action: {target_action} (at State token)")
 with tab2:
     st.header("Activation Patching")

     st.header("Direct Logit Attribution (DLA)")
     st.write("Visualizing which heads contribute most to the predicted action.")
+    # Run automatically for better UX when changing trajectories
+    states = torch.from_numpy(traj["observations"]).float().unsqueeze(0)
+    actions = torch.nn.functional.one_hot(torch.from_numpy(traj["actions"]).long(), num_classes=7).float().unsqueeze(0)
+    returns = torch.from_numpy(traj["rewards"]).float().unsqueeze(0).unsqueeze(-1)
+    preds, cache = model(states, actions, returns, return_cache=True)
+    target_action = preds[0, -1].argmax().item()
+    engine = LogitAttributionEngine(model)
+    # Use token index -2 to target the state token which predicts the action
+    dla_results = engine.calculate_dla(cache, target_logit_index=target_action, token_index=-2)
+    fig, ax = plt.subplots()
+    im = ax.imshow(dla_results.detach().cpu().numpy(), cmap="RdBu_r", aspect='auto')
+    plt.colorbar(im)
+    ax.set_xlabel("Head")
+    ax.set_ylabel("Layer")
+    st.pyplot(fig)
+    st.write(f"Analyzing Attribution for Action: {target_action} (at State token)")
 with tab2:
     st.header("Activation Patching")
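
The DLA tab plots a layer-by-head grid of contributions to the predicted action's logit. As a rough illustration of the idea, and not the `LogitAttributionEngine` implementation (which this diff does not show), direct logit attribution can be sketched as projecting each head's write to the residual stream onto the unembedding direction of the target action. The sketch assumes the logit is linear in the residual stream, so it ignores any final normalization; `direct_logit_attribution` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def direct_logit_attribution(
    head_outputs: List[List[Vector]], unembed_dir: Vector
) -> List[List[float]]:
    """Project each head's residual-stream write onto one logit direction.

    head_outputs[layer][head] is that head's d_model-sized output at the
    chosen token; unembed_dir is the unembedding column for the target action.
    """
    return [[dot(out, unembed_dir) for out in layer] for layer in head_outputs]


# Toy values: 2 layers x 2 heads, d_model = 3.
head_outputs = [
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    [[0.0, 0.0, -1.0], [1.0, 1.0, 1.0]],
]
unembed_dir = [0.5, 1.0, 2.0]

dla = direct_logit_attribution(head_outputs, unembed_dir)
print(dla)  # [[0.5, 2.0], [-2.0, 3.5]]
```

Under the linearity assumption the per-head values add up to the heads' total contribution to the logit (4.0 in this toy case), and the result has the `(n_layers, n_heads)` shape that the heatmap expects.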

src/models/hooked_dt.py CHANGED

@@ -85,11 +85,11 @@ class HookedDT(nn.Module):
             return action_preds
     @classmethod
-    def from_config(cls, state_dim, action_dim, n_layers=2, n_heads=4, d_model=128):
         cfg = HookedTransformerConfig(
             n_layers=n_layers,
             d_model=d_model,
-            n_ctx=300,
             d_head=d_model // n_heads,
             n_heads=n_heads,
             d_vocab=10, # Dummy vocab size
@@ -99,5 +99,5 @@ class HookedDT(nn.Module):
             use_attn_result=True,
             device="cuda" if torch.cuda.is_available() else "cpu"
         )
-        return cls(cfg, state_dim, action_dim)

             return action_preds
     @classmethod
+    def from_config(cls, state_dim, action_dim, n_layers=2, n_heads=4, d_model=128, max_length=30):
         cfg = HookedTransformerConfig(
             n_layers=n_layers,
             d_model=d_model,
+            n_ctx=3 * max_length,
             d_head=d_model // n_heads,
             n_heads=n_heads,
             d_vocab=10, # Dummy vocab size
             use_attn_result=True,
             device="cuda" if torch.cuda.is_available() else "cpu"
         )
+        return cls(cfg, state_dim, action_dim, max_length=max_length)
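
Setting `n_ctx=3 * max_length` reflects that every timestep contributes three tokens to the sequence. The sketch below assumes the common Decision Transformer ordering of return-to-go, state, action within each timestep, which is consistent with the dashboard comment that token index `-2` is the state token predicting the action. The actual ordering inside `HookedDT` is not shown in this diff, and `interleave_tokens` is a hypothetical helper.

```python
from typing import List


def interleave_tokens(num_steps: int) -> List[str]:
    """Label the token sequence for num_steps timesteps as (RTG, state, action) triples."""
    tokens: List[str] = []
    for t in range(num_steps):
        tokens += [f"R{t}", f"s{t}", f"a{t}"]
    return tokens


max_length = 30
n_ctx = 3 * max_length           # context positions needed for a full window
tokens = interleave_tokens(max_length)

print(len(tokens) == n_ctx)      # True
print(tokens[-2], tokens[-1])    # s29 a29
```

A trajectory longer than `max_length` steps would need more than `n_ctx` positions, which is why inputs have to be truncated before the forward pass.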

tests/test_mechanistic.py ADDED

	@@ -0,0 +1,63 @@

+import torch
+import numpy as np
+import pytest
+from src.models.hooked_dt import HookedDT
+from src.interpretability.attribution import LogitAttributionEngine
+from src.interpretability.patching import ActivationPatcher
+@pytest.fixture
+def model():
+    return HookedDT.from_config(state_dim=10, action_dim=7, n_layers=1, n_heads=2, d_model=32)
+@pytest.fixture
+def mock_data():
+    batch_size = 1
+    seq_len = 5
+    state_dim = 10
+    action_dim = 7
+    states = torch.randn(batch_size, seq_len, state_dim)
+    actions = torch.nn.functional.one_hot(torch.randint(0, action_dim, (batch_size, seq_len)), num_classes=action_dim).float()
+    returns = torch.randn(batch_size, seq_len, 1)
+    return {"states": states, "actions": actions, "returns_to_go": returns}
+def test_logit_attribution(model, mock_data):
+    engine = LogitAttributionEngine(model)
+    preds, cache = model(**mock_data, return_cache=True)
+    target_action = preds[0, -1].argmax().item()
+    dla = engine.calculate_dla(cache, target_logit_index=target_action, token_index=-2)
+    assert dla.shape == (model.cfg.n_layers, model.cfg.n_heads)
+    assert not torch.isnan(dla).any()
+def test_activation_patching(model, mock_data):
+    patcher = ActivationPatcher(model)
+    # Clean run
+    clean_preds, clean_cache = model(**mock_data, return_cache=True)
+    clean_probs = torch.softmax(clean_preds, dim=-1)
+    target_action = clean_preds[0, -1].argmax().item()
+    # Create corrupted run (zeroed states)
+    corrupted_data = mock_data.copy()
+    corrupted_data["states"] = torch.zeros_like(mock_data["states"])
+    _, corrupted_cache = model(**corrupted_data, return_cache=True)
+    # Patch head 0 of layer 0
+    patched_logits = patcher.patch_head(
+        mock_data,
+        corrupted_cache,
+        layer=0,
+        head_index=0,
+        target_token_index=-2
+    )
+    patched_probs = torch.softmax(patched_logits, dim=-1)
+    drop = patcher.calculate_probability_drop(clean_probs, patched_probs, target_action)
+    assert isinstance(drop, float)
+    # Patching with corrupted (zeros) should generally decrease performance/probability
+    # but at minimum we check it returns a valid number.
+    assert not np.isnan(drop)
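
The test only checks that `calculate_probability_drop` returns a finite float; its implementation is not part of this diff. One plausible reading, given the argument names, is the clean probability of the target action minus its probability after patching. The sketch below works under that assumption in plain Python; `probability_drop` is a hypothetical stand-in, not the `ActivationPatcher` method.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_drop(
    clean_logits: List[float], patched_logits: List[float], target: int
) -> float:
    """Clean minus patched probability of the target action (positive = patch hurt)."""
    return softmax(clean_logits)[target] - softmax(patched_logits)[target]


clean = [2.0, 0.0, 0.0]
patched = [0.0, 0.0, 0.0]  # e.g. the head's signal was replaced by a corrupted run
drop = probability_drop(clean, patched, target=0)
print(round(drop, 3))
```

A value near zero would suggest the patched component carries little of the signal for that action, while a large positive value would suggest the component matters for it.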