# SAEs and Activation Steering
Sparse Autoencoders (SAEs) let us decompose the residual stream into human-interpretable features, while activation steering lets us shift the agent's behavior by editing those activations directly.
## Sparse Autoencoders (SAE)
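An SAE decomposes activations into a sparse set of features that are intended to be "monosemantic", meaning each one responds to a single concept. The encoder projects each dense vector into a higher-dimensional latent space where only a few latents are active at once, and individual latents often correspond to specific concepts (e.g., "Wall ahead").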
### TopK SAEs
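Instead of using an L1 penalty to encourage sparsity, we use **TopK SAEs**. These keep only the $k$ largest latent pre-activations for each input and set the rest to zero, so at most $k$ features are active at a time. Sparsity is therefore set directly by $k$ rather than tuned indirectly through a penalty coefficient, and the small, fixed budget of active features per input is easier to analyze than the output of a standard ReLU SAE.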
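As a rough illustration of the TopK step, here is a minimal NumPy sketch of a forward pass. The function name `topk_sae_forward`, the weight names, the subtraction of the decoder bias before encoding, and the ReLU applied to the kept values are all assumptions made for this example; they are not taken from this repository's SAE code.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Hypothetical TopK SAE forward pass.

    x: (batch, d_model) dense activations.
    Returns (latents, reconstruction).
    """
    # Encode into the wider latent space.
    pre = (x - b_dec) @ W_enc + b_enc                  # (batch, n_latents)

    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]

    # Keep those k values (clamped at zero) and zero out everything else.
    latents = np.zeros_like(pre)
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)

    # Decode back to the model's activation space.
    recon = latents @ W_dec + b_dec
    return latents, recon

# Toy example with random, untrained weights.
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

x = rng.normal(size=(8, d_model))
latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_input = (latents != 0).sum(axis=1)
```

Each row of `latents` has at most `k` nonzero entries, which is the property that makes per-input analysis tractable.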
### Natural Language Labeling (NLA)
To avoid manual inspection of thousands of features, we use an **NLA Explainer**. This tool takes the inputs on which a feature activates most strongly and asks a language model to generate a human-readable label (e.g., "Feature #402: Activates when a red key is visible").
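The labeling step can be pictured as building a prompt from a feature's strongest examples. The sketch below is only an assumption about how such a prompt might be assembled: `build_label_prompt`, the text descriptions of states, and the prompt wording are hypothetical, and the call to the language model itself is left out.

```python
import numpy as np

def build_label_prompt(feature_id, activations, descriptions, n_examples=5):
    """Format the top-activating examples of one feature as an LLM prompt.

    activations:  (n_states,) activation of this feature on each state.
    descriptions: text description of each state (assumed to be available).
    """
    # Indices of the strongest activations, largest first.
    top = np.argsort(activations)[::-1][:n_examples]
    lines = [
        f"- activation={activations[i]:.2f}: {descriptions[i]}" for i in top
    ]
    return (
        f"Feature #{feature_id} fires most strongly on these states:\n"
        + "\n".join(lines)
        + "\nDescribe in one short sentence what this feature detects."
    )

# Made-up data for illustration only.
acts = np.array([0.1, 2.3, 0.0, 1.7])
descs = [
    "empty corridor",
    "red key visible, 2 tiles ahead",
    "wall ahead",
    "red key visible, behind door",
]
prompt = build_label_prompt(402, acts, descs, n_examples=2)
print(prompt)
```

The resulting string would then be sent to whichever language model the explainer is configured to use.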
```mermaid
graph LR
Act[Dense Activation] --> Enc[Encoder]
Enc --> Lat[TopK Sparse Latents]
Lat --> NLA[LLM Labeling]
Lat --> Dec[Decoder]
Dec --> Rec[Reconstruction]
style Lat fill:#dfd,stroke:#333
```
## Activation Steering
Steering involves adding a "direction" vector to the model's activations to shift its behavior. This is often done using **Contrastive Activation Addition**.
### Steering Pipeline
1. **Collect States**: Gather activations for two contrasting behaviors (e.g., "Moving Fast" vs "Moving Slow").
2. **Compute Vector**: Subtract the mean activation of the second set from the mean activation of the first. The difference is the steering vector.
3. **Inject**: Add this vector, scaled by a gain coefficient λ, to the model's activations during inference.
```mermaid
graph TD
A[Mean Act: Behavior A] --> Diff[Steering Vector = A - B]
B[Mean Act: Behavior B] --> Diff
In[Current Input] --> Model[DT Model]
Diff -->|Add with Gain λ| Model
Model --> Out[Modified Behavior]
```
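The three pipeline steps can be sketched in a few lines of NumPy. The data here is synthetic and the helper names `steering_vector` and `steer` are hypothetical; in practice the activations would be collected from a chosen layer of the model and the addition would happen inside a forward hook.

```python
import numpy as np

def steering_vector(acts_a, acts_b):
    """Difference of mean activations: mean(A) - mean(B)."""
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)

def steer(hidden, vector, gain):
    """Add the gain-scaled steering vector to a batch of activations."""
    return hidden + gain * vector

# Step 1: synthetic stand-ins for collected activations, shape (n_samples, d_model).
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0                                   # behaviors differ along axis 0
acts_fast = rng.normal(size=(200, d_model)) + 3.0 * direction
acts_slow = rng.normal(size=(200, d_model)) - 3.0 * direction

# Step 2: compute the steering vector.
vec = steering_vector(acts_fast, acts_slow)

# Step 3: inject it into fresh activations with gain lambda.
gain = 0.5
hidden = rng.normal(size=(4, d_model))
steered = steer(hidden, vec, gain)
```

A positive gain pushes activations toward behavior A, and a negative gain pushes them toward behavior B. The gain usually needs tuning: too small has no visible effect, too large can push activations far outside the range the model saw in training.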
## Cross-Architecture Universality Probes
We use **Universality Probes** to check whether features are model-specific or "universal" to the task. By comparing the SAE features of a Decision Transformer with the activations of a different model (like a DQN) trained on the same environment, we can identify concepts that both models represent.
- **High Correlation**: Suggests the feature is a fundamental concept required to solve the task (e.g., the presence of a wall).
- **Low Correlation**: Suggests the feature might be an artifact of the specific architecture or training algorithm.
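One simple way to obtain such a correlation score is a linear probe: fit a least-squares map from the other model's activations to a single SAE feature, then measure the Pearson correlation between prediction and target on held-out states. This is a sketch under that assumption, not necessarily the probe this project uses; `probe_correlation` and the synthetic data are hypothetical.

```python
import numpy as np

def probe_correlation(other_acts, feature, train_frac=0.8):
    """Held-out Pearson r of a linear probe predicting one SAE feature.

    other_acts: (n_states, d_other) activations of the second model (e.g. a DQN).
    feature:    (n_states,) activation of one SAE feature on the same states.
    """
    n = len(feature)
    split = int(n * train_frac)
    X = np.hstack([other_acts, np.ones((n, 1))])     # append a bias column
    w, *_ = np.linalg.lstsq(X[:split], feature[:split], rcond=None)
    pred = X[split:] @ w
    return np.corrcoef(pred, feature[split:])[0, 1]

# Synthetic example: one feature the other model encodes linearly, one it does not.
rng = np.random.default_rng(0)
other_acts = rng.normal(size=(500, 32))
shared_feature = other_acts @ rng.normal(size=32) + 0.1 * rng.normal(size=500)
artifact_feature = rng.normal(size=500)

r_shared = probe_correlation(other_acts, shared_feature)
r_artifact = probe_correlation(other_acts, artifact_feature)
```

Evaluating on held-out states matters here: with many probe inputs, a correlation measured on the training states can look high even for an unrelated feature. A low score from a linear probe also leaves open the possibility that the other model encodes the concept non-linearly.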