---
license: mit
library_name: sae_lens
tags:
  - interpretability
  - sparse-autoencoder
  - sae
  - mechanistic-interpretability
  - topk-sae
---

# InterpGPT — Phase 5 TopK SAEs

Seven sparse autoencoders trained on the residual stream
(`hook_resid_post`) of the two Phase 1 InterpGPT models
([`interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M),
[`interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M)).

| Model | Layer | Hook | Subdir |
|---|---|---|---|
| standard | 0 | hook_resid_post | `standard_L0_hook_resid_post/` |
| standard | 1 | hook_resid_post | `standard_L1_hook_resid_post/` |
| standard | 2 | hook_resid_post | `standard_L2_hook_resid_post/` |
| standard | 3 | hook_resid_post | `standard_L3_hook_resid_post/` |
| adhd     | 1 | hook_resid_post | `adhd_L1_hook_resid_post/` |
| adhd     | 2 | hook_resid_post | `adhd_L2_hook_resid_post/` |
| adhd     | 3 | hook_resid_post | `adhd_L3_hook_resid_post/` |
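The subdirectory names in the table follow a regular `{model}_L{layer}_{hook}` pattern, so they can be built programmatically. The helper below is a minimal sketch; `AVAILABLE`, `sae_subdir`, and `all_subdirs` are illustrative names that are not part of this repo or of `sae_lens`, and the set of valid pairs is copied from the table above.

```python
# Hypothetical helper: builds the subdirectory names listed in the table.
# The (model, layer) pairs below are copied from the table; there is no
# adhd layer-0 entry in it, so that pair is rejected.
AVAILABLE = {
    "standard": (0, 1, 2, 3),
    "adhd": (1, 2, 3),
}
HOOK = "hook_resid_post"


def sae_subdir(model: str, layer: int) -> str:
    """Return the repo subdirectory for a (model, layer) pair from the table."""
    if model not in AVAILABLE:
        raise KeyError(f"unknown model {model!r}; expected one of {sorted(AVAILABLE)}")
    if layer not in AVAILABLE[model]:
        raise ValueError(f"no SAE listed for {model} layer {layer}")
    return f"{model}_L{layer}_{HOOK}"


def all_subdirs() -> list[str]:
    """Return every subdirectory named in the table, in table order."""
    return [sae_subdir(m, layer) for m, layers in AVAILABLE.items() for layer in layers]
```

For example, `sae_subdir("adhd", 2)` gives the directory used in the loading snippet further down, and `f"{sae_subdir('adhd', 2)}/*"` is a suitable `allow_patterns` entry for `snapshot_download`.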

## Training setup

- Library: [`sae_lens`](https://github.com/jbloomAus/SAELens), using its TopK training SAE
- `k = 40` active latents per token, dictionary size `d_sae = 4096`
- All 7 SAEs pass the quality gates: fraction of variance explained (FVE) 0.87–0.92, dead features < 2%
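To make the settings and gates above concrete, here is a small dependency-free sketch of a TopK activation and of the two gate metrics. It is an illustration, not the training code: the function names are hypothetical, the ReLU clamp on the kept values is an assumption about the activation, and the FVE and dead-feature definitions shown (variance measured around the per-dimension mean, a latent counted as dead if it never fires on the evaluated sample) are common conventions that may differ in detail from what `phase5_sae.py` computes.

```python
import heapq


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (clamped at zero) and zero the rest."""
    keep = set(heapq.nlargest(k, range(len(pre_acts)), key=pre_acts.__getitem__))
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def fraction_variance_explained(inputs, recons):
    """FVE = 1 - (squared reconstruction error / variance around the mean).

    `inputs` and `recons` are lists of equal-length rows (one row per token).
    """
    n, d = len(inputs), len(inputs[0])
    means = [sum(row[j] for row in inputs) / n for j in range(d)]
    resid = sum((x - y) ** 2 for row, rec in zip(inputs, recons) for x, y in zip(row, rec))
    total = sum((row[j] - means[j]) ** 2 for row in inputs for j in range(d))
    return 1.0 - resid / total


def dead_fraction(latent_acts):
    """Fraction of latents that never fire across the sampled rows."""
    d_sae = len(latent_acts[0])
    fired = sum(any(row[j] != 0.0 for row in latent_acts) for j in range(d_sae))
    return 1.0 - fired / d_sae
```

With the configuration listed above, `topk_activation` would be called with `k=40` on a 4096-element pre-activation vector, so at most 40 latents are nonzero per token.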

## Phase 1 result artifacts (included)

- `feature_diff.json`: 312 ADHD-L2 features that fire at step onset and that the
  standard model lacks. Feature 2504 is highlighted (2000× cross-model asymmetry).
- `causal_nulls_per_seed.json`: 5-seed causal-ablation nulls for the L3 swap.
- `deepdive_steering.json`: four-panel steering results for feature 2504 (for all
  four interventions, Δ is within ±0.025 of the null, below 2 SEM).
- `three_probes.json`: outputs of the three-probe causal check.
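The JSON schemas are not documented in this card, so a reasonable first step is to look at each file's top-level structure before relying on any field. The sketch below does only that. `summarize_json` and `fetch_artifact` are hypothetical helper names, and `fetch_artifact` assumes the files sit at the root of the `connaaa/interpgpt-sae-phase5` repo, which this card does not state explicitly.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Describe the top-level structure of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Show at most ten keys so large files stay readable.
        return {"type": "dict", "size": len(data), "keys": sorted(data)[:10]}
    if isinstance(data, list):
        return {"type": "list", "size": len(data)}
    return {"type": type(data).__name__}


def fetch_artifact(filename, repo_id="connaaa/interpgpt-sae-phase5"):
    """Download one artifact from the Hub and return its local path.

    Assumes the file lives at the repo root; needs network access.
    """
    from huggingface_hub import hf_hub_download  # lazy import, only needed here

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

A typical call would be `summarize_json(fetch_artifact("feature_diff.json"))`; if the files live in a subfolder instead, adjust `filename` accordingly.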

## Loading

### Minimal

```python
from huggingface_hub import snapshot_download
from sae_lens import SAE

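repo = "connaaa/interpgpt-sae-phase5"
# Download only the ADHD layer-2 SAE; swap the pattern for any subdir in the table.
local = snapshot_download(repo_id=repo, allow_patterns=["adhd_L2_hook_resid_post/*"])
# Load the SAE weights and config from the downloaded subdirectory.
sae = SAE.load_from_disk(f"{local}/adhd_L2_hook_resid_post")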
print(sae)
```

### Pull everything

```python
from huggingface_hub import snapshot_download
local = snapshot_download(repo_id="connaaa/interpgpt-sae-phase5")
```

## Reproducibility

Training script: `phase5_sae.py` in
[github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt).
Production driver: `phase5_production.py`. Four-panel steering harness:
`phase5_steering_ci.py`.

## License

MIT.