---
license: mit
library_name: sae_lens
tags:
- interpretability
- sparse-autoencoder
- sae
- mechanistic-interpretability
- topk-sae
---

# InterpGPT: Phase 5 TopK SAEs

This repository contains seven TopK sparse autoencoders (SAEs) trained on the
residual stream (`hook_resid_post`) of the two Phase 1 InterpGPT models
([`interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M),
[`interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M)).
Each SAE lives in its own subdirectory:

| Model | Layer | Hook | Subdir |
|---|---|---|---|
| standard | 0 | hook_resid_post | `standard_L0_hook_resid_post/` |
| standard | 1 | hook_resid_post | `standard_L1_hook_resid_post/` |
| standard | 2 | hook_resid_post | `standard_L2_hook_resid_post/` |
| standard | 3 | hook_resid_post | `standard_L3_hook_resid_post/` |
| adhd | 1 | hook_resid_post | `adhd_L1_hook_resid_post/` |
| adhd | 2 | hook_resid_post | `adhd_L2_hook_resid_post/` |
| adhd | 3 | hook_resid_post | `adhd_L3_hook_resid_post/` |

## Training setup

- Library: [`sae_lens`](https://github.com/jbloomAus/SAELens) TopK training SAE
- `k = 40` active features per token, dictionary width `d_sae = 4096`
- All 7 SAEs pass the quality gates: fraction of variance explained (FVE)
  0.87–0.92, dead features < 2%

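To make the terms in the list above concrete, here is a minimal pure-Python sketch of a TopK activation and of the two gate metrics. It is not the `sae_lens` implementation: the ReLU on the kept values, the FVE formula (one common definition), and the "never fires in the batch" criterion for dead features are all assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def topk_activation(pre_acts: Sequence[float], k: int) -> List[float]:
    """Keep the k largest pre-activations and zero out the rest.

    Assumption: kept values pass through a ReLU, so a negative
    pre-activation never yields an active feature.
    """
    if k <= 0:
        return [0.0] * len(pre_acts)
    ranked = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(ranked[:k])
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def fraction_variance_explained(
    inputs: Sequence[Sequence[float]], recons: Sequence[Sequence[float]]
) -> float:
    """FVE = 1 - sum ||x - x_hat||^2 / sum ||x - mean(x)||^2 over a batch."""
    n, dim = len(inputs), len(inputs[0])
    mean = [sum(x[j] for x in inputs) / n for j in range(dim)]
    resid = sum((x[j] - r[j]) ** 2 for x, r in zip(inputs, recons) for j in range(dim))
    total = sum((x[j] - mean[j]) ** 2 for x in inputs for j in range(dim))
    return 1.0 - resid / total


def dead_feature_fraction(feature_acts: Sequence[Sequence[float]]) -> float:
    """Share of features that are zero on every example in the batch."""
    d_sae = len(feature_acts[0])
    dead = sum(1 for j in range(d_sae) if all(row[j] == 0.0 for row in feature_acts))
    return dead / d_sae
```

With `k = 40` and `d_sae = 4096`, at most 40 of the 4096 features are nonzero for any token, which is where the sparsity comes from.
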
## Phase 1 result artifacts (included)

- `feature_diff.json`: 312 ADHD-L2 features firing at step-onset that the
  standard model lacks. Feature 2504 highlighted (2000× cross-model asymmetry).
- `causal_nulls_per_seed.json`: 5-seed causal ablation nulls for the L3 swap.
- `deepdive_steering.json`: feature 2504 four-panel steering results (all four
  interventions Δ within ±0.025 of null, below 2 SEM).
- `three_probes.json`: three-probe causal-check outputs.

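The "below 2 SEM" criterion can be read as follows: an intervention counts as null-like when its effect lies within two standard errors of the mean of the null distribution. The sketch below shows that check under the assumption that the SEM is taken over per-seed null values; the card does not state how the comparison was computed. The function name `within_null` is hypothetical, and the numbers are placeholders for illustration, not values from the JSON files.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def within_null(
    effect: float, null_samples: Sequence[float], n_sem: float = 2.0
) -> Tuple[bool, float]:
    """Return (is_within, sem) for an effect compared against null samples.

    is_within is True when |effect - mean(null)| < n_sem * SEM, where
    SEM = sample standard deviation / sqrt(number of samples).
    """
    sem = stdev(null_samples) / sqrt(len(null_samples))
    return abs(effect - mean(null_samples)) < n_sem * sem, sem


# Placeholder per-seed null deltas (made up for illustration only).
example_nulls = [0.00, 0.04, -0.03, 0.02, -0.03]
is_null_like, sem = within_null(0.02, example_nulls)
print(f"SEM = {sem:.4f}, within 2 SEM of null: {is_null_like}")
```
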
## Loading

### Minimal

```python
from huggingface_hub import snapshot_download
from sae_lens import SAE

repo = "connaaa/interpgpt-sae-phase5"
# Download only the ADHD layer-2 SAE rather than the whole repository.
local = snapshot_download(repo_id=repo, allow_patterns=["adhd_L2_hook_resid_post/*"])
# Load the SAE from the downloaded subdirectory.
sae = SAE.load_from_disk(f"{local}/adhd_L2_hook_resid_post")
print(sae)
```

### Pull everything

```python
from huggingface_hub import snapshot_download
local = snapshot_download(repo_id="connaaa/interpgpt-sae-phase5")
```

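Once the full snapshot is local, each SAE can be loaded by building its subdirectory name from the table above. The sketch below is one way to do that. The helper names `sae_subdir` and `load_all` are hypothetical, and `SAE.load_from_disk` is used exactly as in the minimal example. `sae_lens` is imported lazily, so the naming helper works without it installed.

```python
from pathlib import Path
from typing import Dict, List, Tuple

HOOK = "hook_resid_post"

# (model, layer) pairs listed in the table above; there is no adhd layer-0 SAE.
AVAILABLE: List[Tuple[str, int]] = [
    ("standard", 0), ("standard", 1), ("standard", 2), ("standard", 3),
    ("adhd", 1), ("adhd", 2), ("adhd", 3),
]


def sae_subdir(model: str, layer: int) -> str:
    """Return the subdirectory name for one SAE, e.g. 'adhd_L2_hook_resid_post'."""
    if (model, layer) not in AVAILABLE:
        raise ValueError(f"no SAE published for model={model!r}, layer={layer}")
    return f"{model}_L{layer}_{HOOK}"


def load_all(local_root: str) -> Dict[Tuple[str, int], object]:
    """Load every SAE from a local snapshot directory, keyed by (model, layer)."""
    from sae_lens import SAE  # lazy import: only needed when actually loading

    return {
        (model, layer): SAE.load_from_disk(str(Path(local_root) / sae_subdir(model, layer)))
        for model, layer in AVAILABLE
    }
```

Calling `load_all(local)` with the path returned by `snapshot_download` gives a dictionary such as `saes[("adhd", 2)]`.
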
## Reproducibility

Training script: `phase5_sae.py` in
[github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt).
Production driver: `phase5_production.py`. Four-panel steering harness:
`phase5_steering_ci.py`.

## License

MIT.