---
license: mit
library_name: sae_lens
tags:
- interpretability
- sparse-autoencoder
- sae
- mechanistic-interpretability
- topk-sae
---

# InterpGPT: Phase 5 TopK SAEs

Seven sparse autoencoders trained on the residual stream (`hook_resid_post`) of the two Phase 1 InterpGPT models ([`interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M), [`interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M)).

| Model | Layer | Hook | Subdir |
|---|---|---|---|
| standard | 0 | hook_resid_post | `standard_L0_hook_resid_post/` |
| standard | 1 | hook_resid_post | `standard_L1_hook_resid_post/` |
| standard | 2 | hook_resid_post | `standard_L2_hook_resid_post/` |
| standard | 3 | hook_resid_post | `standard_L3_hook_resid_post/` |
| adhd | 1 | hook_resid_post | `adhd_L1_hook_resid_post/` |
| adhd | 2 | hook_resid_post | `adhd_L2_hook_resid_post/` |
| adhd | 3 | hook_resid_post | `adhd_L3_hook_resid_post/` |

## Training setup

- Library: [`sae_lens`](https://github.com/jbloomAus/SAELens) TopK training SAE
- `k = 40`, `d_sae = 4096`
- All 7 SAEs pass quality gates: FVE 0.87–0.92, dead features < 2%

## Phase 1 result artifacts (included)

- `feature_diff.json`: 312 ADHD-L2 features firing at step-onset that the standard model lacks. Feature 2504 highlighted (2000× cross-model asymmetry).
- `causal_nulls_per_seed.json`: 5-seed causal ablation nulls for the L3 swap.
- `deepdive_steering.json`: feature 2504 four-panel steering results (all four interventions Δ within ±0.025 of null, below 2 SEM).
- `three_probes.json`: three-probe causal-check outputs.
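
## TopK and FVE in brief

For readers new to the terms in the training setup, the following is a minimal sketch of what a TopK activation and a fraction-of-variance-explained (FVE) score compute. It is written in plain Python so it runs without dependencies, and it is not the `sae_lens` implementation. The function names are hypothetical, the ReLU after the top-k selection is one common formulation, and the FVE formula is one common definition; this card does not state which variant was used for the quality gates.

```python
import heapq


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (passed through ReLU), zero the rest.

    Assumption: ReLU is applied to the kept values; this card does not
    specify the post-selection nonlinearity.
    """
    # Indices of the k largest entries of pre_acts.
    keep = set(heapq.nlargest(k, range(len(pre_acts)), key=lambda i: pre_acts[i]))
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def fraction_variance_explained(x, x_hat):
    """FVE = 1 - sum((x - x_hat)^2) / sum((x - mean(x))^2) over flat lists.

    Assumption: a single global mean is used; per-dimension variants exist.
    """
    mean = sum(x) / len(x)
    residual = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    total = sum((a - mean) ** 2 for a in x)
    return 1.0 - residual / total


# Toy example with k = 2 (the SAEs in this repo use k = 40, d_sae = 4096).
sparse = topk_activation([0.5, -1.0, 2.0, 0.1, 1.5], k=2)
print(sparse)  # [0.0, 0.0, 2.0, 0.0, 1.5]
```

With `k = 40` and `d_sae = 4096`, at most 40 of the 4096 latent features are nonzero for any single residual-stream vector under this formulation.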

## Loading

### Minimal

```python
from huggingface_hub import snapshot_download
from sae_lens import SAE

repo = "connaaa/interpgpt-sae-phase5"
local = snapshot_download(repo_id=repo, allow_patterns=["adhd_L2_hook_resid_post/*"])
sae = SAE.load_from_disk(f"{local}/adhd_L2_hook_resid_post")
print(sae)
```

### Pull everything

```python
from huggingface_hub import snapshot_download

local = snapshot_download(repo_id="connaaa/interpgpt-sae-phase5")
```

## Reproducibility

Training script: `phase5_sae.py` in [github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt). Production driver: `phase5_production.py`. Four-panel steering harness: `phase5_steering_ci.py`.

## License

MIT.
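
## Selecting a subdirectory by model and layer

The subdirectories in this repo follow the pattern `{model}_L{layer}_hook_resid_post/`, and not every model/layer pair exists (there is no `adhd` layer 0 SAE). The sketch below builds the subdirectory name and the matching `allow_patterns` entry for `snapshot_download`, rejecting pairs that are not listed in the table. The helper names `sae_subdir` and `allow_pattern` are hypothetical and not part of `sae_lens` or `huggingface_hub`.

```python
# Availability copied from the table in this card.
AVAILABLE_LAYERS = {"standard": (0, 1, 2, 3), "adhd": (1, 2, 3)}
HOOK = "hook_resid_post"


def sae_subdir(model: str, layer: int) -> str:
    """Return the repo subdirectory for a (model, layer) pair, e.g. 'adhd_L2_hook_resid_post'."""
    if model not in AVAILABLE_LAYERS:
        raise ValueError(f"unknown model {model!r}; expected one of {sorted(AVAILABLE_LAYERS)}")
    if layer not in AVAILABLE_LAYERS[model]:
        raise ValueError(f"no SAE for {model} layer {layer}; available: {AVAILABLE_LAYERS[model]}")
    return f"{model}_L{layer}_{HOOK}"


def allow_pattern(model: str, layer: int) -> str:
    """Return a glob that restricts a snapshot download to one SAE's files."""
    return f"{sae_subdir(model, layer)}/*"


print(allow_pattern("adhd", 2))  # adhd_L2_hook_resid_post/*
```

The returned pattern can be passed as `allow_patterns=[allow_pattern("adhd", 2)]`, and the returned subdirectory joined onto the local snapshot path before calling `SAE.load_from_disk`.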