---
license: mit
library_name: sae_lens
tags:
- interpretability
- sparse-autoencoder
- sae
- mechanistic-interpretability
- topk-sae
---
InterpGPT — Phase 5 TopK SAEs
Seven sparse autoencoders trained on the residual stream (hook_resid_post) of the two Phase 1 InterpGPT models (interpgpt-standard-23M, interpgpt-adhd-23M).
| Model | Layer | Hook | Subdir |
|---|---|---|---|
| standard | 0 | hook_resid_post | standard_L0_hook_resid_post/ |
| standard | 1 | hook_resid_post | standard_L1_hook_resid_post/ |
| standard | 2 | hook_resid_post | standard_L2_hook_resid_post/ |
| standard | 3 | hook_resid_post | standard_L3_hook_resid_post/ |
| adhd | 1 | hook_resid_post | adhd_L1_hook_resid_post/ |
| adhd | 2 | hook_resid_post | adhd_L2_hook_resid_post/ |
| adhd | 3 | hook_resid_post | adhd_L3_hook_resid_post/ |
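
The subdirectory names in the table follow a regular `{model}_L{layer}_hook_resid_post` pattern. A minimal sketch of a helper that builds a subdirectory name and rejects combinations the table does not list is shown here; the names `AVAILABLE` and `sae_subdir` are hypothetical and not part of this repository.

```python
# Hypothetical helper: the (model, layer) pairs are copied from the table above.
AVAILABLE = {
    "standard": (0, 1, 2, 3),
    "adhd": (1, 2, 3),
}
HOOK = "hook_resid_post"


def sae_subdir(model, layer):
    """Return the subdirectory name for one SAE, e.g. 'adhd_L2_hook_resid_post'."""
    if layer not in AVAILABLE.get(model, ()):
        raise ValueError(f"no SAE listed for model={model!r}, layer={layer}")
    return f"{model}_L{layer}_{HOOK}"


if __name__ == "__main__":
    # Print all seven subdirectory names.
    for model, layers in AVAILABLE.items():
        for layer in layers:
            print(sae_subdir(model, layer))
```

The returned string can be passed to `allow_patterns` (with a trailing `/*`) to download a single SAE.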
Training setup
- Library: sae_lens
- TopK training SAE, k = 40, d_sae = 4096
- All 7 SAEs pass quality gates: FVE (fraction of variance explained) 0.87–0.92, dead features < 2%
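
To make the terms above concrete, here is a rough pure-Python sketch of a TopK activation and of the two quality metrics. It is illustrative only and is not the sae_lens implementation. The exact FVE and dead-feature definitions used for the quality gates are not stated in this card, so the formulas below are assumptions based on the common definitions.

```python
import heapq


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations, apply ReLU to them, zero the rest."""
    keep = set(heapq.nlargest(k, range(len(pre_acts)), key=lambda i: pre_acts[i]))
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def fraction_variance_explained(inputs, recons):
    """Assumed definition: 1 - sum||x - x_hat||^2 / sum||x - mean(x)||^2 over a batch."""
    n, d = len(inputs), len(inputs[0])
    mean = [sum(x[j] for x in inputs) / n for j in range(d)]
    resid = sum((x[j] - r[j]) ** 2 for x, r in zip(inputs, recons) for j in range(d))
    total = sum((x[j] - mean[j]) ** 2 for x in inputs for j in range(d))
    return 1.0 - resid / total


def dead_feature_fraction(feature_acts):
    """Assumed definition: share of features that are zero on every row of the batch."""
    d_sae = len(feature_acts[0])
    fired = [any(row[j] != 0.0 for row in feature_acts) for j in range(d_sae)]
    return 1.0 - sum(fired) / d_sae
```

With k = 40 and d_sae = 4096, at most 40 of the 4096 features are nonzero for any single input vector.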
Phase 1 result artifacts (included)
- feature_diff.json: 312 ADHD-L2 features firing at step-onset that the standard model lacks. Feature 2504 is highlighted (2000× cross-model asymmetry).
- causal_nulls_per_seed.json: 5-seed causal ablation nulls for the L3 swap.
- deepdive_steering.json: feature 2504 four-panel steering results (all four interventions have Δ within ±0.025 of null, below 2 SEM).
- three_probes.json: three-probe causal-check outputs.
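
The internal schema of these JSON files is not documented here, so a reasonable first step is to inspect their top-level structure before relying on any particular key. The sketch below does that without assuming a schema; `summarize_json` is a hypothetical helper, and the commented usage assumes the files sit at the repository root.

```python
import json


def summarize_json(path, max_items=5):
    """Return a short description of a JSON file's top-level structure."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        return {"type": "dict", "size": len(data), "keys": list(data)[:max_items]}
    if isinstance(data, list):
        first = type(data[0]).__name__ if data else None
        return {"type": "list", "size": len(data), "first_item_type": first}
    return {"type": type(data).__name__, "value": data}


# Example usage (assumption: the artifact is stored at the repo root):
# from huggingface_hub import hf_hub_download
# path = hf_hub_download(repo_id="connaaa/interpgpt-sae-phase5", filename="feature_diff.json")
# print(summarize_json(path))
```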
Loading
Minimal
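```python
from huggingface_hub import snapshot_download
from sae_lens import SAE

repo = "connaaa/interpgpt-sae-phase5"

# Download only the ADHD layer-2 SAE rather than the whole repository.
local = snapshot_download(repo_id=repo, allow_patterns=["adhd_L2_hook_resid_post/*"])

# Load the SAE from the downloaded subdirectory.
sae = SAE.load_from_disk(f"{local}/adhd_L2_hook_resid_post")
print(sae)
```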
Pull everything
```python
from huggingface_hub import snapshot_download

local = snapshot_download(repo_id="connaaa/interpgpt-sae-phase5")
```
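
Once the full snapshot is on disk, each SAE can be loaded in a loop. The following is a sketch under the assumption that every SAE lives in a top-level subdirectory whose name ends in `_hook_resid_post`, as in the table above; `find_sae_dirs` and `load_all_saes` are hypothetical helper names.

```python
from pathlib import Path


def find_sae_dirs(snapshot_root):
    """Return SAE subdirectories (names ending in '_hook_resid_post'), sorted by name."""
    root = Path(snapshot_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.endswith("_hook_resid_post")
    )


def load_all_saes(snapshot_root, device="cpu"):
    """Load every SAE found under snapshot_root, keyed by subdirectory name."""
    # Imported lazily so the directory scan above works without sae_lens installed.
    from sae_lens import SAE

    return {
        p.name: SAE.load_from_disk(str(p), device=device)
        for p in find_sae_dirs(snapshot_root)
    }


# Example usage, where `local` is the path returned by snapshot_download:
# saes = load_all_saes(local)
# print(sorted(saes))
```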
Reproducibility
- Training script: phase5_sae.py in github.com/cwklurks/interpgpt
- Production driver: phase5_production.py
- Four-panel steering harness: phase5_steering_ci.py
License
MIT.