---
license: mit
library_name: sae_lens
tags:
  - interpretability
  - sparse-autoencoder
  - sae
  - mechanistic-interpretability
  - topk-sae
---

# InterpGPT — Phase 5 TopK SAEs

Seven TopK sparse autoencoders trained on the residual stream (`hook_resid_post`) of the two Phase 1 InterpGPT models (`interpgpt-standard-23M`, `interpgpt-adhd-23M`):

| Model | Layer | Hook | Subdir |
|---|---|---|---|
| standard | 0 | `hook_resid_post` | `standard_L0_hook_resid_post/` |
| standard | 1 | `hook_resid_post` | `standard_L1_hook_resid_post/` |
| standard | 2 | `hook_resid_post` | `standard_L2_hook_resid_post/` |
| standard | 3 | `hook_resid_post` | `standard_L3_hook_resid_post/` |
| adhd | 1 | `hook_resid_post` | `adhd_L1_hook_resid_post/` |
| adhd | 2 | `hook_resid_post` | `adhd_L2_hook_resid_post/` |
| adhd | 3 | `hook_resid_post` | `adhd_L3_hook_resid_post/` |

## Training setup

- Library: `sae_lens` (TopK SAE trainer)
- k = 40, d_sae = 4096 (a sketch of the TopK forward pass follows this list)
- All 7 SAEs pass the quality gates: fraction of variance explained (FVE) 0.87–0.92, dead features < 2%
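
For readers new to the architecture, here is a minimal, self-contained sketch of what a TopK SAE computes with these hyperparameters. This is an illustrative PyTorch reimplementation, not the `sae_lens` training code, and `d_in=256` is a placeholder for the models' actual residual-stream width.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Illustrative TopK sparse autoencoder (not the sae_lens implementation)."""

    def __init__(self, d_in: int, d_sae: int = 4096, k: int = 40):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Encode, then keep only the k largest pre-activations per token;
        # everything else is zeroed, so exactly k features fire.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        return acts @ self.W_dec + self.b_dec

# d_in=256 is a placeholder; use the actual residual-stream width of the 23M models.
sae = TopKSAE(d_in=256)
recon = sae(torch.randn(8, 256))
```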

## Phase 5 analysis artifacts (included)

- `feature_diff.json` — 312 ADHD-L2 features that fire at step-onset and have no counterpart in the standard model. Feature 2504 is highlighted (2000× cross-model asymmetry).
- `causal_nulls_per_seed.json` — 5-seed causal-ablation nulls for the L3 swap.
- `deepdive_steering.json` — four-panel steering results for feature 2504 (all four interventions show a Δ within ±0.025 of the null, below 2 SEM).
- `three_probes.json` — outputs of the three-probe causal check.
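
These files are plain JSON. A quick way to pull one down and inspect it (this assumes the artifacts sit at the repo root; adjust the filename if they are nested):

```python
import json
from huggingface_hub import hf_hub_download

# Download a single artifact (assumed to live at the repo root).
path = hf_hub_download(repo_id="connaaa/interpgpt-sae-phase5",
                       filename="feature_diff.json")
with open(path) as f:
    diff = json.load(f)

# The schema is not documented here, so just peek at the top-level structure.
if isinstance(diff, dict):
    print(list(diff.keys())[:10])
else:
    print(type(diff), len(diff))
```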

## Loading

### Minimal

```python
from huggingface_hub import snapshot_download
from sae_lens import SAE

# Download only the ADHD layer-2 SAE and load it from disk.
repo = "connaaa/interpgpt-sae-phase5"
local = snapshot_download(repo_id=repo, allow_patterns=["adhd_L2_hook_resid_post/*"])
sae = SAE.load_from_disk(f"{local}/adhd_L2_hook_resid_post")
print(sae)
```
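
Once loaded, a quick smoke test on random activations confirms the k = 40 sparsity. This assumes the installed `sae_lens` version exposes `sae.cfg.d_in` and `encode()`/`decode()` (recent versions do; verify against yours):

```python
import torch

# Encode random activations, decode, and check per-token sparsity.
x = torch.randn(4, sae.cfg.d_in)
feats = sae.encode(x)            # shape (4, 4096); at most 40 nonzero per row
recon = sae.decode(feats)
print((feats != 0).sum(dim=-1))  # expect ~40 active features per token
```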

### Pull everything

```python
from huggingface_hub import snapshot_download

# Download all seven SAE subdirectories plus the JSON artifacts.
local = snapshot_download(repo_id="connaaa/interpgpt-sae-phase5")
```
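
From there, a sketch for loading all seven SAEs keyed by subdirectory name. The glob pattern assumes the subdirectory naming shown in the table above:

```python
from pathlib import Path
from sae_lens import SAE

# Load every SAE subdirectory (named *_hook_resid_post, per the table above).
saes = {
    p.name: SAE.load_from_disk(str(p))
    for p in Path(local).glob("*_hook_resid_post")
}
print(sorted(saes))  # 7 entries: standard L0-L3, adhd L1-L3
```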

## Reproducibility

- Training script: `phase5_sae.py` in [github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt)
- Production driver: `phase5_production.py`
- Four-panel steering harness: `phase5_steering_ci.py`

## License

MIT.