---
license: mit
library_name: transformers
tags:
  - interpretability
  - mechanistic-interpretability
  - task-decomposition
  - small-language-model
  - transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — ADHD Model (23M)

Part of the InterpGPT matched-pair release. This is the ADHD model; its counterpart is connaaa/interpgpt-standard-23M. Both models share identical architecture and training recipe; only the training data distribution differs.

ADHD variant training data: task decompositions broken into smaller steps with interleaved micro-regulation actions ("sip water", "deep breath", "close eyes briefly", "quick stretch", "pause").
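For illustration, the interleaving can be sketched as a small helper that inserts a micro-regulation action after every few task steps. This is a hypothetical reconstruction of the data layout; the actual corpus generator and action cadence are not published in this card:

```python
import itertools

# Micro-regulation actions listed in this card.
REGULATION_ACTIONS = ["sip water", "deep breath", "close eyes briefly",
                      "quick stretch", "pause"]

def interleave_regulation(steps, every=2):
    """Insert a regulation action after every `every` task steps.

    Hypothetical illustration of the ADHD-variant layout; the real
    cadence used for training is an assumption here.
    """
    actions = itertools.cycle(REGULATION_ACTIONS)
    out = []
    for i, step in enumerate(steps, start=1):
        out.append(step)
        if i % every == 0:
            out.append(next(actions))
    return out

steps = ["Clear the counters", "Load the dishwasher",
         "Wipe the stove", "Sweep the floor"]
print(interleave_regulation(steps))
```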

## Architecture

|                              | Value |
|------------------------------|-------|
| Parameters                   | 23,471,104 |
| Layers                       | 6 |
| Heads                        | 8 |
| d_model                      | 512 |
| d_head                       | 64 |
| d_mlp (SwiGLU)               | 1408 |
| Vocab                        | 8192 (custom BPE) |
| Context length               | 512 |
| Norm                         | RMSNorm (ε = 1e-6) |
| Position                     | RoPE (half-half, base 10,000) |
| Activation                   | SwiGLU |
| Biases                       | none |
| Tied input/output embeddings | yes |
| Training                     | ~25k steps on the ADHD-variant task-decomposition corpus |

## Headline findings (Phase 1)

- **Structural head-position swap.** A step-layout-broadcast head lives at L3H0 in the standard model and at L3H5 in the ADHD model. The cross-model per-position attention-profile cosine at the matched pair is 0.997, versus a same-index baseline of ~0.65 (0.663 for one pair, 0.643 for the other). Causal ablation confirms the functional identity: ablating L3H5 in the ADHD model drops the Spearman correlation between task complexity and step count from 0.83 to 0.78 (median Δ = -0.055 across 5 seeds).
- **Block-2 content circuit.** P(regulation token) at step-onset positions jumps 17× between layer 1 and layer 2 (0.014 → 0.251). In the standard model it never crosses 1% at any layer.
- **High-specificity null-steering feature.** An ADHD-L2 SAE feature (feature 2504) fires at 59% of ADHD step-onsets vs. 0.03% of standard step-onsets (~2000× cross-model asymmetry), yet causal steering along its decoder direction produces a Δ within sampling noise under all four intervention variants (inject-std, subtract-adhd, zero-ablate, inject-upstream). See the companion SAE repo connaaa/interpgpt-sae-phase5.
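The head-matching metric in the first finding is a cosine similarity between per-position attention profiles. A minimal sketch of that comparison, using NumPy with stand-in arrays rather than real model activations (the shapes and probe setup here are assumptions):

```python
import numpy as np

def profile_cosine(p, q):
    """Cosine similarity between two flattened attention profiles."""
    p, q = np.ravel(p), np.ravel(q)
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

rng = np.random.default_rng(0)
# Stand-in profiles: (positions, positions) attention averaged over a probe set.
std_head = rng.random((16, 16))
adhd_head = std_head + 0.01 * rng.random((16, 16))  # near-identical matched pair
other_head = rng.random((16, 16))                   # unrelated same-index baseline

print(profile_cosine(std_head, adhd_head))   # close to 1.0
print(profile_cosine(std_head, other_head))  # noticeably lower
```

In the actual analysis the profiles would come from attention patterns of the two checkpoints on a shared probe set; here they are synthetic so the snippet runs standalone.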

## Loading

Loading is identical to the standard variant. See connaaa/interpgpt-standard-23M for AutoModel, TransformerLens, and raw-TaskGPT examples; substitute this repo id.

## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```
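A small helper for parsing completions in this format back into a task and its steps. This is a hypothetical utility for working with the format above, not part of the release:

```python
import re

def parse_decomposition(text):
    """Split '<|task|>T<|steps|>s1<|sep|>s2...<|end|>' into (task, [steps])."""
    m = re.fullmatch(r"<\|task\|>(.*?)<\|steps\|>(.*?)<\|end\|>", text, re.S)
    if m is None:
        raise ValueError("malformed decomposition string")
    task, body = m.groups()
    steps = [s for s in body.split("<|sep|>") if s]
    return task, steps

task, steps = parse_decomposition(
    "<|task|>Clean the kitchen<|steps|>Clear counters"
    "<|sep|>deep breath<|sep|>Load dishwasher<|end|>"
)
print(task, len(steps))  # Clean the kitchen 3
```

Note that in the ADHD variant, regulation actions ("deep breath" above) appear as ordinary `<|sep|>`-delimited steps, so a step count includes them.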

## Reproduce the head-swap finding

Open the Colab notebook at notebooks/InterpGPT_HeadSwap.ipynb (https://github.com/cwklurks/interpgpt). It runs end-to-end on the Colab free tier in under 15 minutes.

## License

MIT.

## Citation

See the standard model card.