---
license: mit
library_name: transformers
tags:
- interpretability
- mechanistic-interpretability
- task-decomposition
- small-language-model
- transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — ADHD Model (23M)

Part of the **InterpGPT** matched-pair release. This is the **ADHD** model;
its counterpart is
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M).
Both models share identical architecture and training recipe; only the
training data distribution differs.

**ADHD variant training data**: task decompositions broken into smaller steps
with interleaved micro-regulation actions ("sip water", "deep breath",
"close eyes briefly", "quick stretch", "pause").

| Spec | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on the ADHD-variant task-decomposition corpus |
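
For TransformerLens users, the spec table maps onto a `HookedTransformerConfig`
roughly as follows. This is a sketch of an assumed mapping, not the config
shipped in the repo: SwiGLU is expressed as a gated MLP with SiLU activation
(assuming a TransformerLens version with `gated_mlp` support), and embedding
tying is handled at weight-loading time rather than by a config field.

```python
from transformer_lens import HookedTransformerConfig

# Assumed TransformerLens equivalent of the spec table above (a sketch,
# not the repo's actual config file).
cfg = HookedTransformerConfig(
    n_layers=6,
    n_heads=8,
    d_model=512,
    d_head=64,
    d_mlp=1408,
    d_vocab=8192,
    n_ctx=512,
    act_fn="silu",                 # SwiGLU = SiLU activation + gating
    gated_mlp=True,                # Llama-style gate projection
    normalization_type="RMS",      # RMSNorm
    eps=1e-6,
    positional_embedding_type="rotary",
    rotary_base=10000,
    rotary_adjacent_pairs=False,   # "half-half" rotary layout
)
```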

## Headline findings (Phase 1)

- **Structural head-position swap.** A step-layout-broadcast head lives at
**L3H0** in the standard model and at **L3H5** in the ADHD model. The
cross-model cosine similarity of per-position attention profiles is **0.997**
at the matched pair, versus same-index baselines of 0.663 and 0.643. Causal
ablation confirms the functional identity: ablating L3H5 in the ADHD model
drops the Spearman correlation between task complexity and step count from
0.83 to 0.78 (median Δ = -0.055 across 5 seeds). A minimal sketch of this
comparison appears after this list.
- **Block-2 content circuit.** P(regulation token) at step-onset positions
jumps 17× between layer 1 and layer 2 (0.014 → 0.251); the standard model
never crosses 1% at any layer. This measurement is also sketched below.
- **High-specificity null-steering feature.** An ADHD-L2 SAE feature
(feat 2504) fires at 59% of ADHD step-onsets vs 0.03% of standard step-onsets
(~2000× cross-model asymmetry), yet **causal steering on its decoder
direction produces Δ within sampling noise under all four intervention
variants** (inject-std, subtract-adhd, zero-ablate, inject-upstream).
See the companion SAE repo
[`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).
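
The matched-pair comparison in the first bullet reduces to a few lines of
TransformerLens. Below is a minimal sketch, not the evaluation code:
`model_std` and `model_adhd` are hypothetical handles for the two models loaded
as `HookedTransformer`s (see Loading), and averaging each head's attention
pattern over query positions is one plausible reading of "per-position
attention profile".

```python
import torch

def head_profile(model, tokens, layer, head):
    # Per-source-position profile: mean of one head's attention pattern
    # over query positions.
    _, cache = model.run_with_cache(tokens)
    pattern = cache["pattern", layer][0, head]  # (query_pos, key_pos)
    return pattern.mean(dim=0)

# The two models share the custom BPE tokenizer, so one set of token ids works.
tokens = model_std.to_tokens("<|task|>Clean the kitchen<|steps|>")
p_std = head_profile(model_std, tokens, layer=3, head=0)    # standard: L3H0
p_adhd = head_profile(model_adhd, tokens, layer=3, head=5)  # ADHD: L3H5
print(torch.nn.functional.cosine_similarity(p_std, p_adhd, dim=0).item())
```

The block-2 measurement can likewise be read as a logit-lens sweep: decode each
layer's residual stream at step-onset positions and sum probability mass over
the regulation-action token ids. In this sketch, `onsets` and `regulation_ids`
are assumed inputs (step-onset positions and regulation-token ids), and
treating regulation actions as single tokens is a simplification.

```python
def p_regulation_by_layer(model, tokens, onsets, regulation_ids):
    _, cache = model.run_with_cache(tokens)
    # (n_layers+1, batch, pos, d_model), with the final LN applied per layer
    resid = cache.accumulated_resid(apply_ln=True)
    logits = resid @ model.W_U                # logit lens at each layer
    probs = logits.softmax(-1)[:, 0, onsets]  # (n_layers+1, n_onsets, vocab)
    return probs[..., regulation_ids].sum(-1).mean(-1)  # per-layer P(regulation)
```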

## Loading

Identical to the standard variant. See
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M)
for `AutoModel`, TransformerLens, and raw-TaskGPT examples, substituting the
repo id.
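
For convenience, a minimal `transformers` sketch follows. Two assumptions are
flagged: the repo id is inferred from the standard variant's naming scheme
(substitute this page's actual id), and `trust_remote_code=True` is included
in case the repo ships custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "connaaa/interpgpt-adhd-23M"  # assumed id; check the page URL

tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "<|task|>Clean the kitchen<|steps|>"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0]))
```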

## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```
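
On one reading, the "step-onset positions" referenced in the findings are the
token positions immediately following `<|steps|>` and each `<|sep|>`. A
hypothetical helper (`steps_id` and `sep_id` must be looked up in the shipped
tokenizer; the ids below are placeholders):

```python
def step_onset_positions(token_ids, steps_id, sep_id):
    # Indices of tokens that immediately follow a <|steps|> or <|sep|> marker.
    return [i + 1 for i, t in enumerate(token_ids[:-1]) if t in (steps_id, sep_id)]

# Placeholder ids, for illustration only:
print(step_onset_positions([5, 9, 100, 7, 101, 102, 7, 103, 6], steps_id=9, sep_id=7))
# -> [2, 4, 7]
```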

## Reproduce the head-swap finding

Open the Colab notebook `notebooks/InterpGPT_HeadSwap.ipynb` from
[github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt). It runs
end-to-end on the Colab free tier in under 15 minutes.

## License

MIT.

## Citation

See the standard model card,
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M).