---
license: mit
library_name: transformers
tags:
- interpretability
- mechanistic-interpretability
- task-decomposition
- small-language-model
- transformer-lens
pipeline_tag: text-generation
---
# InterpGPT — ADHD Model (23M)
Part of the **InterpGPT** matched-pair release. This is the **ADHD** model;
its counterpart is
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M).
Both models share identical architecture and training recipe; only the
training data distribution differs.
**ADHD variant training data**: task decompositions broken into smaller steps
with interleaved micro-regulation actions ("sip water", "deep breath",
"close eyes briefly", "quick stretch", "pause").
| | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on ADHD-variant task-decomposition corpus |
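
For orientation, a minimal sketch mapping the table onto a TransformerLens
config. This is not the training code: field names follow
`HookedTransformerConfig`, and the SwiGLU flags (`act_fn="silu"`,
`gated_mlp=True`) plus `rotary_base` are assumptions that may vary by
TransformerLens version.

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=6,
    n_heads=8,
    d_model=512,
    d_head=64,
    d_mlp=1408,
    d_vocab=8192,
    n_ctx=512,
    act_fn="silu",                       # SiLU gate; with gated_mlp this is SwiGLU
    gated_mlp=True,                      # assumption: flag name for the gated MLP
    normalization_type="RMS",            # RMSNorm
    eps=1e-6,
    positional_embedding_type="rotary",
    rotary_dim=64,                       # rotate the full d_head
    rotary_adjacent_pairs=False,         # "half-half" RoPE layout
    rotary_base=10000,
)
model = HookedTransformer(cfg)           # tied embeddings are applied at
                                         # weight-conversion time (assumption)
```
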
## Headline findings (Phase 1)
- **Structural head-position swap.** A step-layout-broadcast head lives at
**L3H0** in the standard model and at **L3H5** in the ADHD model. The
cross-model per-position attention-profile cosine at the matched pair is
**0.997**, versus a same-index baseline of **0.66** (0.663 for one pair,
0.643 for the other). Causal ablation confirms the functional identity:
ablating L3H5 in the ADHD model drops Spearman(task_complexity, step_count)
from 0.83 to 0.78 (median Δ = -0.055 across 5 seeds). The first sketch after
this list shows how the cosine comparison is computed.
- **Block-2 content circuit.** P(regulation token) at step-onset positions
jumps 17× between layer 1 and layer 2 (0.014 → 0.251). The standard model
never crosses 1% at any layer. The second sketch after this list shows the
layer-wise measurement.
- **High-specificity null-steering feature.** An ADHD-L2 SAE feature
(feat 2504) fires at 59% of ADHD step-onsets vs 0.03% of standard step-onsets
(~2000× cross-model asymmetry), yet **causal steering on its decoder
direction produces Δ within sampling noise under all four intervention
variants** (inject-std, subtract-adhd, zero-ablate, inject-upstream).
See the companion SAE repo
[`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).
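
A minimal sketch of the attention-profile comparison behind the head-swap
finding. It assumes both models are loaded as TransformerLens
`HookedTransformer`s (see the standard card for conversion); the reported
numbers average per-position profiles over an eval set, whereas this sketch
compares patterns on a single prompt.

```python
import torch
from transformer_lens import HookedTransformer


def head_pattern(model: HookedTransformer, prompt: str, layer: int, head: int) -> torch.Tensor:
    """Flattened attention pattern [dest_pos * src_pos] for a single head."""
    _, cache = model.run_with_cache(prompt)
    return cache["pattern", layer][0, head].flatten()


def pattern_cosine(model_a, model_b, prompt, head_a, head_b) -> float:
    """Cosine similarity between two heads' attention patterns on one prompt."""
    a = head_pattern(model_a, prompt, *head_a)
    b = head_pattern(model_b, prompt, *head_b)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()


# Usage, with `std` / `adhd` loaded as HookedTransformers per the standard card:
# prompt = "<|task|>Clean the kitchen<|steps|>"
# pattern_cosine(std, adhd, prompt, (3, 0), (3, 5))  # matched pair, reported 0.997
# pattern_cosine(std, adhd, prompt, (3, 0), (3, 0))  # same-index baseline, ~0.66
```
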
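A minimal sketch of the layer-wise P(regulation token) measurement behind the
block-2 finding, using a logit-lens readout of the post-block residual stream.
The onset position and the regulation-token id list are assumptions; the
paper's exact step-onset definition may differ.

```python
import torch
from transformer_lens import HookedTransformer


def regulation_prob_by_layer(model: HookedTransformer, prompt: str,
                             onset_pos: int, reg_token_ids: list[int]) -> list[float]:
    """Logit-lens P(regulation token) at one step-onset position, per layer:
    unembed the post-block residual stream and sum the probability mass
    assigned to the regulation-token ids."""
    _, cache = model.run_with_cache(prompt)
    probs = []
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer][:, onset_pos:onset_pos + 1]  # [1, 1, d_model]
        logits = model.unembed(model.ln_final(resid))[0, 0]             # [d_vocab]
        probs.append(logits.softmax(-1)[reg_token_ids].sum().item())
    return probs


# Usage: onset_pos is the position right after a <|sep|> token, and
# reg_token_ids are the ids of regulation-action tokens (e.g. the BPE
# pieces of "sip water", "deep breath"); both are assumptions here.
```
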
## Loading
Loading is identical to the standard variant. See
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M)
for `AutoModel`, TransformerLens, and raw-TaskGPT examples, substituting this
repo's id. A minimal `transformers` sketch follows.
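
A minimal sketch, assuming the repo works with the standard `transformers`
auto classes. The repo id and the `trust_remote_code` flag are assumptions
(the latter in case the TaskGPT architecture ships custom modeling code);
the standard card has the canonical examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: repo id mirrors the standard variant's naming.
repo_id = "connaaa/interpgpt-adhd-23M"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```
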
## Input format
```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```
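
A hedged generation example in this format, reusing `tokenizer` and `model`
from the loading sketch above; sampling settings are illustrative, not the
evaluation configuration.

```python
prompt = "<|task|>Clean the kitchen<|steps|>"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8)
# Keep special tokens visible so the <|sep|> step delimiters show up.
print(tokenizer.decode(output[0], skip_special_tokens=False))
```
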
## Reproduce the head-swap finding
Open the Colab notebook `notebooks/InterpGPT_HeadSwap.ipynb` from the project
repo (https://github.com/cwklurks/interpgpt). It runs end-to-end on the Colab
free tier in under 15 minutes.
## License
MIT.
## Citation
See the standard model card.