---
license: mit
library_name: transformers
tags:
- interpretability
- mechanistic-interpretability
- task-decomposition
- small-language-model
- transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — ADHD Model (23M)

Part of the **InterpGPT** matched-pair release. This is the **ADHD** model;
its counterpart is
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M).
Both models share identical architecture and training recipe; only the
training data distribution differs.

**ADHD variant training data**: task decompositions broken into smaller steps
with interleaved micro-regulation actions ("sip water", "deep breath",
"close eyes briefly", "quick stretch", "pause").

| Spec | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on the ADHD-variant task-decomposition corpus |
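
For TransformerLens users, the spec table maps onto a `HookedTransformerConfig`
roughly as follows. This is a sketch of an assumed mapping, not the config
shipped in the repo: SwiGLU is expressed as a gated MLP with SiLU activation
(assuming a TransformerLens version with `gated_mlp` support), and embedding
tying is handled at weight-loading time rather than by a config field.

```python
from transformer_lens import HookedTransformerConfig

# Assumed TransformerLens equivalent of the spec table above (a sketch,
# not the repo's actual config file).
cfg = HookedTransformerConfig(
    n_layers=6,
    n_heads=8,
    d_model=512,
    d_head=64,
    d_mlp=1408,
    d_vocab=8192,
    n_ctx=512,
    act_fn="silu",                 # SwiGLU = SiLU activation + gating
    gated_mlp=True,                # Llama-style gate projection
    normalization_type="RMS",      # RMSNorm
    eps=1e-6,
    positional_embedding_type="rotary",
    rotary_base=10000,
    rotary_adjacent_pairs=False,   # "half-half" rotary layout
)
```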

## Headline findings (Phase 1)

- **Structural head-position swap.** A step-layout-broadcast head lives at
**L3H0** in the standard model and at **L3H5** in the ADHD model. The
cross-model cosine similarity of per-position attention profiles is **0.997**
at the matched pair, versus same-index baselines of 0.663 and 0.643. Causal
ablation confirms the functional identity: ablating L3H5 in the ADHD model
drops the Spearman correlation between task complexity and step count from
0.83 to 0.78 (median Δ = -0.055 across 5 seeds). A minimal sketch of this
comparison appears after this list.
- **Block-2 content circuit.** P(regulation token) at step-onset positions
jumps 17× between layer 1 and layer 2 (0.014 → 0.251); the standard model
never crosses 1% at any layer. This measurement is also sketched below.
- **High-specificity null-steering feature.** An ADHD-L2 SAE feature
(feat 2504) fires at 59% of ADHD step-onsets vs 0.03% of standard step-onsets
(~2000× cross-model asymmetry), yet **causal steering on its decoder
direction produces Δ within sampling noise under all four intervention
variants** (inject-std, subtract-adhd, zero-ablate, inject-upstream).
See the companion SAE repo
[`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).
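
The matched-pair comparison in the first bullet reduces to a few lines of
TransformerLens. Below is a minimal sketch, not the evaluation code:
`model_std` and `model_adhd` are hypothetical handles for the two models loaded
as `HookedTransformer`s (see Loading), and averaging each head's attention
pattern over query positions is one plausible reading of "per-position
attention profile".

```python
import torch

def head_profile(model, tokens, layer, head):
    # Per-source-position profile: mean of one head's attention pattern
    # over query positions.
    _, cache = model.run_with_cache(tokens)
    pattern = cache["pattern", layer][0, head]  # (query_pos, key_pos)
    return pattern.mean(dim=0)

# The two models share the custom BPE tokenizer, so one set of token ids works.
tokens = model_std.to_tokens("<|task|>Clean the kitchen<|steps|>")
p_std = head_profile(model_std, tokens, layer=3, head=0)    # standard: L3H0
p_adhd = head_profile(model_adhd, tokens, layer=3, head=5)  # ADHD: L3H5
print(torch.nn.functional.cosine_similarity(p_std, p_adhd, dim=0).item())
```

The block-2 measurement can likewise be read as a logit-lens sweep: decode each
layer's residual stream at step-onset positions and sum probability mass over
the regulation-action token ids. In this sketch, `onsets` and `regulation_ids`
are assumed inputs (step-onset positions and regulation-token ids), and
treating regulation actions as single tokens is a simplification.

```python
def p_regulation_by_layer(model, tokens, onsets, regulation_ids):
    _, cache = model.run_with_cache(tokens)
    # (n_layers+1, batch, pos, d_model), with the final LN applied per layer
    resid = cache.accumulated_resid(apply_ln=True)
    logits = resid @ model.W_U                # logit lens at each layer
    probs = logits.softmax(-1)[:, 0, onsets]  # (n_layers+1, n_onsets, vocab)
    return probs[..., regulation_ids].sum(-1).mean(-1)  # per-layer P(regulation)
```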

## Loading

Identical to the standard variant. See
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M)
for `AutoModel`, TransformerLens, and raw-TaskGPT examples, substituting the
repo id.
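
For convenience, a minimal `transformers` sketch follows. Two assumptions are
flagged: the repo id is inferred from the standard variant's naming scheme
(substitute this page's actual id), and `trust_remote_code=True` is included
in case the repo ships custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "connaaa/interpgpt-adhd-23M"  # assumed id; check the page URL

tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "<|task|>Clean the kitchen<|steps|>"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0]))
```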

## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```
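
On one reading, the "step-onset positions" referenced in the findings are the
token positions immediately following `<|steps|>` and each `<|sep|>`. A
hypothetical helper (`steps_id` and `sep_id` must be looked up in the shipped
tokenizer; the ids below are placeholders):

```python
def step_onset_positions(token_ids, steps_id, sep_id):
    # Indices of tokens that immediately follow a <|steps|> or <|sep|> marker.
    return [i + 1 for i, t in enumerate(token_ids[:-1]) if t in (steps_id, sep_id)]

# Placeholder ids, for illustration only:
print(step_onset_positions([5, 9, 100, 7, 101, 102, 7, 103, 6], steps_id=9, sep_id=7))
# -> [2, 4, 7]
```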

## Reproduce the head-swap finding

Open the Colab notebook `notebooks/InterpGPT_HeadSwap.ipynb` from
[github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt). It runs
end-to-end on the Colab free tier in under 15 minutes.

## License

MIT.

## Citation

See the standard model card,
[`connaaa/interpgpt-standard-23M`](https://huggingface.co/connaaa/interpgpt-standard-23M).