---
license: mit
library_name: transformers
tags:
  - interpretability
  - mechanistic-interpretability
  - task-decomposition
  - small-language-model
  - transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — Standard Model (23M)

Part of the **InterpGPT** matched-pair release. This is the **standard** model;
its counterpart is [`connaaa/interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M).
Both models share identical architecture and training recipe; only the training
data distribution differs.

| Property | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on the task-decomposition corpus |

## What is this model for?

Given a task prompt, the model writes a step-by-step decomposition. The
**standard** variant was trained on normal task decompositions (tasks → subtasks
in straightforward order). The **ADHD** counterpart was trained on decompositions
with smaller steps and interleaved micro-regulation actions (e.g. "sip water",
"deep breath", "quick stretch").

The pair is the subject of a mechanistic-interpretability study.
Phase 1 headline findings:

- **Structural head-position swap.** A step-layout-broadcast head lives at
  **L3H0** in the standard model and at **L3H5** in the ADHD model.
  Cross-model per-position attention-profile cosine similarity is **0.997**
  for the matched (different-index) pair vs **0.66** for the same-index
  baseline (a sketch of this comparison follows the list).
- **Block-2 content circuit.** P(regulation token) at step-onset positions jumps
  17× between layer 1 and layer 2 in the ADHD model (0.014 → 0.251); the
  standard model never crosses 1% at any layer.
- **High-specificity null-steering SAE feature.** See the companion SAE repo
  [`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).
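
A minimal sketch of the attention-profile comparison behind the first bullet,
assuming each head's profile is its TransformerLens `pattern` activation (one
attention distribution over key positions per query position) captured on a
shared prompt; the exact prompt set and aggregation used in the study may
differ:

```python
import torch
import torch.nn.functional as F

def head_profile_similarity(pattern_a: torch.Tensor, pattern_b: torch.Tensor) -> float:
    """Mean per-query-position cosine similarity between two heads' attention
    profiles. Each input is a [query_pos, key_pos] pattern for one head,
    captured on the same tokenized prompt in each model."""
    sims = F.cosine_similarity(pattern_a, pattern_b, dim=-1)  # one value per query position
    return sims.mean().item()

# Hypothetical usage, with `cache_std` / `cache_adhd` obtained from
# HookedTransformer.run_with_cache on the same prompt in each model:
#   matched  = head_profile_similarity(cache_std["pattern", 3][0, 0],
#                                      cache_adhd["pattern", 3][0, 5])
#   baseline = head_profile_similarity(cache_std["pattern", 3][0, 0],
#                                      cache_adhd["pattern", 3][0, 0])
```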

## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```

## Loading

### HuggingFace Transformers (custom code)

```python
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(
    "connaaa/interpgpt-standard-23M", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "connaaa/interpgpt-standard-23M"
)
```
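
As an illustration of the input format above, a prompt can be built with the
task prefix and encoded with the tokenizer just loaded (prompting with only the
`<|task|>...<|steps|>` prefix and letting the model continue is an assumption
about the intended inference usage):

```python
# Prefix-only prompt; the model is expected to continue with step text.
prompt = "<|task|>Clean the kitchen<|steps|>"
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs["input_ids"].shape)  # (1, prompt_length) ids in the 8192-token custom BPE vocab
```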

### TransformerLens (recommended for interpretability)

The repo ships a TransformerLens-compatible bundle at `hooked_transformer.pt`:

```python
from huggingface_hub import hf_hub_download
from transformer_lens import HookedTransformer, HookedTransformerConfig
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M", "hooked_transformer.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)
# Keep only keys that HookedTransformerConfig actually defines, and drop values
# that were serialized as strings like "torch.float32" (e.g. dtype fields).
cfg_keep = {
    k: v for k, v in blob["config"].items()
    if k in HookedTransformerConfig.__dataclass_fields__ and not (
        isinstance(v, str) and v.startswith("torch.")
    )
}
cfg = HookedTransformerConfig(**cfg_keep)
model = HookedTransformer(cfg)
model.load_state_dict(blob["model_state_dict"])
model.eval()
```
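
With the HookedTransformer loaded, attention patterns can be inspected via
`run_with_cache`. A minimal sketch follows; the tokenizer is loaded separately
because the bundle above does not attach one, and the prompt text is
illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("connaaa/interpgpt-standard-23M")
tokens = tokenizer(
    "<|task|>Clean the kitchen<|steps|>", return_tensors="pt"
)["input_ids"]

logits, cache = model.run_with_cache(tokens)

# Layer-3 attention patterns: [batch, n_heads, query_pos, key_pos].
layer3_pattern = cache["pattern", 3]
# Head 0 of layer 3 is the step-layout-broadcast head reported for the
# standard model in the Phase 1 findings above.
print(layer3_pattern[0, 0].shape)
```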

### Raw PyTorch / original TaskGPT class

```python
# Pairs with gpt_model.py from https://github.com/cwklurks/interpgpt
from huggingface_hub import hf_hub_download
from gpt_model import GPTConfig, TaskGPT
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M", "pytorch_model.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)
model = TaskGPT(GPTConfig(**blob["config"]))
model.load_state_dict(blob["model_state_dict"])
```

## Reproduce the head-swap finding

Open the companion Colab:
**`notebooks/InterpGPT_HeadSwap.ipynb`** at
[github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt).
An end-to-end run on the Colab free tier reproduces the 0.997 vs 0.66
comparison in under 15 minutes.

## Training data

A custom task-decomposition corpus in two variants (standard vs ADHD),
generated from the same task pool. Detailed dataset notes and generation
scripts live in the main repo (`preprocess.py`, `merge_data.py`,
`rebuild_data.py`, `fix_adhd_data.py`, `shorten_adhd_steps.py`).

## License

MIT.

## Intended use

Interpretability research. The model is intentionally small and
domain-specific; it is **not** intended as a general-purpose chatbot.

## Citation

```bibtex
@misc{interpgpt2026,
  title  = {{InterpGPT}: A matched-pair interpretability study of task-decomposition models},
  author = {Klann, Connor},
  year   = {2026},
  url    = {https://github.com/cwklurks/interpgpt}
}
```