---
license: mit
library_name: transformers
tags:
  - interpretability
  - mechanistic-interpretability
  - task-decomposition
  - small-language-model
  - transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — Standard Model (23M)

Part of the **InterpGPT** matched-pair release. This is the **standard** model; its counterpart is [`connaaa/interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M). Both models share identical architecture and training recipe; only the training data distribution differs.

| | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on the task-decomposition corpus |

## What is this model for?

Given a task prompt, the model writes a step-by-step decomposition. The **standard** variant was trained on normal task decompositions (tasks → subtasks in straightforward order). The **ADHD** counterpart was trained on decompositions with smaller steps and interleaved micro-regulation actions (e.g. "sip water", "deep breath", "quick stretch").

The pair is the subject of a mechanistic-interpretability study. Phase 1 headline findings:

- **Structural head-position swap.** A step-layout-broadcast head lives at **L3H0** in the standard model and at **L3H5** in the ADHD model. Cross-model per-position attention-profile cosine similarity is **0.997** at the matched (different-index) pair vs a same-index baseline of **0.66**. A minimal sketch of this comparison follows the list.
- **Block-2 content circuit.** P(regulation token) at step-onset positions jumps 17× between layer 1 and layer 2 in the ADHD model (0.014 → 0.251); the standard model never crosses 1% at any layer.
- **High-specificity null-steering SAE feature.** See the companion SAE repo [`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).
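The sketch below is illustrative, not the study's exact measurement: it assumes `model_std` and `model_adhd` are `HookedTransformer` instances with the repo tokenizer attached via `model.set_tokenizer(...)` (see **Loading** below), that `prompts` is a small list of `<|task|>…<|steps|>…` strings, and that "per-position attention profile" can be approximated by the head's mean attention pattern; the exact prompt set and profile definition used in the study live in the companion notebook.

```python
import torch
import torch.nn.functional as F

def head_attention_profile(model, prompts, layer, head):
    """Average attention pattern of one head over a set of prompts.

    Returns a (dest_pos, src_pos) tensor truncated to the shortest prompt;
    one simple notion of a per-position attention profile.
    """
    patterns = []
    for p in prompts:
        _, cache = model.run_with_cache(p)
        # cache["pattern", layer]: (batch, n_heads, dest_pos, src_pos)
        patterns.append(cache["pattern", layer][0, head])
    n = min(pat.shape[0] for pat in patterns)
    return torch.stack([pat[:n, :n] for pat in patterns]).mean(dim=0)

prof_std = head_attention_profile(model_std, prompts, layer=3, head=0)    # standard L3H0
prof_adhd = head_attention_profile(model_adhd, prompts, layer=3, head=5)  # ADHD L3H5

# Compare the two profiles over their common positions.
n = min(prof_std.shape[0], prof_adhd.shape[0])
cos = F.cosine_similarity(prof_std[:n, :n].flatten(), prof_adhd[:n, :n].flatten(), dim=0)
print(f"cross-model attention-profile cosine similarity: {cos.item():.3f}")
```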
## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```

A minimal generation sketch using this format appears at the end of this card.

## Loading

### HuggingFace Transformers (custom code)

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "connaaa/interpgpt-standard-23M",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "connaaa/interpgpt-standard-23M"
)
```

### TransformerLens (recommended for interpretability)

The repo ships a TransformerLens-compatible bundle at `hooked_transformer.pt`:

```python
from huggingface_hub import hf_hub_download
from transformer_lens import HookedTransformer, HookedTransformerConfig
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M",
    "hooked_transformer.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)

cfg_keep = {
    k: v
    for k, v in blob["config"].items()
    if k in HookedTransformerConfig.__dataclass_fields__
    and not (isinstance(v, str) and v.startswith("torch."))
}
cfg = HookedTransformerConfig(**cfg_keep)

model = HookedTransformer(cfg)
model.load_state_dict(blob["model_state_dict"])
model.eval()
```

### Raw PyTorch / original TaskGPT class

```python
# Pairs with gpt_model.py from https://github.com/cwklurks/interpgpt
from huggingface_hub import hf_hub_download
from gpt_model import GPTConfig, TaskGPT
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M",
    "pytorch_model.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)
model = TaskGPT(GPTConfig(**blob["config"]))
model.load_state_dict(blob["model_state_dict"])
```

## Reproduce the head-swap finding

Open the companion Colab: **`notebooks/InterpGPT_HeadSwap.ipynb`** at [github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt). An end-to-end run on the Colab free tier reproduces the 0.997 vs 0.66 comparison in under 15 minutes.

## Training data

Custom task-decomposition corpus, two variants (standard vs ADHD) generated from the same task pool. Detailed dataset notes and generation scripts live in the main repo (`preprocess.py`, `merge_data.py`, `rebuild_data.py`, `fix_adhd_data.py`, `shorten_adhd_steps.py`).

## License

MIT.

## Intended use

Interpretability research. The model is intentionally small and domain-specific; it is **not** intended as a general-purpose chatbot.

## Citation

```bibtex
@misc{interpgpt2026,
  title  = {{InterpGPT}: A matched-pair interpretability study of task-decomposition models},
  author = {Klann, Connor},
  year   = {2026},
  url    = {https://github.com/cwklurks/interpgpt}
}
```
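## Minimal generation sketch (illustrative)

A minimal greedy-decoding sketch using the input format and the TransformerLens loading route above. It assumes `model` is the `HookedTransformer` from the Loading section with the repo tokenizer attached (e.g. `model.set_tokenizer(AutoTokenizer.from_pretrained("connaaa/interpgpt-standard-23M"))`), that `<|end|>` is a single token in the custom BPE vocab, and that no BOS token should be prepended; the decoding settings here are illustrative, not the study's.

```python
import torch

prompt = "<|task|>Clean the kitchen<|steps|>"
tokens = model.to_tokens(prompt, prepend_bos=False)   # shape (1, seq); assumes no BOS
end_id = model.to_single_token("<|end|>")             # assumes <|end|> is one BPE token

for _ in range(128):                                  # cap on generated tokens
    logits = model(tokens)                            # (1, seq, d_vocab)
    next_tok = logits[0, -1].argmax()                 # greedy pick of the next token
    tokens = torch.cat([tokens, next_tok.view(1, 1)], dim=-1)
    if next_tok.item() == end_id:
        break

print(model.to_string(tokens[0]))
```

If `<|end|>` is not a single token in this vocab, the stop check would need to decode the tail of the sequence and compare strings instead.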