---
license: mit
library_name: transformers
tags:
- interpretability
- mechanistic-interpretability
- task-decomposition
- small-language-model
- transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — Standard Model (23M)

Part of the **InterpGPT** matched-pair release. This is the **standard** model;
its counterpart is [`connaaa/interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M).
Both models share the same architecture and training recipe; only the training
data distribution differs.

| | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on the task-decomposition corpus |

## What is this model for?

Given a task prompt, the model writes a step-by-step decomposition. The
**standard** variant was trained on normal task decompositions (tasks → subtasks
in straightforward order). The **ADHD** counterpart was trained on decompositions
with smaller steps and interleaved micro-regulation actions (e.g. "sip water",
"deep breath", "quick stretch").

The pair is the subject of a mechanistic-interpretability study.
Phase 1 headline findings:

- **Structural head-position swap.** A step-layout-broadcast head lives at
  **L3H0** in the standard model and at **L3H5** in the ADHD model.
  Cross-model per-position attention profile cosine similarity is **0.997**
  at the matched (different-index) pair vs a same-index baseline of **0.66**.
- **Block-2 content circuit.** P(regulation token) at step-onset positions jumps
  17× between layer 1 and layer 2 in the ADHD model (0.014 → 0.251); the
  standard model never crosses 1% at any layer.
- **High-specificity null-steering SAE feature.** See the companion SAE repo
  [`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).

## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```

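At inference time you typically prompt with just the task prefix and let the model continue; a minimal sketch (the task text is illustrative and the stop condition is up to the caller):

```python
task = "Clean the kitchen"
prompt = f"<|task|>{task}<|steps|>"  # the model continues with "Step 1<|sep|>Step 2<|sep|>...<|end|>"
```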

## Loading

### HuggingFace Transformers (custom code)

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "connaaa/interpgpt-standard-23M", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "connaaa/interpgpt-standard-23M"
)
```

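The checkpoint is loaded through `AutoModel` with custom code, so a stock `generate()` helper may not be available. A minimal greedy-decoding sketch, assuming the forward pass accepts a token-id tensor and returns logits of shape `[batch, seq, vocab]` (or an object exposing `.logits`), and that the special tokens above are in the tokenizer's vocabulary:

```python
import torch

prompt = "<|task|>Clean the kitchen<|steps|>"
ids = tokenizer(prompt, return_tensors="pt").input_ids
end_id = tokenizer.convert_tokens_to_ids("<|end|>")

model.eval()
with torch.no_grad():
    for _ in range(512 - ids.shape[1]):        # stay within the 512-token context
        out = model(ids)
        logits = out.logits if hasattr(out, "logits") else out  # assumption: raw logits otherwise
        next_id = int(logits[0, -1].argmax())
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
        if next_id == end_id:                  # stop at <|end|>
            break

print(tokenizer.decode(ids[0]))
```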

### TransformerLens (recommended for interpretability)

The repo ships a TransformerLens-compatible bundle at `hooked_transformer.pt`:

```python
from huggingface_hub import hf_hub_download
from transformer_lens import HookedTransformer, HookedTransformerConfig
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M", "hooked_transformer.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)

# Keep only keys that HookedTransformerConfig understands, dropping
# serialized torch dtype strings (e.g. "torch.float32").
cfg_keep = {
    k: v for k, v in blob["config"].items()
    if k in HookedTransformerConfig.__dataclass_fields__ and not (
        isinstance(v, str) and v.startswith("torch.")
    )
}
cfg = HookedTransformerConfig(**cfg_keep)
model = HookedTransformer(cfg)
model.load_state_dict(blob["model_state_dict"])
model.eval()
```

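A HookedTransformer built from the saved config may not have a tokenizer attached, so the simplest route is to tokenize with the repo's HF tokenizer and pass token ids directly. A short sketch that caches activations and reads out the layer-3 attention patterns (the prompt text is illustrative):

```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("connaaa/interpgpt-standard-23M")
prompt = "<|task|>Clean the kitchen<|steps|>Wipe the counters<|sep|>"
tokens = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)

# Attention patterns for layer 3: [batch, head, query_pos, key_pos]
pattern = cache["pattern", 3]
print(pattern[0, 0])  # L3H0: the step-layout-broadcast head in this (standard) model
```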

### Raw PyTorch / original TaskGPT class

```python
# Pairs with gpt_model.py from https://github.com/cwklurks/interpgpt
from huggingface_hub import hf_hub_download
from gpt_model import GPTConfig, TaskGPT
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M", "pytorch_model.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)
model = TaskGPT(GPTConfig(**blob["config"]))
model.load_state_dict(blob["model_state_dict"])
model.eval()  # inference mode, matching the TransformerLens path above
```

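Whichever loading path you use, a quick sanity check is to compare the parameter count with the table above (the exact figure may shift slightly depending on how tied and auxiliary weights are counted):

```python
# Expected to be close to the 23,471,104 parameters listed above.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")
```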

## Reproduce the head-swap finding

Open the companion Colab notebook
**`notebooks/InterpGPT_HeadSwap.ipynb`** at
[github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt).
An end-to-end run on the Colab free tier reproduces the 0.997 vs 0.66 comparison
in under 15 minutes.

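For orientation, a minimal sketch of that comparison, assuming both checkpoints are loaded as HookedTransformers (named `model_std` and `model_adhd` here) and a shared batch of prompts has been tokenized to `tokens`; the notebook's exact prompt set and aggregation may differ:

```python
import torch
import torch.nn.functional as F

def head_profile(model, tokens, layer, head):
    # Batch-averaged attention pattern for one head: [query_pos, key_pos]
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens)
    return cache["pattern", layer][:, head].mean(dim=0)

p_std = head_profile(model_std, tokens, layer=3, head=0).flatten()   # standard: L3H0
p_adhd = head_profile(model_adhd, tokens, layer=3, head=5).flatten()  # ADHD: L3H5

print(F.cosine_similarity(p_std, p_adhd, dim=0).item())  # reported value: ~0.997
```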

## Training data

Custom task-decomposition corpus in two variants (standard vs ADHD), generated
from the same task pool. Detailed dataset notes and generation scripts live in
the main repo (`preprocess.py`, `merge_data.py`, `rebuild_data.py`,
`fix_adhd_data.py`, `shorten_adhd_steps.py`).

## License

MIT.

## Intended use

Interpretability research. The model is intentionally small and
domain-specific; **not** intended as a general-purpose chatbot.

## Citation

```bibtex
@misc{interpgpt2026,
  title  = {{InterpGPT}: A matched-pair interpretability study of task-decomposition models},
  author = {Klann, Connor},
  year   = {2026},
  url    = {https://github.com/cwklurks/interpgpt}
}
```