---
license: mit
library_name: transformers
tags:
  - interpretability
  - mechanistic-interpretability
  - task-decomposition
  - small-language-model
  - transformer-lens
pipeline_tag: text-generation
---

# InterpGPT — Standard Model (23M)

Part of the **InterpGPT** matched-pair release. This is the **standard** model; its counterpart is [`connaaa/interpgpt-adhd-23M`](https://huggingface.co/connaaa/interpgpt-adhd-23M). Both models share identical architecture and training recipe; only the training data distribution differs.

| | Value |
|---|---|
| Parameters | 23,471,104 |
| Layers | 6 |
| Heads | 8 |
| d_model | 512 |
| d_head | 64 |
| d_mlp (SwiGLU) | 1408 |
| Vocab | 8192 (custom BPE) |
| Context length | 512 |
| Norm | RMSNorm (ε = 1e-6) |
| Position | RoPE (half-half, base 10,000) |
| Activation | SwiGLU |
| Biases | none |
| Tied input/output embeddings | yes |
| Training | ~25k steps on the task-decomposition corpus |

## What is this model for?

Given a task prompt, the model writes a step-by-step decomposition. The **standard** variant was trained on normal task decompositions (tasks → subtasks in straightforward order). The **ADHD** counterpart was trained on decompositions with smaller steps and interleaved micro-regulation actions (e.g. "sip water", "deep breath", "quick stretch").

The pair is the subject of a mechanistic-interpretability study. Phase 1 headline findings:

- **Structural head-position swap.** A step-layout-broadcast head lives at **L3H0** in the standard model and at **L3H5** in the ADHD model. Cross-model per-position attention-profile cosine similarity is **0.997** at the matched (different-index) pair vs a same-index baseline of **0.66**. A minimal sketch of this comparison follows the list.
- **Block-2 content circuit.** P(regulation token) at step-onset positions jumps 17× between layer 1 and layer 2 in the ADHD model (0.014 → 0.251); the standard model never crosses 1% at any layer.
- **High-specificity null-steering SAE feature.** See the companion SAE repo [`connaaa/interpgpt-sae-phase5`](https://huggingface.co/connaaa/interpgpt-sae-phase5).
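The sketch below is illustrative, not the study's exact measurement: it assumes `model_std` and `model_adhd` are `HookedTransformer` instances with the repo tokenizer attached via `model.set_tokenizer(...)` (see **Loading** below), that `prompts` is a small list of `<|task|>…<|steps|>…` strings, and that "per-position attention profile" can be approximated by the head's mean attention pattern; the exact prompt set and profile definition used in the study live in the companion notebook.

```python
import torch
import torch.nn.functional as F

def head_attention_profile(model, prompts, layer, head):
    """Average attention pattern of one head over a set of prompts.

    Returns a (dest_pos, src_pos) tensor truncated to the shortest prompt;
    one simple notion of a per-position attention profile.
    """
    patterns = []
    for p in prompts:
        _, cache = model.run_with_cache(p)
        # cache["pattern", layer]: (batch, n_heads, dest_pos, src_pos)
        patterns.append(cache["pattern", layer][0, head])
    n = min(pat.shape[0] for pat in patterns)
    return torch.stack([pat[:n, :n] for pat in patterns]).mean(dim=0)

prof_std = head_attention_profile(model_std, prompts, layer=3, head=0)    # standard L3H0
prof_adhd = head_attention_profile(model_adhd, prompts, layer=3, head=5)  # ADHD L3H5

# Compare the two profiles over their common positions.
n = min(prof_std.shape[0], prof_adhd.shape[0])
cos = F.cosine_similarity(prof_std[:n, :n].flatten(), prof_adhd[:n, :n].flatten(), dim=0)
print(f"cross-model attention-profile cosine similarity: {cos.item():.3f}")
```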
## Input format

```
<|task|>Clean the kitchen<|steps|>Step 1 text<|sep|>Step 2 text<|sep|>...<|end|>
```

A minimal generation sketch using this format appears at the end of this card.

## Loading

### HuggingFace Transformers (custom code)

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "connaaa/interpgpt-standard-23M",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "connaaa/interpgpt-standard-23M"
)
```

### TransformerLens (recommended for interpretability)

The repo ships a TransformerLens-compatible bundle at `hooked_transformer.pt`:

```python
from huggingface_hub import hf_hub_download
from transformer_lens import HookedTransformer, HookedTransformerConfig
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M",
    "hooked_transformer.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)

cfg_keep = {
    k: v
    for k, v in blob["config"].items()
    if k in HookedTransformerConfig.__dataclass_fields__
    and not (isinstance(v, str) and v.startswith("torch."))
}
cfg = HookedTransformerConfig(**cfg_keep)

model = HookedTransformer(cfg)
model.load_state_dict(blob["model_state_dict"])
model.eval()
```

### Raw PyTorch / original TaskGPT class

```python
# Pairs with gpt_model.py from https://github.com/cwklurks/interpgpt
from huggingface_hub import hf_hub_download
from gpt_model import GPTConfig, TaskGPT
import torch

path = hf_hub_download(
    "connaaa/interpgpt-standard-23M",
    "pytorch_model.pt"
)
blob = torch.load(path, map_location="cpu", weights_only=False)
model = TaskGPT(GPTConfig(**blob["config"]))
model.load_state_dict(blob["model_state_dict"])
```

## Reproduce the head-swap finding

Open the companion Colab: **`notebooks/InterpGPT_HeadSwap.ipynb`** at [github.com/cwklurks/interpgpt](https://github.com/cwklurks/interpgpt). An end-to-end run on the Colab free tier reproduces the 0.997 vs 0.66 comparison in under 15 minutes.

## Training data

Custom task-decomposition corpus, two variants (standard vs ADHD) generated from the same task pool. Detailed dataset notes and generation scripts live in the main repo (`preprocess.py`, `merge_data.py`, `rebuild_data.py`, `fix_adhd_data.py`, `shorten_adhd_steps.py`).

## License

MIT.

## Intended use

Interpretability research. The model is intentionally small and domain-specific; it is **not** intended as a general-purpose chatbot.

## Citation

```bibtex
@misc{interpgpt2026,
  title  = {{InterpGPT}: A matched-pair interpretability study of task-decomposition models},
  author = {Klann, Connor},
  year   = {2026},
  url    = {https://github.com/cwklurks/interpgpt}
}
```
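## Minimal generation sketch (illustrative)

A minimal greedy-decoding sketch using the input format and the TransformerLens loading route above. It assumes `model` is the `HookedTransformer` from the Loading section with the repo tokenizer attached (e.g. `model.set_tokenizer(AutoTokenizer.from_pretrained("connaaa/interpgpt-standard-23M"))`), that `<|end|>` is a single token in the custom BPE vocab, and that no BOS token should be prepended; the decoding settings here are illustrative, not the study's.

```python
import torch

prompt = "<|task|>Clean the kitchen<|steps|>"
tokens = model.to_tokens(prompt, prepend_bos=False)   # shape (1, seq); assumes no BOS
end_id = model.to_single_token("<|end|>")             # assumes <|end|> is one BPE token

for _ in range(128):                                  # cap on generated tokens
    logits = model(tokens)                            # (1, seq, d_vocab)
    next_tok = logits[0, -1].argmax()                 # greedy pick of the next token
    tokens = torch.cat([tokens, next_tok.view(1, 1)], dim=-1)
    if next_tok.item() == end_id:
        break

print(model.to_string(tokens[0]))
```

If `<|end|>` is not a single token in this vocab, the stop check would need to decode the tail of the sequence and compare strings instead.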