---
license: mit
library_name: transformers
tags:
- gpt2
- onnx
- mechanistic-interpretability
- circuit-ablation
- rti-circuit
base_model: openai-community/gpt2
---

# GPT-2 with RTI Circuit Mean-Ablated

GPT-2 (124M) with the 15-head **Repeated Token Identification (RTI) circuit** removed via mean ablation.

## What was ablated

The RTI circuit consists of 15 attention heads, written `layer.head` below, spanning 4 functional tiers:

| Tier | Heads | Function |
|------|-------|----------|
| Backbone | 0.8, 0.9, 0.11 | Broad token matching via positional/frequency features |
| Detector | 4.11 | Repeated-token detection gate |
| Copier | 4.0, 5.6, 5.7, 7.0, 8.4, 8.7, 9.3, 9.10 | Copy repeated token identity to output |
| Readout | 10.11, 11.9, 11.11 | Route copied information to final logits |
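
For reference, the same circuit written down as a plain data structure. This is a minimal sketch: the name `RTI_HEADS` and the `(layer, head)` tuple encoding are our own, not part of the released artifacts:

```python
# The 15 RTI circuit heads as (layer, head) pairs, grouped by tier.
RTI_HEADS = {
    "backbone": [(0, 8), (0, 9), (0, 11)],
    "detector": [(4, 11)],
    "copier":   [(4, 0), (5, 6), (5, 7), (7, 0), (8, 4), (8, 7), (9, 3), (9, 10)],
    "readout":  [(10, 11), (11, 9), (11, 11)],
}
```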

## Ablation method

**Mean ablation.** For each circuit head:
1. The head's mean output was computed over a dataset of 20 diverse text examples.
2. The head's mean contribution (`W_O @ mean_head_output`) was added to `c_proj.bias`.
3. The rows of `c_proj.weight` corresponding to that head were zeroed (Hugging Face's `Conv1D` stores W_O transposed, so these rows are the head's columns of W_O).

This replaces each head's input-dependent computation with its average output, preserving the head's unconditional contribution while removing its ability to respond to specific inputs.
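
A minimal PyTorch sketch of the procedure, assuming the standard Hugging Face `transformers` GPT-2 implementation. The function names and the one-sentence example corpus are our own; the released model used the 20-example dataset described above:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def head_mean_output(model, tokenizer, layer, head, texts):
    """Step 1: average one head's output over a small text dataset."""
    head_dim = model.config.n_embd // model.config.n_head
    lo, hi = head * head_dim, (head + 1) * head_dim
    total, count = torch.zeros(head_dim), 0

    def grab(module, args):
        # c_proj's input is the concatenation of all head outputs, so
        # slicing lo:hi of the last dimension isolates this head.
        nonlocal count
        merged = args[0]  # (batch, seq, n_embd)
        total.add_(merged[..., lo:hi].reshape(-1, head_dim).sum(dim=0))
        count += merged.shape[0] * merged.shape[1]

    handle = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(grab)
    with torch.no_grad():
        for text in texts:
            model(**tokenizer(text, return_tensors="pt"))
    handle.remove()
    return total / count

def mean_ablate_head(model, layer, head, mean_out):
    """Steps 2-3: fold the mean contribution into the bias, then cut the head."""
    head_dim = model.config.n_embd // model.config.n_head
    lo, hi = head * head_dim, (head + 1) * head_dim
    c_proj = model.transformer.h[layer].attn.c_proj
    with torch.no_grad():
        # c_proj computes x @ W + b, so rows lo:hi of W carry this head.
        c_proj.bias += mean_out @ c_proj.weight[lo:hi, :]  # step 2
        c_proj.weight[lo:hi, :] = 0.0                      # step 3

# Example: ablate the detector head 4.11 (placeholder corpus, not the real dataset).
model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")
mean = head_mean_output(model, tokenizer, layer=4, head=11, texts=["Hello world."])
mean_ablate_head(model, layer=4, head=11, mean_out=mean)
```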

## Effect

The ablated model loses the ability to predict repeated tokens:
- **Normal GPT-2**: "The cat sat on the mat. The cat" → " sat on the mat.\n\nThe cat sat"
- **Mean-ablated**: "The cat sat on the mat. The cat" → " was a little bit older than me, but I"
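
If the repository also ships PyTorch weights (this card only documents the Transformers.js/ONNX path, so treat that as an assumption), the comparison can be reproduced with a short greedy-decoding loop:

```python
from transformers import pipeline

prompt = "The cat sat on the mat. The cat"
for model_id in ["openai-community/gpt2", "elliottower2/gpt2-rti-mean-ablated"]:
    generate = pipeline("text-generation", model=model_id)
    out = generate(prompt, max_new_tokens=10, do_sample=False)
    print(model_id, "->", out[0]["generated_text"][len(prompt):])
```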

## Usage with Transformers.js

```javascript
import { AutoModelForCausalLM, AutoTokenizer } from '@huggingface/transformers';

const model = await AutoModelForCausalLM.from_pretrained('elliottower2/gpt2-rti-mean-ablated', {
  dtype: 'fp32',
});
const tokenizer = await AutoTokenizer.from_pretrained('elliottower2/gpt2-rti-mean-ablated');
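
// Sketch of a short greedy generation, following the Transformers.js v3 pattern
// (exact generation options may vary by version).
const inputs = tokenizer('The cat sat on the mat. The cat');
const outputs = await model.generate({ ...inputs, max_new_tokens: 10 });
console.log(tokenizer.batch_decode(outputs, { skip_special_tokens: true })[0]);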
```

## Citation

Part of the factorization-circuits project studying weight-space circuit discovery in transformers.