elliottower2's picture
Upload folder using huggingface_hub
9f3fe44 verified
---
license: mit
library_name: transformers
tags:
- gpt2
- onnx
- mechanistic-interpretability
- circuit-ablation
- rti-circuit
base_model: openai-community/gpt2
---
# GPT-2 with RTI Circuit Mean-Ablated
GPT-2 (124M) with the 15-head **Repeated Token Identification (RTI) circuit** removed via mean ablation.
## What was ablated
The RTI circuit consists of 15 attention heads across 4 functional tiers:
| Tier | Heads | Function |
|------|-------|----------|
| Backbone | 0.8, 0.9, 0.11 | Broad token matching via positional/frequency features |
| Detector | 4.11 | Repeated-token detection gate |
| Copier | 4.0, 5.6, 5.7, 7.0, 8.4, 8.7, 9.3, 9.10 | Copy repeated token identity to output |
| Readout | 10.11, 11.9, 11.11 | Route copied information to final logits |
## Ablation method
**Mean ablation**: For each circuit head:
1. The mean head output was computed across a dataset of 20 diverse text examples
2. The corresponding columns of `c_proj.weight` (W_O) were zeroed
3. The mean contribution (`W_O @ mean_head_output`) was added to `c_proj.bias`
This replaces each head's input-dependent computation with its average output, preserving the head's unconditional contribution while removing its ability to respond to specific inputs.
## Effect
The ablated model loses the ability to predict repeated tokens:
- **Normal GPT-2**: "The cat sat on the mat. The cat" → " was a little bit older than me, but I"
- **Mean-ablated**: "The cat sat on the mat. The cat" → " sat on the mat.\n\nThe cat sat"
## Usage with Transformers.js
```javascript
import { AutoModelForCausalLM, AutoTokenizer } from '@huggingface/transformers';
const model = await AutoModelForCausalLM.from_pretrained('elliottower2/gpt2-rti-mean-ablated', {
dtype: 'fp32',
});
const tokenizer = await AutoTokenizer.from_pretrained('elliottower2/gpt2-rti-mean-ablated');
```
## Citation
Part of the factorization-circuits project studying weight-space circuit discovery in transformers.