---
license: apache-2.0
tags:
- mechanistic-interpretability
- sparse-autoencoder
- crosscoder
- model-diffing
base_model:
- google/gemma-2-2b
- google/gemma-2-2b-it
---
# Cross-Model Crosscoder — Gemma-2-2B base vs IT (papergrade)
A BatchTopK crosscoder trained jointly on the layer-13 residual stream of
`google/gemma-2-2b` and `google/gemma-2-2b-it`. The dictionary
(73,728 latents) decomposes both models' activations into shared,
base-specific, and chat-specific features.
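
The BatchTopK sparsity step can be illustrated with a small pure-Python sketch. It is not the training code for this artifact: the function name `batch_topk` and the toy batch are hypothetical, and the per-model encoders and decoders of a crosscoder are omitted. The sketch keeps the `k * batch_size` largest positive pre-activations across the whole batch, so individual samples may use more or fewer than `k` latents while the batch average stays at `k`.

```python
from typing import List


def batch_topk(pre_acts: List[List[float]], k: int) -> List[List[float]]:
    """Keep the k * batch_size largest positive pre-activations in the batch.

    Hypothetical helper for illustration; zeros out everything else.
    """
    batch_size = len(pre_acts)
    budget = k * batch_size
    # Flatten to (value, row, col), keeping only positive entries.
    flat = [
        (value, i, j)
        for i, row in enumerate(pre_acts)
        for j, value in enumerate(row)
        if value > 0
    ]
    flat.sort(key=lambda item: item[0], reverse=True)
    sparse = [[0.0] * len(row) for row in pre_acts]
    for value, i, j in flat[:budget]:
        sparse[i][j] = value
    return sparse


# Toy batch: 2 samples, 4 latents, k = 2 (so 4 activations survive in total).
batch = [
    [0.9, 0.1, 0.8, 0.7],
    [0.2, 0.05, 0.0, 0.3],
]
sparse = batch_topk(batch, k=2)
l0_per_sample = [sum(1 for v in row if v != 0.0) for row in sparse]
mean_l0 = sum(l0_per_sample) / len(l0_per_sample)
print(l0_per_sample, mean_l0)
```

In the toy batch the first sample keeps three latents and the second keeps one, which is the behaviour that distinguishes BatchTopK from a per-sample TopK.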
## Recipe
- BatchTopK with k = 100 (annealed from 1000)
- 100M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50 mix)
- Per-model activation normalization; BOS token dropped
- Adam, learning rate 1e-4, decayed over the last 20% of training; gradient clipping at 1.0
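
The recipe gives the endpoints of the two schedules but not their shape. The sketch below assumes linear interpolation for both; `anneal_frac` (the share of training over which k is annealed) is a hypothetical parameter, and the function names are illustrative rather than taken from the notebook.

```python
def k_schedule(step: int, total_steps: int, k_start: int = 1000,
               k_end: int = 100, anneal_frac: float = 0.1) -> int:
    """Linearly anneal k from k_start to k_end (assumed shape and duration)."""
    anneal_steps = max(1, round(total_steps * anneal_frac))
    if step >= anneal_steps:
        return k_end
    progress = step / anneal_steps
    return round(k_start + progress * (k_end - k_start))


def lr_schedule(step: int, total_steps: int, base_lr: float = 1e-4,
                decay_frac: float = 0.2) -> float:
    """Hold base_lr, then decay linearly to zero over the last decay_frac."""
    decay_start = round(total_steps * (1 - decay_frac))
    if step < decay_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)


total = 1000
print([k_schedule(s, total) for s in (0, 50, 100, 999)])
print([lr_schedule(s, total) for s in (0, 800, 900, 1000)])
```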
## Validation
| | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |

Mean L0 = 100.5; dead-feature fraction = 42.89% (31,625 of 73,728 latents).
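
The card does not spell out how these metrics were computed. A minimal sketch, assuming the usual definitions: variance explained is one minus the ratio of squared reconstruction error to total variance around the per-dimension mean, and L0 is the mean number of non-zero latents per sample. The helper names and toy values are hypothetical.

```python
from typing import List


def variance_explained(x: List[List[float]], x_hat: List[List[float]]) -> float:
    """1 - (squared reconstruction error) / (variance around per-dim mean)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    residual = sum(
        (a - b) ** 2 for row, rec in zip(x, x_hat) for a, b in zip(row, rec)
    )
    total = sum((row[j] - mean[j]) ** 2 for row in x for j in range(d))
    return 1.0 - residual / total


def mean_l0(latents: List[List[float]]) -> float:
    """Average count of non-zero latents per sample."""
    return sum(sum(1 for v in row if v != 0.0) for row in latents) / len(latents)


# Toy activations and reconstructions for one model.
acts = [[1.0, 2.0], [3.0, 4.0]]
recon = [[1.0, 2.0], [3.0, 3.0]]
fve = variance_explained(acts, recon)

# Toy latent activations: 1 and 2 active latents respectively.
l0 = mean_l0([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
print(fve, l0)
```

For a crosscoder the first function would be applied once per model, which gives the two columns in the table above.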
## Δ_norm taxonomy
{
"shared": 39711,
"dead": 31625,
"unclassified": 2385,
"base_only": 4,
"chat_only": 3
}
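
The card does not state how Δ_norm is computed or which cutoffs separate the buckets. The sketch below assumes a common convention in crosscoder model diffing: Δ_norm is the relative difference of the two decoder norms rescaled to [0, 1], with values near 0 treated as base-only, values near 1 as chat-only, and values near 0.5 as shared. The cutoffs (0.1, 0.4 to 0.6, 0.9) and the function names are assumptions. The last lines check that the counts above sum to the dictionary size and reproduce the reported dead-feature fraction.

```python
import math
from typing import List


def delta_norm(d_base: List[float], d_chat: List[float]) -> float:
    """Relative decoder-norm difference, rescaled to [0, 1] (assumed form)."""
    norm_base = math.sqrt(sum(v * v for v in d_base))
    norm_chat = math.sqrt(sum(v * v for v in d_chat))
    largest = max(norm_base, norm_chat)
    if largest == 0.0:
        return 0.5  # both decoders are zero; no preference either way
    return 0.5 * ((norm_chat - norm_base) / largest + 1.0)


def classify(d_base: List[float], d_chat: List[float], is_dead: bool = False) -> str:
    """Bucket a latent using assumed cutoffs; dead latents are set aside first."""
    if is_dead:
        return "dead"
    d = delta_norm(d_base, d_chat)
    if d <= 0.1:
        return "base_only"
    if d >= 0.9:
        return "chat_only"
    if 0.4 <= d <= 0.6:
        return "shared"
    return "unclassified"


labels = [
    classify([3.0, 4.0], [3.0, 4.0]),   # equal norms
    classify([3.0, 4.0], [0.3, 0.4]),   # chat decoder nearly zero
    classify([0.0, 0.1], [0.0, 2.0]),   # base decoder nearly zero
    classify([0.0, 1.0], [0.0, 2.0]),   # in between
]
print(labels)

# Consistency check on the counts reported in this card.
counts = {"shared": 39711, "dead": 31625, "unclassified": 2385,
          "base_only": 4, "chat_only": 3}
total = sum(counts.values())
dead_pct = round(100 * counts["dead"] / total, 2)
print(total, dead_pct)
```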
## Causal validation (this artifact's contribution)
Beyond the decoder-norm taxonomy, every "shared" feature was tested for
causal-effect equivalence: the feature is ablated in both models on matched probe
inputs, and the Pearson correlation between the two resulting KL shifts is
measured. See `causal_validation.csv` and `cosine_vs_causal.png`. The median
causal-equivalence score over shared features is shown in the figure. To our
knowledge, this is the first time this metric has been reported for a
model-diffing crosscoder.
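
The causal-equivalence score for a single feature can be sketched as follows. This is an illustration, not the code that produced `causal_validation.csv`: it assumes a "KL shift" is the KL divergence from the clean next-token distribution to the ablated one on a given probe input, and the KL-shift values shown are made up for the example.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns nan if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# One KL shift on a single probe: clean vs. ablated next-token distribution.
example_shift = kl_divergence([0.5, 0.5], [0.9, 0.1])

# Hypothetical KL shifts for one feature across four matched probe inputs.
kl_shift_base = [0.1, 0.2, 0.3, 0.4]
kl_shift_chat = [0.2, 0.4, 0.6, 0.8]
causal_equivalence = pearson(kl_shift_base, kl_shift_chat)
print(example_shift, causal_equivalence)
```

A score near 1 means that ablating the feature shifts both models' outputs on the same probes in proportion, even if the absolute size of the effect differs between models.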
## Citation
- Lindsey et al. 2024 — Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic Jan 2025 — Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda — NeurIPS 2025 (arxiv:2504.02922)
- Bhatt et al. — Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)
## Reproduce
Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`