# Cross-Model Crosscoder — Gemma-2-2B base vs IT (paper-grade)
A BatchTopK crosscoder trained simultaneously on the layer-13 residual stream of
google/gemma-2-2b and google/gemma-2-2b-it. The dictionary (73,728 latents)
decomposes both models' activations into shared, base-specific, and
chat-specific features.
## Recipe
- BatchTopK with k = 100 (annealed from 1000)
- 100M training tokens (FineWeb-Edu + LMSYS-Chat-1M, 50/50 mix)
- Per-model activation normalization; BOS tokens dropped
- Adam, lr 1e-4, decayed over the last 20% of training, gradient clipping at 1.0
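The BatchTopK sparsifier named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's training code; the ReLU-then-global-top-k form is an assumption about the exact variant used:

```python
import numpy as np

def batch_topk(preacts: np.ndarray, k: int) -> np.ndarray:
    """BatchTopK sparsity: instead of keeping the top-k latents per example,
    keep the (batch_size * k) largest post-ReLU pre-activations across the
    whole batch, so examples that need more latents can use them."""
    relu = np.maximum(preacts, 0.0)
    n_keep = preacts.shape[0] * k
    flat = relu.ravel()
    # Threshold at the (batch_size * k)-th largest value in the batch.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(relu >= thresh, relu, 0.0)
```

Per-example latent counts then vary around k, while the batch-level average L0 is pinned to k (hence the measured L0 ≈ 100 for k = 100).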
## Validation
| metric | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |
Mean L0 = 100.5; dead-feature fraction = 42.89%.
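The validation statistics above can be computed from held-out activations and reconstructions. A minimal sketch, assuming `(tokens, d_model)` activation arrays and a mean-centered variance-explained definition (the notebook's exact definition may differ):

```python
import numpy as np

def fraction_variance_explained(acts: np.ndarray, recon: np.ndarray) -> float:
    """1 - ||x - x_hat||^2 / ||x - mean(x)||^2 over a held-out batch."""
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid / total)

def mean_l0(latent_acts: np.ndarray) -> float:
    """Average number of active latents per token (the L0 statistic)."""
    return float((latent_acts != 0).sum(axis=-1).mean())
```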
## Δ_norm taxonomy

Latents are bucketed by relative decoder norm across the two models (shared ≈ comparable norms in both; base-/chat-only ≈ norm concentrated in one model):
```json
{ "shared": 39711, "dead": 31625, "unclassified": 2385, "base_only": 4, "chat_only": 3 }
```
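A sketch of the Δ_norm statistic underlying this taxonomy, following the Anthropic-style definition that maps each latent's relative decoder-norm difference to [0, 1]; the bucket thresholds used by this artifact are not restated here, so none are hard-coded below:

```python
import numpy as np

def delta_norm(dec_base: np.ndarray, dec_chat: np.ndarray) -> np.ndarray:
    """Per-latent relative decoder-norm difference, mapped to [0, 1]:
    0 -> base-only, 0.5 -> shared, 1 -> chat-only.
    dec_base, dec_chat: (n_latents, d_model) decoder weight matrices."""
    nb = np.linalg.norm(dec_base, axis=-1)
    nc = np.linalg.norm(dec_chat, axis=-1)
    return 0.5 * ((nc - nb) / np.maximum(nb, nc) + 1.0)
```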
## Causal validation (this artifact's contribution)
Beyond the decoder-norm taxonomy, every "shared" feature was tested for
causal-effect equivalence: the feature is ablated in both models on matched
probe inputs, and the Pearson correlation of the two resulting KL shifts is
measured. See `causal_validation.csv` and `cosine_vs_causal.png`. The median
causal-equivalence score over shared features is reported in the figure; to
our knowledge, this is the first time this metric has been reported for a
model-diffing crosscoder.
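The two scalar pieces of this metric can be sketched as follows. This is an illustration of the stated procedure, not the notebook's code; in particular, the direction of the KL (clean vs. ablated) is an assumption:

```python
import numpy as np

def kl_shift(logp_clean: np.ndarray, logp_ablated: np.ndarray) -> float:
    """KL(clean || ablated) over the next-token log-probabilities: how far
    a model's prediction moves when one crosscoder latent is ablated."""
    p = np.exp(logp_clean)
    return float(np.sum(p * (logp_clean - logp_ablated)))

def causal_equivalence(shifts_base: np.ndarray, shifts_chat: np.ndarray) -> float:
    """Pearson correlation of the two models' per-prompt KL shifts for one
    shared latent; values near 1 mean the latent plays the same causal
    role in base and chat."""
    return float(np.corrcoef(shifts_base, shifts_chat)[0, 1])
```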
## Citations
- Lindsey et al. 2024 — Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic Jan 2025 — Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda — NeurIPS 2025 (arxiv:2504.02922)
- Bhatt et al. — Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)
## Reproduce

Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`