---
license: apache-2.0
tags:
- mechanistic-interpretability
- sparse-autoencoder
- crosscoder
- model-diffing
base_model:
- google/gemma-2-2b
- google/gemma-2-2b-it
---
# Cross-Model Crosscoder: Gemma-2-2B base vs IT (papergrade)
A BatchTopK crosscoder trained jointly on the layer-13 residual stream of
`google/gemma-2-2b` and `google/gemma-2-2b-it`. The dictionary
(73,728 latents) decomposes both models' activations into shared,
base-specific, and chat-specific features.
## Recipe
- BatchTopK k = 100 (annealed from 1000)
- 100 M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50)
- Per-model normalization, BOS dropped
- Adam, lr 0.0001, decayed over the last 20% of training, grad clip 1.0
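
To make the BatchTopK entry in the recipe concrete, here is a minimal pure-Python sketch of the selection step. It is not the notebook's implementation: the function name `batch_topk`, the list-of-lists layout, and the restriction to positive pre-activations are assumptions made for illustration. The idea is that the sparsity budget `k * batch_size` is spent across the whole batch, so individual tokens may use more or fewer than `k` latents.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive entries across the whole batch.

    pre_acts: list of rows (one per token), each a list of floats (one per latent).
    Returns a same-shaped list with every non-selected entry set to 0.0.
    Assumption: only positive pre-activations are eligible (ReLU-style).
    """
    batch_size = len(pre_acts)
    budget = k * batch_size
    # Flatten to (value, row, col) so the ranking is global, not per token.
    flat = [
        (v, r, c)
        for r, row in enumerate(pre_acts)
        for c, v in enumerate(row)
        if v > 0.0
    ]
    flat.sort(key=lambda t: t[0], reverse=True)
    out = [[0.0] * len(row) for row in pre_acts]
    for v, r, c in flat[:budget]:
        out[r][c] = v
    return out


# Toy batch: 2 tokens, 4 latents, k = 2, so the batch budget is 4 activations.
acts = [
    [0.9, 0.1, 0.0, 0.2],
    [0.3, 0.8, 0.7, 0.05],
]
sparse = batch_topk(acts, k=2)
l0_per_token = [sum(1 for v in row if v != 0.0) for row in sparse]
mean_l0 = sum(l0_per_token) / len(l0_per_token)
```

In this toy batch the first token keeps one latent and the second keeps three, while the mean L0 over the batch equals `k`. A real implementation would operate on tensors and would also need the annealing schedule for `k`, which this card does not specify.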
## Validation
| | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |

L0 = 100.5, dead-feature fraction = 42.89%
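
As a reading aid for the metrics above, the sketch below shows one common way to compute the fraction of variance explained and the dead-feature fraction. The exact definitions used to produce this card's numbers are not stated here, so treat both functions (`variance_explained`, `dead_fraction`) and the "never fires on the evaluation set" criterion for dead features as assumptions.

```python
def variance_explained(xs, recons):
    """1 - (residual sum of squares) / (total sum of squares around the mean).

    xs, recons: lists of equal-length float vectors (activations and reconstructions).
    """
    n, d = len(xs), len(xs[0])
    mean = [sum(x[i] for x in xs) / n for i in range(d)]
    resid = sum((x[i] - r[i]) ** 2 for x, r in zip(xs, recons) for i in range(d))
    total = sum((x[i] - mean[i]) ** 2 for x in xs for i in range(d))
    return 1.0 - resid / total


def dead_fraction(fire_counts):
    """Share of latents that never fired (assumed definition of 'dead')."""
    return sum(1 for c in fire_counts if c == 0) / len(fire_counts)


# Toy data: three 2-d activations and their reconstructions.
xs = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
recons = [[1.0, 2.0], [3.0, 1.0], [5.0, 3.0]]
fve = variance_explained(xs, recons)   # residual 2 over total 16

# Toy firing counts for four latents, two of which never fire.
dead = dead_fraction([12, 0, 3, 0])
```

For a crosscoder, the same computation would be run once per model (columns A and B in the table), each against that model's own activations.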
## Δ_norm taxonomy
{
"shared": 39711,
"dead": 31625,
"unclassified": 2385,
"base_only": 4,
"chat_only": 3
}
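
The taxonomy above can be illustrated with a small sketch of a relative decoder-norm difference, in the style used in the model-diffing work listed under Citation. This card does not state the exact formula or cut-offs, so the definition in `delta_norm` and the thresholds in `classify` (0.1, 0.4 to 0.6, 0.9) are assumptions, as is treating "dead" as a flag determined separately from the norms.

```python
import math


def delta_norm(d_base, d_chat):
    """Relative decoder-norm difference in [0, 1] (assumed definition).

    0 means the latent only writes to the base model, 1 only to the chat model,
    0.5 means equal decoder norms in both.
    """
    nb = math.sqrt(sum(v * v for v in d_base))
    nc = math.sqrt(sum(v * v for v in d_chat))
    biggest = max(nb, nc)
    if biggest == 0.0:
        return 0.5  # both decoder vectors are zero; no preference either way
    return 0.5 * ((nc - nb) / biggest + 1.0)


def classify(delta, is_dead=False):
    """Bucket a latent; the thresholds are illustrative assumptions."""
    if is_dead:
        return "dead"
    if delta <= 0.1:
        return "base_only"
    if delta >= 0.9:
        return "chat_only"
    if 0.4 <= delta <= 0.6:
        return "shared"
    return "unclassified"


labels = [
    classify(delta_norm([3.0, 4.0], [3.0, 4.0])),   # equal norms
    classify(delta_norm([1.0, 0.0], [0.0, 0.05])),  # base decoder dominates
    classify(delta_norm([0.0, 0.05], [1.0, 0.0])),  # chat decoder dominates
    classify(delta_norm([1.0, 0.0], [0.5, 0.0])),   # in between
]

# Consistency check on the counts reported in this card.
counts = {"shared": 39711, "dead": 31625, "unclassified": 2385,
          "base_only": 4, "chat_only": 3}
total = sum(counts.values())
dead_pct = round(100 * counts["dead"] / total, 2)
```

The counts sum to the dictionary size of 73,728, and the dead share reproduces the 42.89% reported under Validation.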
## Causal validation (this artifact's contribution)
Beyond the decoder-norm taxonomy, every "shared" feature was tested for causal-effect
equivalence: the feature is ablated in both models on matched probe inputs, and the
Pearson correlation between the two models' KL shifts is measured. See
`causal_validation.csv` and `cosine_vs_causal.png`. The median causal equivalence
over shared features is shown in the figure; to our knowledge, this is the first
time this metric has been reported for a model-diffing crosscoder.
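
The causal-equivalence score described above can be sketched as follows. This is an illustration of the stated procedure, not the notebook's code: it assumes that, for one feature, each probe input yields a KL divergence between the clean and ablated next-token distributions in each model, and that the score is the Pearson correlation of those two KL vectors. The helper names and toy numbers are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


# KL shift for one probe: clean vs. ablated next-token distribution (toy values).
clean = [0.7, 0.2, 0.1]
ablated = [0.5, 0.3, 0.2]
shift = kl_divergence(clean, ablated)

# Toy KL shifts for one feature over four matched probes, one list per model.
kl_base = [0.10, 0.40, 0.05, 0.30]
kl_chat = [0.12, 0.35, 0.04, 0.33]
causal_equivalence = pearson(kl_base, kl_chat)
```

A score near 1 would mean that ablating the feature perturbs both models on the same probes to a proportional degree; repeating this per shared feature and taking the median gives a summary of the kind referred to above.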
## Citation
- Lindsey et al. 2024: Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic Jan 2025: Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda: NeurIPS 2025 (arxiv:2504.02922)
- Bhatt et al.: Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)
## Reproduce
Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`