---
license: apache-2.0
tags:
- mechanistic-interpretability
- sparse-autoencoder
- crosscoder
- model-diffing
base_model:
- google/gemma-2-2b
- google/gemma-2-2b-it
---

# Cross-Model Crosscoder: Gemma-2-2B base vs IT (papergrade)

A BatchTopK crosscoder trained jointly on the layer-13 residual stream of
`google/gemma-2-2b` and `google/gemma-2-2b-it`. The dictionary
(73,728 latents) decomposes both models' activations into shared,
base-specific, and chat-specific features.

## Recipe

- BatchTopK with k = 100 (annealed from 1000)
- 100M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50)
- Per-model activation normalization; BOS token dropped
- Adam, learning rate 1e-4 (decayed over the last 20% of training), gradient clipping at 1.0
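
The BatchTopK step in the recipe can be illustrated with a minimal NumPy sketch. It assumes the selection ranks raw non-negative pre-activations across the whole batch and keeps the top `k * batch_size` of them; the function name `batch_topk` and this ranking rule are illustrative assumptions, not code taken from the training notebook.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest entries across the whole batch.

    pre_acts: (batch, n_latents) non-negative scores, e.g. ReLU(encoder output).
    Assumes k >= 1. Ranking by raw value is an assumption of this sketch.
    """
    batch, n_latents = pre_acts.shape
    budget = min(k * batch, pre_acts.size)
    flat = pre_acts.ravel()
    # Indices of the `budget` largest entries; their relative order is irrelevant.
    keep = np.argpartition(flat, flat.size - budget)[flat.size - budget:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(batch, n_latents)
```

Because the budget is shared across the batch, an individual token may use more or fewer than `k` latents while the batch average stays at `k`. BatchTopK models are often evaluated with a fixed activation threshold instead of the batch-level selection, which is one plausible reason a measured L0 can differ slightly from `k`.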

## Validation

| | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |

L0 = 100.5 (mean active latents per token); dead-feature fraction = 42.89% (31,625 of 73,728 latents).
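
For reference, the two validation quantities can be computed as in the sketch below. It assumes variance explained is one minus the ratio of the residual sum of squares to the mean-centered sum of squares, and that a latent counts as dead if it never fires on the evaluation set. The card does not state the exact definitions used, and the function names are hypothetical.

```python
import numpy as np


def variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - residual sum of squares / total sum of squares about the mean.

    x, x_hat: (n_tokens, d_model) activations and their reconstructions.
    """
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return float(1.0 - residual / total)


def dead_fraction(latent_acts: np.ndarray) -> float:
    """Share of latents that never fire; latent_acts is (n_tokens, n_latents)."""
    fired = (latent_acts > 0).any(axis=0)
    return float(1.0 - fired.mean())
```

In a cross-model setting, `variance_explained` would be evaluated once per model, giving the two columns of the table.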

## Δ_norm taxonomy

Latent counts per class (73,728 total):

    {
      "shared": 39711,
      "dead": 31625,
      "unclassified": 2385,
      "base_only": 4,
      "chat_only": 3
    }
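
One common way to derive such a taxonomy is the relative decoder-norm difference used in prior crosscoder model-diffing work: Δ_norm near 0 means a latent's decoder vector is present only for the base model, near 1 only for the chat model, and near 0.5 for both. The sketch below assumes that definition. The thresholds (0.1, 0.4 to 0.6, 0.9), the `dead_mask` input, and the function names are assumptions, since this card does not state the values it used. The last lines check that the reported counts account for every latent.

```python
import numpy as np


def delta_norm(dec_base: np.ndarray, dec_chat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Relative decoder-norm difference in [0, 1] for each latent.

    dec_base, dec_chat: (n_latents, d_model) decoder rows for each model.
    """
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    return 0.5 * ((nc - nb) / np.maximum(np.maximum(nb, nc), eps) + 1.0)


def classify(delta: np.ndarray, dead_mask: np.ndarray) -> np.ndarray:
    """Assign a class label per latent; thresholds are assumed, not from the card."""
    labels = np.full(delta.shape, "unclassified", dtype=object)
    labels[delta <= 0.1] = "base_only"
    labels[(delta >= 0.4) & (delta <= 0.6)] = "shared"
    labels[delta >= 0.9] = "chat_only"
    labels[dead_mask] = "dead"  # dead latents override any norm-based label
    return labels


# Counts reported in this card; together they cover the full dictionary.
reported_counts = {"shared": 39711, "dead": 31625, "unclassified": 2385,
                   "base_only": 4, "chat_only": 3}
total_latents = sum(reported_counts.values())
```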

## Causal validation (this artifact's contribution)

Beyond the decoder-norm taxonomy, every "shared" feature was tested for
causal-effect equivalence: the feature is ablated in both models on matched
probe inputs, and the Pearson correlation between the two models' resulting
KL shifts is recorded. See `causal_validation.csv` and
`cosine_vs_causal.png`. The median causal equivalence over shared features is
reported in the figure; to our knowledge, this is the first time this metric
has been reported for a model-diffing crosscoder.
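
The causal-equivalence score for a single feature can be sketched as follows. The sketch assumes the KL shift is KL(clean || ablated) over next-token distributions, computed per probe input in each model, and that the score is the Pearson correlation of those shifts across inputs. The KL direction, the ablation method, and the helper names are assumptions rather than details confirmed by this card.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kl_shift(clean_logits: np.ndarray, ablated_logits: np.ndarray) -> np.ndarray:
    """KL(clean || ablated) per probe input; logits are (n_inputs, vocab)."""
    log_p = log_softmax(clean_logits)
    log_q = log_softmax(ablated_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


def causal_equivalence(kl_base: np.ndarray, kl_chat: np.ndarray) -> float:
    """Pearson correlation between the two models' KL shifts for one feature."""
    return float(np.corrcoef(kl_base, kl_chat)[0, 1])
```

A score near 1 would indicate that ablating the feature perturbs both models on the same inputs and to proportional degrees; the Pearson correlation is undefined if either model's shifts are constant across inputs.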

## Citation

- Lindsey et al. 2024: Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic, Jan 2025: Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda: NeurIPS 2025 (arXiv:2504.02922)
- Bhatt et al.: Cross-Architecture Model Diffing with Crosscoders (arXiv:2602.11729)

## Reproduce

Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`