---
license: apache-2.0
tags:
- mechanistic-interpretability
- sparse-autoencoder
- crosscoder
- model-diffing
base_model:
- google/gemma-2-2b
- google/gemma-2-2b-it
---

# Cross-Model Crosscoder: Gemma-2-2B base vs IT (papergrade)

BatchTopK crosscoder trained on layer 13 residual stream of `google/gemma-2-2b` and `google/gemma-2-2b-it` simultaneously. The dictionary (73,728 latents) decomposes both models' activations into shared, base-specific, and chat-specific features.

## Recipe

- BatchTopK k = 100 (annealed from 1000)
- 100 M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50)
- Per-model normalization, BOS dropped
- Adam lr 0.0001, decay last 20%, grad clip 1.0

## Validation

| | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |

L0 = 100.5, dead-feature fraction = 42.89%

## Δ_norm taxonomy

| class | latents |
|---|---|
| `shared` | 39711 |
| `dead` | 31625 |
| `unclassified` | 2385 |
| `base_only` | 4 |
| `chat_only` | 3 |

## Causal validation (this artifact's contribution)

Beyond decoder-norm taxonomy, every "shared" feature was tested for causal-effect equivalence: ablate in both models on matched probe inputs, measure Pearson correlation of the two KL-shifts. See `causal_validation.csv` and `cosine_vs_causal.png`.

Median causal-equivalence over shared features is in the figure; this is, to our knowledge, the first time this metric is reported for a model-diffing crosscoder.

## Citation

- Lindsey et al. 2024: Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic, Jan 2025: Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda: NeurIPS 2025 (arxiv:2504.02922)
- Bhatt et al.: Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)

## Reproduce

Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`
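
For readers who want the gist without opening the notebook, the following is a minimal, dependency-free sketch of the two quantities reported above: the Δ_norm taxonomy and the per-feature causal-equivalence score. It is not taken from the notebook. The Δ_norm formula is one common definition from the model-diffing literature, and the cutoffs (`lo`, `shared`, `hi`), the way dead latents are flagged, and all function names are illustrative assumptions; this card does not state the values actually used.

```python
import math


def delta_norm(dec_base, dec_chat):
    """Relative decoder-norm difference per latent, in [0, 1].

    Assumed definition: 0.5 * ((|d_chat| - |d_base|) / max(|d_chat|, |d_base|) + 1).
    0 means the latent only writes to base, 0.5 equal norms, 1 only to chat.
    """
    deltas = []
    for row_b, row_c in zip(dec_base, dec_chat):
        nb = math.sqrt(sum(x * x for x in row_b))
        nc = math.sqrt(sum(x * x for x in row_c))
        denom = max(nb, nc)
        # Zero norm in both decoders: the ratio is undefined, treat as balanced.
        deltas.append(0.5 if denom == 0 else 0.5 * ((nc - nb) / denom + 1.0))
    return deltas


def classify(deltas, dead, lo=0.1, shared=(0.4, 0.6), hi=0.9):
    """Assign a taxonomy label per latent. All thresholds are assumptions."""
    labels = []
    for d, is_dead in zip(deltas, dead):
        if is_dead:
            labels.append("dead")
        elif d <= lo:
            labels.append("base_only")
        elif d >= hi:
            labels.append("chat_only")
        elif shared[0] <= d <= shared[1]:
            labels.append("shared")
        else:
            labels.append("unclassified")
    return labels


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


# Toy decoders: 5 latents, model dimension 2 (made-up numbers).
dec_base = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
dec_chat = [[3.0, 4.0], [0.0, 0.0], [0.0, 2.0], [0.0, 3.0], [1.0, 0.0]]
dead = [False, False, False, False, True]  # e.g. latents that never fired

labels = classify(delta_norm(dec_base, dec_chat), dead)

# Toy KL shifts from ablating one shared latent on three matched probes.
kl_base = [0.1, 0.2, 0.3]
kl_chat = [0.2, 0.4, 0.6]
causal_equivalence = pearson(kl_base, kl_chat)
```

In this toy setup the five latents land in five different classes, and the two KL-shift vectors are perfectly correlated. In practice the decoder rows would come from the crosscoder's base and chat decoder matrices, and the KL shifts from ablating the latent in each model on the same probe inputs.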