# Cross-Model Crosscoder — Gemma-2-2B base vs IT (paper-grade)
A BatchTopK crosscoder trained simultaneously on the layer-13 residual stream of
google/gemma-2-2b and google/gemma-2-2b-it. The dictionary (73,728 latents)
decomposes both models' activations into shared, base-specific, and
chat-specific features.
## Recipe
- BatchTopK with k = 100 (annealed from 1000)
- 100M training tokens (FineWeb-Edu + LMSYS-Chat-1M, 50/50 mix)
- Per-model activation normalization; BOS tokens dropped
- Adam, lr 1e-4, decayed over the last 20% of training, gradient clipping at 1.0
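The BatchTopK sparsifier named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's training code; the ReLU-then-global-top-k form is an assumption about the exact variant used:

```python
import numpy as np

def batch_topk(preacts: np.ndarray, k: int) -> np.ndarray:
    """BatchTopK sparsity: instead of keeping the top-k latents per example,
    keep the (batch_size * k) largest post-ReLU pre-activations across the
    whole batch, so examples that need more latents can use them."""
    relu = np.maximum(preacts, 0.0)
    n_keep = preacts.shape[0] * k
    flat = relu.ravel()
    # Threshold at the (batch_size * k)-th largest value in the batch.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(relu >= thresh, relu, 0.0)
```

Per-example latent counts then vary around k, while the batch-level average L0 is pinned to k (hence the measured L0 ≈ 100 for k = 100).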
## Validation
| metric | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |
Mean L0 = 100.5; dead-feature fraction = 42.89%.
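The validation statistics above can be computed from held-out activations and reconstructions. A minimal sketch, assuming `(tokens, d_model)` activation arrays and a mean-centered variance-explained definition (the notebook's exact definition may differ):

```python
import numpy as np

def fraction_variance_explained(acts: np.ndarray, recon: np.ndarray) -> float:
    """1 - ||x - x_hat||^2 / ||x - mean(x)||^2 over a held-out batch."""
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid / total)

def mean_l0(latent_acts: np.ndarray) -> float:
    """Average number of active latents per token (the L0 statistic)."""
    return float((latent_acts != 0).sum(axis=-1).mean())
```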
## Δ_norm taxonomy

Latents are bucketed by relative decoder norm across the two models (shared ≈ comparable norms in both; base-/chat-only ≈ norm concentrated in one model):
```json
{ "shared": 39711, "dead": 31625, "unclassified": 2385, "base_only": 4, "chat_only": 3 }
```
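A sketch of the Δ_norm statistic underlying this taxonomy, following the Anthropic-style definition that maps each latent's relative decoder-norm difference to [0, 1]; the bucket thresholds used by this artifact are not restated here, so none are hard-coded below:

```python
import numpy as np

def delta_norm(dec_base: np.ndarray, dec_chat: np.ndarray) -> np.ndarray:
    """Per-latent relative decoder-norm difference, mapped to [0, 1]:
    0 -> base-only, 0.5 -> shared, 1 -> chat-only.
    dec_base, dec_chat: (n_latents, d_model) decoder weight matrices."""
    nb = np.linalg.norm(dec_base, axis=-1)
    nc = np.linalg.norm(dec_chat, axis=-1)
    return 0.5 * ((nc - nb) / np.maximum(nb, nc) + 1.0)
```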
## Causal validation (this artifact's contribution)
Beyond the decoder-norm taxonomy, every "shared" feature was tested for
causal-effect equivalence: the feature is ablated in both models on matched
probe inputs, and the Pearson correlation of the two resulting KL shifts is
measured. See `causal_validation.csv` and `cosine_vs_causal.png`. The median
causal-equivalence score over shared features is reported in the figure; to
our knowledge, this is the first time this metric has been reported for a
model-diffing crosscoder.
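The two scalar pieces of this metric can be sketched as follows. This is an illustration of the stated procedure, not the notebook's code; in particular, the direction of the KL (clean vs. ablated) is an assumption:

```python
import numpy as np

def kl_shift(logp_clean: np.ndarray, logp_ablated: np.ndarray) -> float:
    """KL(clean || ablated) over the next-token log-probabilities: how far
    a model's prediction moves when one crosscoder latent is ablated."""
    p = np.exp(logp_clean)
    return float(np.sum(p * (logp_clean - logp_ablated)))

def causal_equivalence(shifts_base: np.ndarray, shifts_chat: np.ndarray) -> float:
    """Pearson correlation of the two models' per-prompt KL shifts for one
    shared latent; values near 1 mean the latent plays the same causal
    role in base and chat."""
    return float(np.corrcoef(shifts_base, shifts_chat)[0, 1])
```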
## Citations
- Lindsey et al. 2024 — Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic Jan 2025 — Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda — NeurIPS 2025 (arxiv:2504.02922)
- Bhatt et al. — Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)
## Reproduce

Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`