Cross-Model Crosscoder — Gemma-2-2B base vs IT (papergrade)

BatchTopK crosscoder trained jointly on the layer-13 residual stream of google/gemma-2-2b and google/gemma-2-2b-it. The dictionary (73,728 latents) decomposes both models' activations into shared, base-specific, and chat-specific features.

Recipe

  • BatchTopK k = 100 (annealed from 1000)
  • 100 M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50)
  • Per-model normalization, BOS dropped
  • Adam lr 0.0001, decay last 20%, grad clip 1.0
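
For readers unfamiliar with the BatchTopK activation named in the recipe, the following is a minimal NumPy sketch of the general technique, not the training code from the notebook. It keeps the k × batch_size largest activations across the whole batch, so individual samples may use more or fewer than k latents while the batch average is k. The function name `batch_topk` is hypothetical, and crosscoder implementations often rank latents by activation scaled by decoder norm, which this sketch omits.

```python
import numpy as np

def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest activations over the whole batch; zero the rest.

    pre_acts: (batch_size, n_latents) encoder pre-activations. Assumes k >= 1.
    """
    batch_size = pre_acts.shape[0]
    acts = np.maximum(pre_acts, 0.0)              # ReLU, so only non-negative values compete
    flat = acts.ravel()
    n_keep = min(k * batch_size, flat.size)
    # Indices of the n_keep largest entries of the flattened batch (unordered).
    keep = np.argpartition(flat, flat.size - n_keep)[flat.size - n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

# Toy batch: 3 samples, 4 latents, k = 2, so 6 activations survive in total.
toy = np.arange(12, dtype=float).reshape(3, 4) - 2.0
sparse = batch_topk(toy, k=2)
per_sample_l0 = np.count_nonzero(sparse, axis=1)  # varies per sample, averages to k
```

In the toy batch the budget is shared across samples, which is the property that distinguishes BatchTopK from a per-sample TopK.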

Validation

                     base (A)   chat (B)
variance explained   0.8773     0.8666

L0 = 100.5, dead-feature fraction = 42.89%
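
The validation numbers above can be computed in several slightly different ways. The sketch below shows one common set of definitions, assuming reconstructions `x_hat` and latent activations `latents` are already available as NumPy arrays. The function names are hypothetical and the exact definitions used in the notebook may differ (for example, in how variance is centered).

```python
import numpy as np

def fraction_variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - (residual sum of squares) / (total sum of squares around the mean activation)."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(1.0 - resid / total)

def mean_l0(latents: np.ndarray) -> float:
    """Average number of non-zero latents per token."""
    return float(np.count_nonzero(latents, axis=1).mean())

def dead_fraction(latents: np.ndarray) -> float:
    """Fraction of latents that never fire on the evaluation set."""
    fired = np.any(latents != 0, axis=0)
    return float(1.0 - fired.mean())
```

Variance explained would be evaluated once per model (base activations against the base reconstruction, chat against chat), while L0 and the dead-feature fraction are properties of the shared latent code.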

Δ_norm taxonomy

{ "shared": 39711, "dead": 31625, "unclassified": 2385, "base_only": 4, "chat_only": 3 }
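
As a rough illustration of how a Δ_norm taxonomy like the one above can be derived, the sketch below uses the relative decoder-norm difference commonly used in the crosscoder model-diffing literature, where 0 means the latent only writes to the base model and 1 means it only writes to the chat model. The thresholds (0.1, 0.4 to 0.6, 0.9) and the names `delta_norm` and `classify` are assumptions for illustration and are not confirmed by this artifact.

```python
import numpy as np

def delta_norm(dec_base: np.ndarray, dec_chat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Relative decoder-norm difference per latent, in [0, 1].

    dec_base, dec_chat: (n_latents, d_model) decoder rows for each model.
    """
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    denom = np.maximum(np.maximum(nb, nc), eps)   # eps guards against all-zero decoders
    return 0.5 * ((nc - nb) / denom + 1.0)

def classify(dn: np.ndarray, is_dead: np.ndarray) -> list:
    """Assign each latent a label; thresholds are assumed, not taken from the artifact."""
    labels = []
    for value, dead in zip(dn, is_dead):
        if dead:
            labels.append("dead")
        elif value < 0.1:
            labels.append("base_only")
        elif value > 0.9:
            labels.append("chat_only")
        elif 0.4 <= value <= 0.6:
            labels.append("shared")
        else:
            labels.append("unclassified")
    return labels

# Consistency check on the reported counts: they sum to the dictionary size.
counts = {"shared": 39711, "dead": 31625, "unclassified": 2385, "base_only": 4, "chat_only": 3}
total = sum(counts.values())                      # 73728
dead_share = counts["dead"] / total               # about 0.4289, matching the 42.89% above
```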

Causal validation (this artifact's contribution)

Beyond the decoder-norm taxonomy, every "shared" feature was tested for causal-effect equivalence: the feature is ablated in both models on matched probe inputs, and the Pearson correlation between the two resulting KL-shifts is measured. See causal_validation.csv and cosine_vs_causal.png; the median causal equivalence over shared features is reported in the figure. To our knowledge, this is the first time this metric has been reported for a model-diffing crosscoder.
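
One way to read the causal-equivalence score described above is sketched here, under the assumption that a "KL-shift" is the KL divergence between a model's clean and feature-ablated next-token distributions on each probe input, and that the Pearson correlation is taken across probe inputs. The helper names are hypothetical and the logits would come from running each model with and without the ablation, which this sketch does not do.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_shift(clean_logits: np.ndarray, ablated_logits: np.ndarray) -> np.ndarray:
    """KL(clean || ablated) per probe input; inputs have shape (n_probes, vocab)."""
    p_log = log_softmax(clean_logits)
    q_log = log_softmax(ablated_logits)
    return np.sum(np.exp(p_log) * (p_log - q_log), axis=-1)

def causal_equivalence(kl_base: np.ndarray, kl_chat: np.ndarray) -> float:
    """Pearson correlation of the two KL-shift vectors for one feature.

    Returns nan if either vector is constant (the correlation is undefined).
    """
    return float(np.corrcoef(kl_base, kl_chat)[0, 1])
```

A score near 1 would mean that ablating the feature perturbs both models on the same probe inputs and to proportional degrees. Because Pearson correlation is scale-invariant, it says nothing about whether the absolute effect sizes match.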

Citation

  • Lindsey et al. 2024 — Sparse Crosscoders for Cross-Layer Features and Model Diffing
  • Anthropic Jan 2025 — Insights on Crosscoder Model Diffing
  • Minder, Dumas, Juang, Chughtai, Nanda — NeurIPS 2025 (arxiv:2504.02922)
  • Bhatt et al. — Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)

Reproduce

Notebook: OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb

