---
license: apache-2.0
tags:
  - mechanistic-interpretability
  - sparse-autoencoder
  - crosscoder
  - model-diffing
base_model:
  - google/gemma-2-2b
  - google/gemma-2-2b-it
---

# Cross-Model Crosscoder — Gemma-2-2B base vs IT (papergrade)

A BatchTopK crosscoder trained jointly on the layer-13 residual stream of
`google/gemma-2-2b` and `google/gemma-2-2b-it`. The dictionary
(73,728 latents) decomposes both models' activations into shared,
base-specific, and chat-specific features.

## Recipe

- BatchTopK k = 100 (annealed from 1000)
- 100 M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50)
- Per-model normalization, BOS dropped
- Adam lr 0.0001, decay last 20%, grad clip 1.0
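
For readers unfamiliar with the BatchTopK activation named in the recipe, here is a minimal NumPy sketch. It is an illustration under assumptions, not the training code from the notebook: the function name `batch_topk` is hypothetical, and candidates are ranked by raw post-ReLU activation, whereas crosscoder implementations often rank by activation scaled by decoder norms.

```python
import numpy as np

def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest activations across the whole batch.

    Unlike per-sample TopK, the budget is shared: one sample may keep more
    than k latents and another fewer, with an average of at most k.
    Assumes k >= 1.
    """
    batch_size = pre_acts.shape[0]
    acts = np.maximum(pre_acts, 0.0)           # ReLU, so only positive values compete
    flat = acts.ravel()
    n_keep = min(k * batch_size, flat.size)
    # Indices of the n_keep largest entries (their relative order is irrelevant).
    keep = np.argpartition(flat, flat.size - n_keep)[flat.size - n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

# Toy usage: 8 samples, 64 latents, k = 4.
rng = np.random.default_rng(0)
pre = rng.normal(size=(8, 64))
sparse = batch_topk(pre, k=4)
active_total = int((sparse > 0).sum())         # shared budget of k * batch_size
per_sample_l0 = (sparse > 0).sum(axis=1)       # varies from sample to sample
```

Because the budget is pooled over the batch, the measured L0 on held-out data can differ slightly from k, which is consistent with an L0 close to, but not exactly, 100.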

## Validation

| | base (A) | chat (B) |
|---|---|---|
| variance explained | 0.8773 | 0.8666 |

Mean L0 = 100.5; dead-feature fraction = 42.89% (31,625 of 73,728 latents).
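
The three validation metrics can be sketched as follows. The exact definitions used for this artifact live in the notebook; the formulas below (fraction of variance explained relative to the per-dimension mean, mean count of non-zero latents per token, share of latents that never fire on the evaluation set) are common conventions and should be read as assumptions. All function names are hypothetical.

```python
import numpy as np

def variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid / total)

def mean_l0(latents: np.ndarray) -> float:
    """Average number of non-zero latents per token (rows are tokens)."""
    return float((latents != 0).sum(axis=1).mean())

def dead_fraction(latents: np.ndarray) -> float:
    """Share of latents (columns) that are zero on every token."""
    return float((~(latents != 0).any(axis=0)).mean())

# Toy check: 3 tokens, 4 latents; the last latent never fires.
toy_latents = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.0, 2.0, 0.0, 0.0],
                        [0.5, 0.0, 3.0, 0.0]])
toy_l0 = mean_l0(toy_latents)
toy_dead = dead_fraction(toy_latents)

# Consistency check on the reported figures: dead count over dictionary size.
reported_dead_pct = round(31625 / 73728 * 100, 2)
```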

## Δ_norm taxonomy

| category | latents |
|---|---|
| shared | 39711 |
| dead | 31625 |
| unclassified | 2385 |
| base_only | 4 |
| chat_only | 3 |
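
As a sketch of how such a taxonomy can be derived from decoder norms: the definition of Δ_norm below is one common form in the crosscoder model-diffing literature, and the thresholds (0.1, 0.9, and 0.4 to 0.6) are illustrative assumptions. This card does not state the cut-offs actually used, so treat `delta_norm` and `classify` as hypothetical helpers rather than the notebook's code.

```python
import numpy as np

def delta_norm(dec_base: np.ndarray, dec_chat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Relative decoder-norm difference per latent, in [0, 1].

    0 means the latent only writes to the base model, 1 only to the chat
    model, and 0.5 means equal decoder norms in both.
    Inputs have shape (n_latents, d_model).
    """
    nb = np.linalg.norm(dec_base, axis=1)
    nc = np.linalg.norm(dec_chat, axis=1)
    denom = np.maximum(np.maximum(nb, nc), eps)   # guard against all-zero decoders
    return 0.5 * ((nc - nb) / denom + 1.0)

def classify(delta: np.ndarray, is_dead: np.ndarray,
             lo: float = 0.1, hi: float = 0.9,
             shared: tuple = (0.4, 0.6)) -> np.ndarray:
    """Assign each latent a category; thresholds are assumed, not from the card."""
    labels = np.full(delta.shape, "unclassified", dtype=object)
    labels[(delta >= shared[0]) & (delta <= shared[1])] = "shared"
    labels[delta <= lo] = "base_only"
    labels[delta >= hi] = "chat_only"
    labels[is_dead] = "dead"                      # dead latents override everything
    return labels

# Toy usage with four latents in a 2-dimensional residual stream.
dec_base = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 0.01], [1.0, 0.0]])
dec_chat = np.array([[1.0, 0.0], [0.0, 0.01], [1.0, 0.0], [0.5, 0.0]])
toy_delta = delta_norm(dec_base, dec_chat)
toy_labels = classify(toy_delta, is_dead=np.zeros(4, dtype=bool))
```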

## Causal validation (this artifact's contribution)

Beyond the decoder-norm taxonomy, every "shared" feature was tested for
causal-effect equivalence: the feature is ablated in both models on matched
probe inputs, and the Pearson correlation between the two resulting KL shifts
is measured. See `causal_validation.csv` and `cosine_vs_causal.png`; the figure
reports the median causal equivalence over shared features. To our knowledge,
this is the first time this metric has been reported for a model-diffing
crosscoder.
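
A minimal sketch of the per-feature score described above, under assumptions: the KL shift is taken as KL(clean || ablated) over next-token distributions for each probe input, and the score is the Pearson correlation of those shifts across probe inputs between the two models. The helper names and the direction of the KL are hypothetical; the notebook is the reference for what was actually computed.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_shift(clean_logits: np.ndarray, ablated_logits: np.ndarray) -> np.ndarray:
    """KL(clean || ablated) per probe input; inputs have shape (n_probes, vocab)."""
    logp = log_softmax(clean_logits)
    logq = log_softmax(ablated_logits)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)

def causal_equivalence(kl_base: np.ndarray, kl_chat: np.ndarray) -> float:
    """Pearson correlation between the two models' KL shifts for one feature."""
    return float(np.corrcoef(kl_base, kl_chat)[0, 1])

# Toy usage: random logits stand in for real model outputs.
rng = np.random.default_rng(0)
clean = rng.normal(size=(16, 50))
ablated = clean + 0.1 * rng.normal(size=(16, 50))
kl_base_toy = kl_shift(clean, ablated)
# A chat model whose shifts are a scaled copy of the base model's correlates perfectly.
score_toy = causal_equivalence(kl_base_toy, 2.0 * kl_base_toy)
```

Pearson correlation is invariant to scale, so a feature whose ablation moves both models in the same pattern across inputs scores near 1 even if one model is affected more strongly in absolute terms.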

## Citation

- Lindsey et al. 2024 — Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Anthropic Jan 2025 — Insights on Crosscoder Model Diffing
- Minder, Dumas, Juang, Chughtai, Nanda — NeurIPS 2025 (arxiv:2504.02922)
- Bhatt et al. — Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)

## Reproduce

Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`