caiovicentino1/gemma2-2b-crosscoder-model-diff-papergrade

	---
	license: apache-2.0
	tags:
	- mechanistic-interpretability
	- sparse-autoencoder
	- crosscoder
	- model-diffing
	base_model:
	- google/gemma-2-2b
	- google/gemma-2-2b-it
	---

	# Cross-Model Crosscoder — Gemma-2-2B base vs IT (papergrade)

	A BatchTopK crosscoder trained jointly on the layer-13 residual streams of
	`google/gemma-2-2b` and `google/gemma-2-2b-it`. The dictionary
	(73,728 latents) decomposes both models' activations into shared,
	base-specific, and chat-specific features.
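
As a rough illustration of the architecture described above, here is a minimal NumPy sketch of a BatchTopK crosscoder forward pass. It assumes one shared latent code with separate encoder and decoder weights per model; the `params` layout, the toy sizes, and the random weights are hypothetical and are not taken from this repository's training code. Some crosscoder implementations also weight activations by decoder norms before selecting the top entries, which is omitted here.

```python
import numpy as np

def batchtopk_crosscoder_forward(x_a, x_b, params, k):
    """x_a, x_b: (batch, d_model) activations from model A (base) and B (chat)."""
    # One shared latent code: both encoders write into the same pre-activations.
    pre = x_a @ params["W_enc_a"] + x_b @ params["W_enc_b"] + params["b_enc"]
    pre = np.maximum(pre, 0.0)  # ReLU, shape (batch, n_latents)

    # BatchTopK: keep the k * batch largest activations across the whole batch,
    # so an individual row may use more or fewer than k latents.
    n_keep = k * pre.shape[0]
    flat = pre.ravel()
    if n_keep < flat.size:
        kth = flat.size - n_keep
        threshold = np.partition(flat, kth)[kth]
        acts = np.where((pre >= threshold) & (pre > 0.0), pre, 0.0)
    else:
        acts = pre

    # Each model is reconstructed from the same latents with its own decoder.
    recon_a = acts @ params["W_dec_a"] + params["b_dec_a"]
    recon_b = acts @ params["W_dec_b"] + params["b_dec_b"]
    return acts, recon_a, recon_b

rng = np.random.default_rng(0)
batch, d_model, n_latents, k = 8, 16, 64, 4  # toy sizes, not the real model's
params = {
    "W_enc_a": rng.normal(size=(d_model, n_latents)),
    "W_enc_b": rng.normal(size=(d_model, n_latents)),
    "b_enc": np.zeros(n_latents),
    "W_dec_a": rng.normal(size=(n_latents, d_model)),
    "W_dec_b": rng.normal(size=(n_latents, d_model)),
    "b_dec_a": np.zeros(d_model),
    "b_dec_b": np.zeros(d_model),
}
x_a = rng.normal(size=(batch, d_model))
x_b = rng.normal(size=(batch, d_model))
acts, recon_a, recon_b = batchtopk_crosscoder_forward(x_a, x_b, params, k)
mean_l0 = np.count_nonzero(acts) / batch  # average active latents per row
```

Because the two decoders read the same latent activations, comparing a latent's two decoder vectors is what allows it to be labelled shared, base-specific, or chat-specific.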

	## Recipe

	- BatchTopK k = 100 (annealed from 1000)
	- 100 M training tokens (FineWeb-Edu + LMSYS-chat-1M, 50/50)
	- Per-model normalization, BOS dropped
	- Adam, lr 0.0001, decayed over the last 20% of training, grad clip 1.0
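
The two schedules in the recipe can be sketched as plain functions of the training step. The card states only the endpoints (k from 1000 to 100, lr 0.0001 decayed over the last 20%), so the linear shapes, the decay to zero, and the `anneal_frac=0.1` warm-down window below are assumptions made for illustration; `k_at` and `lr_at` are hypothetical names.

```python
def k_at(step, total_steps, k_start=1000, k_end=100, anneal_frac=0.1):
    """Sparsity level k at a given step (linear anneal is an assumption)."""
    anneal_steps = max(1, round(total_steps * anneal_frac))
    if step >= anneal_steps:
        return k_end
    frac = step / anneal_steps
    return round(k_start + frac * (k_end - k_start))

def lr_at(step, total_steps, base_lr=1e-4, decay_frac=0.2):
    """Learning rate at a given step: constant, then decayed over the
    final decay_frac of training (linear decay to zero is an assumption)."""
    decay_steps = max(1, round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return base_lr
    remaining = max(0, total_steps - step)
    return base_lr * remaining / decay_steps

total = 1000  # toy step count
schedule_preview = [(s, k_at(s, total), lr_at(s, total)) for s in (0, 50, 100, 800, 900, 1000)]
```

With these assumptions, k reaches 100 after the first tenth of training and the learning rate stays at 0.0001 until step 800 of 1000.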

	## Validation

	| metric | base (A) | chat (B) |
	|---|---|---|
	| variance explained | 0.8773 | 0.8666 |

	Mean L0 = 100.5 active latents per token; dead-feature fraction = 42.89% (31,625 of 73,728 latents).
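
For readers who want to recompute these numbers, the sketch below uses one common definition of fraction of variance explained (one minus residual sum of squares over total sum of squares around the per-dimension mean). The card does not state which definition was used, so treat this as an assumption; `fraction_variance_explained` and `dead_fraction` are hypothetical helper names. The last line checks that the reported dead-feature percentage matches the counts in the taxonomy.

```python
import numpy as np

def fraction_variance_explained(x, x_hat):
    """1 - SS_residual / SS_total, with SS_total taken around the per-dimension mean."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - residual / total

def dead_fraction(fire_counts):
    """Share of latents that never fired on the evaluation set."""
    fire_counts = np.asarray(fire_counts)
    return float(np.mean(fire_counts == 0))

# Consistency check against the card: 31,625 dead latents out of 73,728.
reported_dead_pct = round(100 * 31625 / 73728, 2)
```

Under this definition a perfect reconstruction scores 1.0 and predicting the mean activation scores 0.0.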

	## Δ_norm taxonomy

	{
	"shared": 39711,
	"dead": 31625,
	"unclassified": 2385,
	"base_only": 4,
	"chat_only": 3
	}
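
A sketch of how such a taxonomy can be derived from decoder norms follows. It uses the relative norm difference as commonly defined in the crosscoder model-diffing literature, which is 0 when only the base decoder is non-zero, 1 when only the chat decoder is, and 0.5 when the two norms match. The cut-offs (0.1, 0.4 to 0.6, 0.9) are assumptions chosen for illustration; the thresholds actually used for this artifact, and how dead latents were detected, are not stated in the card.

```python
def delta_norm(norm_base, norm_chat):
    """Relative decoder-norm difference in [0, 1]."""
    return 0.5 * ((norm_chat - norm_base) / max(norm_chat, norm_base) + 1.0)

def classify(norm_base, norm_chat, is_dead=False):
    """Assign a taxonomy label; the threshold values are illustrative assumptions."""
    if is_dead:
        return "dead"
    d = delta_norm(norm_base, norm_chat)
    if d <= 0.1:
        return "base_only"
    if d >= 0.9:
        return "chat_only"
    if 0.4 <= d <= 0.6:
        return "shared"
    return "unclassified"

# Counts reported in the card; they should cover every latent exactly once.
taxonomy = {"shared": 39711, "dead": 31625, "unclassified": 2385,
            "base_only": 4, "chat_only": 3}
total_latents = sum(taxonomy.values())
```

The reported counts sum to 73,728, matching the dictionary size.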

	## Causal validation (this artifact's contribution)

	Beyond the decoder-norm taxonomy, every "shared" feature was tested for
	causal-effect equivalence: the feature is ablated in both models on matched
	probe inputs, and the Pearson correlation between the two resulting KL shifts
	is measured. See `causal_validation.csv` and `cosine_vs_causal.png`; the
	median causal equivalence over shared features is reported in the figure.
	To our knowledge, this is the first time this metric has been reported for a
	model-diffing crosscoder.
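
The metric can be sketched in a few lines of standard-library Python. This version assumes the KL shift is KL(clean || ablated) over the next-token distribution for each probe input, and that the score for a feature is the Pearson correlation of those per-input shifts between the two models. The direction of the KL, the token positions, and any aggregation used in the notebook are not specified in this card, and all function names here are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between the distributions implied by two logit vectors."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def causal_equivalence(clean_a, ablated_a, clean_b, ablated_b):
    """Correlate per-input KL shifts caused by ablating one feature in each model."""
    shifts_a = [kl_divergence(c, a) for c, a in zip(clean_a, ablated_a)]
    shifts_b = [kl_divergence(c, a) for c, a in zip(clean_b, ablated_b)]
    return pearson(shifts_a, shifts_b)

# Toy logits for three probe inputs; the ablation perturbs the first logit.
clean = [[1.0, 0.0, -1.0]] * 3
ablated = [[1.0 + d, 0.0, -1.0] for d in (0.1, 0.5, 1.0)]
score = causal_equivalence(clean, ablated, clean, ablated)
```

A score near 1 means the ablation moves both models' outputs on the same inputs and by proportional amounts, which is a stronger notion of "shared" than similar decoder norms alone.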

	## Citation

	- Lindsey et al. 2024 — Sparse Crosscoders for Cross-Layer Features and Model Diffing
	- Anthropic Jan 2025 — Insights on Crosscoder Model Diffing
	- Minder, Dumas, Juang, Chughtai, Nanda — NeurIPS 2025 (arxiv:2504.02922)
	- Bhatt et al. — Cross-Architecture Model Diffing with Crosscoders (arxiv:2602.11729)

	## Reproduce

	Notebook: `OpenInterpretability/notebooks/17b_crosscoder_model_diff_papergrade.ipynb`