---
license: mit
library_name: pytorch
tags:
- sparse-autoencoder
- interpretability
- mechanistic-interpretability
- gated-deltanet
- mamba
- rwkv
- linear-attention
- state-space-model
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-4B
- Qwen/Qwen3.5-27B
language:
- en
pipeline_tag: feature-extraction
---

# WriteSAE

WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

[Paper](https://arxiv.org/abs/2605.12770) | [Website](https://www.jackyoung.io/research/writesae) | [Code](https://github.com/JackYoung27/writesae)

WriteSAE factors each decoder atom as a rank-1 outer product vᵢwᵢᵀ. This matches the native kₜvₜᵀ write that Gated DeltaNet (GDN), Mamba-2, and RWKV-7 install into a dₖ × dᵥ matrix cache. Residual-stream SAEs cannot reach that write site; WriteSAE can. Three results come from Qwen3.5-0.8B layer 9, head 4 (L9 H4):

- Atom substitution beats a matched-Frobenius-norm ablation on 92.4% of n = 4,851 firings.
- The closed form predicts measured logit shifts with R² = 0.98.
- Sustained three-position installs raise midrank target-in-continuation from 33.3% to 100% under greedy decoding.

Across architectures, GDN rank-1 atoms transfer to Mamba-2-370M at 88.1% over 2,500 firings. Sharpness orders GDN > RWKV-7 > Mamba-2.
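The correspondence between a rank-1 atom and the native write can be checked numerically. The NumPy sketch below uses random toy tensors; all sizes and values are assumptions and are not taken from the checkpoints. It shows two identities:

- The bilinear readout vᵢᵀ S wᵢ equals the Frobenius inner product ⟨S, vᵢwᵢᵀ⟩.
- A rank-1 atom's Frobenius norm factors as ‖vᵢ‖‖wᵢ‖, so a matched-Frobenius-norm control must have that norm.

The norm-matched random update at the end is one illustrative control. It is not necessarily the ablation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 4, 3  # toy sizes; real head dimensions differ

# Hypothetical recurrent state S (d_k x d_v), standing in for a matrix cache.
S = rng.standard_normal((d_k, d_v))

# One decoder atom, stored as its two factors (shapes as in the Quick start).
v = rng.standard_normal(d_k)
w = rng.standard_normal(d_v)
atom = np.outer(v, w)  # rank-1 (d_k, d_v) matrix, same shape as a k_t v_t^T write

# Bilinear readout v^T S w equals the Frobenius inner product <S, v w^T>.
bilinear = v @ S @ w
frobenius_inner = np.sum(S * atom)
assert np.isclose(bilinear, frobenius_inner)

# The Frobenius norm of a rank-1 atom factorizes as ||v|| * ||w||.
assert np.isclose(np.linalg.norm(atom, "fro"), np.linalg.norm(v) * np.linalg.norm(w))

# Illustrative norm-matched random update (the paper's exact control may differ).
R = rng.standard_normal((d_k, d_v))
R *= np.linalg.norm(atom, "fro") / np.linalg.norm(R, "fro")
```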

## Quick start

```python
from huggingface_hub import snapshot_download
import torch

# Download only the primary Qwen3.5-0.8B layer 9, head 4 cell.
ckpt_dir = snapshot_download(
    "JackYoung27/writesae-ckpts",
    allow_patterns=["writesae/qwen0p8b/L9_H4/*"],
)

# ckpt["sae"] is a full Python object rather than a plain state dict, so
# weights_only=False is required. Only unpickle checkpoints you trust.
ckpt = torch.load(
    f"{ckpt_dir}/writesae/qwen0p8b/L9_H4/best.pt",
    weights_only=False,
    map_location="cpu",
)

# Decoder atom 412: the paper's ERASE example.
sae = ckpt["sae"]
v_412 = sae.decoder.v[412].detach()  # (d_k,)
w_412 = sae.decoder.w[412].detach()  # (d_v,)
atom = torch.outer(v_412, w_412)  # rank-1 (d_k, d_v) write
```

A standalone runnable version is in [`LOAD_EXAMPLE.py`](LOAD_EXAMPLE.py).

## Variants

| Variant | Encoder | Decoder | Role |
|---|---|---|---|
| WriteSAE | bilinear vᵢᵀ S wᵢ | rank-1 vᵢwᵢᵀ | All headline numbers |
| FlatSAE | linear on vec(S) | flat | Architectural-prior comparison |
| MatrixSAE | linear on vec(S) | full-rank | Ablation |
| BilinearSAE | bilinear | bilinear | Ablation |
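The NumPy sketch below contrasts the WriteSAE and FlatSAE rows of the table. It rests on three assumptions:

- The encoder and decoder share the same factors vᵢ, wᵢ, as the table's notation suggests.
- The encoder adds a bias.
- The nonlinearity is a ReLU.

The released checkpoints may use untied weights, a different activation, or a different sparsity mechanism. `writesae_forward` and `flatsae_forward` are hypothetical names, not the repository's API.

```python
import numpy as np

def writesae_forward(S, V, W, b):
    """Hypothetical WriteSAE-style pass over one (d_k, d_v) state S.

    V: (n_atoms, d_k) and W: (n_atoms, d_v) hold the atom factors v_i, w_i;
    b: (n_atoms,) is an assumed encoder bias; ReLU is an assumed nonlinearity.
    """
    pre = np.einsum("nk,kv,nv->n", V, S, W)        # v_i^T S w_i for every atom
    codes = np.maximum(pre + b, 0.0)               # non-negative codes
    S_hat = np.einsum("n,nk,nv->kv", codes, V, W)  # sum_i a_i v_i w_i^T
    return codes, S_hat

def flatsae_forward(S, E, D, b):
    """Hypothetical FlatSAE-style pass: linear encoder/decoder on vec(S)."""
    x = S.reshape(-1)                              # vec(S), length d_k * d_v
    codes = np.maximum(E @ x + b, 0.0)
    S_hat = (codes @ D).reshape(S.shape)           # each row of D is a flat atom
    return codes, S_hat

# Toy usage with random parameters.
rng = np.random.default_rng(0)
d_k, d_v, n = 8, 6, 32
S = rng.standard_normal((d_k, d_v))
V = rng.standard_normal((n, d_k))
W = rng.standard_normal((n, d_v))
codes, S_hat = writesae_forward(S, V, W, b=np.zeros(n))

# Decoder parameter counts: rank-1 factors vs. one flat vector per atom.
rank1_params = n * (d_k + d_v)   # 448
flat_params = n * d_k * d_v      # 1536
```

At these toy sizes the rank-1 decoder stores 448 numbers and the flat decoder stores 1,536. The structural prior that the FlatSAE comparison isolates is this factorization, not the extra capacity.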

## Base models covered

Qwen3.5-0.8B (primary), Qwen3.5-4B, Qwen3.5-27B, Mamba-2-370M, RWKV-7-1.5B, DeltaNet-1.3B, GLA-1.3B. See [`MODEL_CARD.md`](MODEL_CARD.md) for full layer/head coverage and training details.

## Repository layout

```text
writesae-ckpts/
  README.md
  MODEL_CARD.md
  manifest.json
  LOAD_EXAMPLE.py
  LICENSE

  writesae/<base-model>/<layer>_<head>/best.pt         # primary cells
  flat_baseline/<base-model>_<layer>_<head>/best.pt    # FlatSAE controls
  results/<test-name>/                                 # JSON outputs per paper claim
```
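After downloading several cells, a small helper can walk the `writesae/<base-model>/<layer>_<head>/best.pt` pattern to list them. This is a hypothetical sketch: `list_writesae_cells` is not part of the repository. It assumes every cell directory is named like the Quick start's `L9_H4`, and cells for other architectures may follow a different scheme. `manifest.json` may already provide such an index.

```python
import re
from pathlib import Path

# Assumed cell-directory naming, e.g. "L9_H4" (layer 9, head 4).
CELL_RE = re.compile(r"^L(?P<layer>\d+)_H(?P<head>\d+)$")

def list_writesae_cells(ckpt_dir):
    """Hypothetical helper: yield (base_model, layer, head, path) per primary cell."""
    root = Path(ckpt_dir) / "writesae"
    for ckpt in sorted(root.glob("*/*/best.pt")):
        base_model = ckpt.parent.parent.name
        match = CELL_RE.match(ckpt.parent.name)
        if match is None:
            continue  # skip directories that do not follow the L<layer>_H<head> pattern
        yield base_model, int(match["layer"]), int(match["head"]), ckpt
```

For example, `for base, layer, head, path in list_writesae_cells(ckpt_dir): print(base, layer, head, path)` uses the `ckpt_dir` returned by `snapshot_download`.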

## Limitations

The closed form predicts well only for Gated DeltaNet at Qwen3.5-0.8B (R² = 0.98 at L9 H4). Applied to Mamba-2 or Qwen3.5-4B, its predictions give a negative R². The substitution test itself transfers to Mamba-2 (88.1%), but the analytical coefficient does not. Per-atom identity varies across SAE seeds. The class-level register/bundle partition reproduces across seeds with a coefficient of variation (CV) of 4–12%.
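The two statistics in this section can be computed as shown below. R² turns negative whenever a predictor's squared error exceeds the spread of the measurements around their own mean. A closed form that misses on a new architecture can therefore score below zero rather than merely near it. The CV here uses the population standard deviation; whether the paper uses the population or sample form is an assumption. All numbers are toy values, not results from the paper.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

def coeff_of_variation(values):
    """CV = std / mean, using the population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Toy measured logit shifts and two sets of predictions (illustrative only).
measured = [1.0, 2.0, 3.0, 4.0]
good = r_squared(measured, [1.1, 1.9, 3.0, 4.1])     # close to 1
bad = r_squared(measured, [-1.0, -2.0, -3.0, -4.0])  # sign-flipped, far below 0

# Toy sizes of one atom class across three SAE seeds.
cv = coeff_of_variation([100, 110, 90])
```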

## Citation

```bibtex
@article{young2026writesae,
  title   = {WriteSAE: Sparse Autoencoders for Recurrent State},
  author  = {Young, Jack},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.12770},
  url     = {https://github.com/JackYoung27/writesae}
}
```

MIT license. Base models retain their upstream licenses; no base-model weights are redistributed.