---
license: mit
library_name: pytorch
tags:
  - sparse-autoencoder
  - interpretability
  - mechanistic-interpretability
  - gated-deltanet
  - mamba
  - rwkv
  - linear-attention
  - state-space-model
base_model:
  - Qwen/Qwen3.5-0.8B
  - Qwen/Qwen3.5-4B
  - Qwen/Qwen3.5-27B
language:
  - en
pipeline_tag: feature-extraction
---

# WriteSAE: Sparse Autoencoders for Recurrent State

A sparse autoencoder for the matrix updates that Gated DeltaNet, Mamba-2, and RWKV-7 write into their recurrent cache at each token. WriteSAE atoms are rank-1 matrices with the same shape as the model's own write, so a single atom can replace one native write at one position. These checkpoints accompany the paper *WriteSAE: Sparse Autoencoders for Recurrent State* ([arXiv:2605.12770](https://arxiv.org/abs/2605.12770)).

- **Code:** [github.com/JackYoung27/writesae](https://github.com/JackYoung27/writesae)
- **Project page:** [jackyoung.io/research/writesae](https://www.jackyoung.io/research/writesae)
- **Author:** [Jack Young](https://www.jackyoung.io), Indiana University ([youngjh@iu.edu](mailto:youngjh@iu.edu), ORCID [0009-0004-6785-303X](https://orcid.org/0009-0004-6785-303X)).

## Headline result

At a single Gated DeltaNet layer-head in Qwen3.5-0.8B, substituting a WriteSAE atom for the native write leaves the final-token distribution closer to the unmodified model's than deleting the write does on **92.4%** of evaluated positions; averaged per atom, the rate is **89.8%**. A closed-form expression in the forget gate, read query, and output embedding predicts the per-firing logit change at **R² = 0.98**. The same replacement test transfers to Mamba-2-370M at **88.1%**. In generation, writing the formula's chosen direction into three consecutive cache positions at 3× the norm of the model's own write makes tokens initially ranked 100–1000 by the unmodified model appear in **100%** of continuations, up from a 33.3% baseline. To our knowledge, this is the first cache-level steering intervention in a state-space or hybrid recurrent layer.
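
The replacement test can be written down schematically. In the sketch below, `run_with_write_override` is a hypothetical helper standing in for the repo's actual cache-hook API (not a real function name there), and KL divergence to the unmodified model's final-token distribution is one reasonable reading of "closer"; see the companion code for the real implementation.

```python
import torch
import torch.nn.functional as F


def run_with_write_override(model, tokens, layer, head, pos, override=None):
    """Hypothetical stand-in for the repo's cache-hook API: run the model and
    return final logits of shape (seq, vocab), overriding the rank-1 write that
    `layer`/`head` makes into the recurrent cache at position `pos`.
    override=None keeps the native write; a zero matrix deletes it; an SAE atom
    v w^T substitutes it."""
    raise NotImplementedError


def atom_beats_deletion(model, tokens, layer, head, pos, atom_vw):
    """True if substituting the SAE atom keeps the final-token distribution
    closer to the unmodified model's than deleting the write does."""
    ref = run_with_write_override(model, tokens, layer, head, pos)            # native write
    sub = run_with_write_override(model, tokens, layer, head, pos, atom_vw)   # atom substituted
    cut = run_with_write_override(model, tokens, layer, head, pos,
                                  torch.zeros_like(atom_vw))                  # write deleted

    log_p_ref = F.log_softmax(ref[-1], dim=-1)  # reference distribution at the final position
    kl_sub = F.kl_div(F.log_softmax(sub[-1], dim=-1), log_p_ref, log_target=True, reduction="sum")
    kl_cut = F.kl_div(F.log_softmax(cut[-1], dim=-1), log_p_ref, log_target=True, reduction="sum")
    return kl_sub < kl_cut
```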

## Variants

| variant | encoder | decoder |
| --- | --- | --- |
| **WriteSAE** | $v_i^\top S w_i$ | $v_i w_i^\top$ (rank-1) |
| FlatSAE | linear on vec($S$) | flat |
| MatrixSAE | linear on vec($S$) | full-rank |
| BilinearSAE | $v_i^\top S w_i$ | bilinear |

WriteSAE is the primary artifact and supports all main-text results.
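
As a concrete reading of the table, here is a minimal PyTorch sketch of the WriteSAE row: each atom $i$ has vectors $v_i$ and $w_i$, the encoder scores a write/state matrix $S$ with the bilinear form $v_i^\top S w_i$, TopK sparsity is applied, and the decoder sums rank-1 matrices $a_i\, v_i w_i^\top$. Names, shapes, initialization, and normalization here are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn


class WriteSAESketch(nn.Module):
    """Rank-1 dictionary over (d_v x d_k) write matrices (illustrative sketch)."""

    def __init__(self, d_v: int, d_k: int, n_atoms: int = 16384, k: int = 32):
        super().__init__()
        self.V = nn.Parameter(torch.randn(n_atoms, d_v) / d_v**0.5)  # left vectors v_i
        self.W = nn.Parameter(torch.randn(n_atoms, d_k) / d_k**0.5)  # right vectors w_i
        self.b = nn.Parameter(torch.zeros(n_atoms))
        self.k = k

    def encode(self, S: torch.Tensor) -> torch.Tensor:
        # Bilinear activations a_i = v_i^T S w_i for a batch of matrices S: (B, d_v, d_k).
        acts = torch.einsum("id,bde,ie->bi", self.V, S, self.W) + self.b
        # TopK sparsity: keep the k largest activations per example, zero the rest.
        vals, idx = acts.topk(self.k, dim=-1)
        return torch.zeros_like(acts).scatter_(-1, idx, vals)

    def decode(self, acts: torch.Tensor) -> torch.Tensor:
        # Reconstruction as a sparse sum of rank-1 atoms: sum_i a_i v_i w_i^T.
        return torch.einsum("bi,id,ie->bde", acts, self.V, self.W)

    def forward(self, S: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(S))
```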

## Base models covered

- Qwen3.5-0.8B (primary)
- Qwen3.5-4B (scale replication)
- Qwen3.5-27B (scale replication)
- Cross-architecture: DeltaNet 1.3B, GLA 1.3B, Mamba-2 2.8B, RWKV-7

## Quick start

```python
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download("JackYoung27/writesae-ckpts", local_dir="ckpts")
# ckpts/manifest.json maps tags to SHA256 and metadata.
```
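
To check a downloaded file against the manifest, something like the following works, assuming `manifest.json` maps each tag to an entry with `path` and `sha256` fields; the tag string and field names below are assumptions, so check the actual schema first.

```python
import hashlib
import json
from pathlib import Path

manifest = json.loads((Path(ckpt_dir) / "manifest.json").read_text())
entry = manifest["writesae/qwen3p5-0p8b/L9_H4"]        # tag name assumed for illustration
blob = (Path(ckpt_dir) / entry["path"]).read_bytes()   # "path" / "sha256" fields assumed
assert hashlib.sha256(blob).hexdigest() == entry["sha256"], "checksum mismatch"
```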

Load and run with the companion code:

```bash
git clone https://github.com/JackYoung27/writesae && cd writesae
pip install -e .
python -m experiments.analysis.analyze \
  --sae_checkpoint ckpts/writesae/qwen3p5-0p8b/L9_H4/best.pt \
  --data_dir states --layer 9 --head 4 --output_dir out
```
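
To peek inside a checkpoint without the companion code, a plain `torch.load` is enough; the state-dict layout (key names, nesting) is not documented here, so the loop below just prints whatever is there.

```python
import torch

ckpt = torch.load("ckpts/writesae/qwen3p5-0p8b/L9_H4/best.pt", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt  # "state_dict" key is a guess
for name, value in state.items():
    shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
    print(name, shape)
```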

## Training details

- Architecture: rank-1 decoder atoms $v_i w_i^\top$, bilinear encoder.
- Dictionary size: 16384 features (configurable).
- Sparsity: TopK activation; BatchTopK supported (see the sketch after this list).
- Training data: OpenWebText (`Skylion007/openwebtext`, streaming), tokenized with the Qwen3.5 tokenizer.
- Training compute: ~180 H100-hours total on a single GPU across variants (paper App. B.3).
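
The two sparsity modes differ only in where the top-k is taken. A minimal sketch with plain tensor ops (not the repo's implementation): TopK keeps the k largest activations in each example, while BatchTopK keeps the k × batch-size largest activations across the whole batch, so individual examples may use more or fewer than k atoms.

```python
import torch


def topk_sparsify(acts: torch.Tensor, k: int) -> torch.Tensor:
    """Per-example TopK: keep the k largest activations in each row."""
    vals, idx = acts.topk(k, dim=-1)
    return torch.zeros_like(acts).scatter_(-1, idx, vals)


def batch_topk_sparsify(acts: torch.Tensor, k: int) -> torch.Tensor:
    """BatchTopK: keep the k * batch_size largest activations across the batch."""
    flat = acts.flatten()
    vals, idx = flat.topk(k * acts.shape[0])
    out = torch.zeros_like(flat).scatter_(0, idx, vals)
    return out.view_as(acts)
```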

## Intended use

Interpretability research on matrix-recurrent and linear-attention model internals: decomposing register/bundle structure, validating cross-architecture transfer, and running causal substitution experiments at the cache write site.

## Out of scope

Production model editing, safety interventions without independent validation, or claims about individual atom identity. Atoms reproduce class-level structure; the basis is SAE-run specific (paper section 6).

## Limitations

- Single primary architecture (Gated DeltaNet); Mamba-2 and GLA results are confirmatory at class level only.
- The primary model is small (0.8B); the 4B and 27B replications supplement but do not replace the main evidence base.
- Mechanism claims are class-granular, not per-atom.

## License

MIT.

## Citation

```bibtex
@article{young2026writesae,
  title   = {WriteSAE: Sparse Autoencoders for Recurrent State},
  author  = {Young, Jack},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.12770},
  url     = {https://arxiv.org/abs/2605.12770}
}
```