@@ -19,72 +19,80 @@ language:
 pipeline_tag: feature-extraction
 ---
-# WriteSAE
-**WriteSAE: Sparse Autoencoders for Recurrent State**
-Jack Young
-[Paper](https://arxiv.org/abs/2605.12770) | [Website](https://www.jackyoung.io/research/writesae) | [Code](https://github.com/JackYoung27/writesae)
-WriteSAE factors each decoder atom as the rank-1 outer product **vᵢwᵢᵀ**, matching the native **kₜvₜᵀ** write that Gated DeltaNet, Mamba-2, and RWKV-7 install into a **dₖ × dᵥ** matrix cache. Residual SAEs cannot reach that write site; WriteSAE can. Atom substitution beats matched-Frobenius-norm ablation on **92.4%** of *n*=4,851 firings at Qwen3.5-0.8B L9 H4, the closed form predicts measured logit shifts at **R² = 0.98**, and sustained three-position installs lift midrank target-in-continuation from 33.3% to **100%** under greedy decoding. Cross-architecture: GDN rank-1 atoms transfer to Mamba-2-370M at 88.1% over 2,500 firings, with sharpness ordering GDN > RWKV-7 > Mamba-2.
 ## Quick start
 ```python
 from huggingface_hub import snapshot_download
-import torch
-ckpt_dir = snapshot_download(
-    "JackYoung27/writesae-ckpts",
-    allow_patterns=["writesae/qwen0p8b/L9_H4/*"],
-)
-ckpt = torch.load(
-    f"{ckpt_dir}/writesae/qwen0p8b/L9_H4/best.pt",
-    weights_only=False,
-    map_location="cpu",
-)
-# Decoder atom 412 — the paper's ERASE example.
-v_412 = ckpt["sae"].decoder.v[412]   # (d_k,)
-w_412 = ckpt["sae"].decoder.w[412]   # (d_v,)
-atom = torch.outer(v_412, w_412)      # (d_k, d_v)
 ```
-Standalone runnable in [`LOAD_EXAMPLE.py`](LOAD_EXAMPLE.py).
-## Variants
-| variant | encoder | decoder | role |
-|---|---|---|---|
-| **WriteSAE** | bilinear vᵢᵀ S wᵢ | rank-1 vᵢwᵢᵀ | All headline numbers |
-| FlatSAE | linear on vec(S) | flat | Architectural-prior comparison |
-| MatrixSAE | linear on vec(S) | full-rank | Ablation |
-| BilinearSAE | bilinear | bilinear | Ablation |
-## Base models covered
-Qwen3.5-0.8B (primary), Qwen3.5-4B, Qwen3.5-27B, Mamba-2-370M, RWKV-7-1.5B, DeltaNet-1.3B, GLA-1.3B. See [`MODEL_CARD.md`](MODEL_CARD.md) for full layer / head coverage and training details.
-## Repository layout
-```text
-writesae-ckpts/
-  README.md
-  MODEL_CARD.md
-  manifest.json
-  LOAD_EXAMPLE.py
-  LICENSE
-  writesae/<base-model>/<layer>_<head>/best.pt        # primary cells
-  flat_baseline/<base-model>_<layer>_<head>/best.pt   # FlatSAE controls
-  results/<test-name>/                                # JSON outputs per paper claim
-```
 ## Limitations
-The closed-form factorization predicts well only on Gated DeltaNet (R² = 0.98 at L9 H4); applied to Mamba-2 or Qwen3.5-4B, it returns negative R². The substitution test itself transfers to Mamba-2 (88.1%); the analytical coefficient does not. Per-atom identity varies across SAE seeds; the class-level register / bundle partition reproduces at CV 4–12%.
 ## Citation
@@ -97,5 +105,3 @@ The closed-form factorization predicts well only on Gated DeltaNet (R² = 0.98 a
   url    = {https://github.com/JackYoung27/writesae}
 }
 ```
-MIT license. Base models retain their upstream licenses; no base-model weights are redistributed.

 pipeline_tag: feature-extraction
 ---
+# WriteSAE: Sparse Autoencoders for Recurrent State
+A sparse autoencoder for the matrix updates that Gated DeltaNet, Mamba-2, and RWKV-7 write into their recurrent cache each token. WriteSAE atoms are rank-1 matrices with the same shape as the model's own write, so a single atom can replace one native write at one position. Companion checkpoints for the paper *WriteSAE: Sparse Autoencoders for Recurrent State* ([arXiv:2605.12770](https://arxiv.org/abs/2605.12770)).
+- **Code:** [github.com/JackYoung27/writesae](https://github.com/JackYoung27/writesae)
+- **Project page:** [jackyoung.io/research/writesae](https://www.jackyoung.io/research/writesae)
+- **Author:** [Jack Young](https://www.jackyoung.io), Indiana University ([youngjh@iu.edu](mailto:youngjh@iu.edu), ORCID [0009-0004-6785-303X](https://orcid.org/0009-0004-6785-303X)).
+## Headline result
+At a single Gated DeltaNet layer-head on Qwen3.5-0.8B, the WriteSAE atom yields a closer final token distribution than deleting the write on **92.4%** of evaluated positions; averaged per atom, the rate is **89.8%**. A closed-form expression in the forget gate, read query, and output embedding predicts the per-firing logit change at **R²=0.98**. The same replacement test transfers to Mamba-2-370M at **88.1%**. In generation, writing the formula's chosen direction into three consecutive cache positions at 3× the norm of the model's write makes tokens initially ranked 100–1000 by the unmodified model appear in **100%** of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention in a state-space or hybrid recurrent layer.
+## Variants
+| variant | encoder | decoder |
+| --- | --- | --- |
+| **WriteSAE** | $v_i^\top S w_i$ | $v_i w_i^\top$ (rank-1) |
+| FlatSAE | linear on vec($S$) | flat |
+| MatrixSAE | linear on vec($S$) | full-rank |
+| BilinearSAE | $v_i^\top S w_i$ | bilinear |
+WriteSAE is the primary artifact and supports all main-text results.
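The WriteSAE row of the table can be illustrated with a minimal NumPy sketch. This is not the companion code: the function names `encode` and `decode`, the stacked arrays `V` and `W`, and the toy dimensions are assumptions chosen for illustration. It only shows the bilinear score $v_i^\top S w_i$ and the rank-1 reconstruction $\sum_i a_i v_i w_i^\top$.

```python
import numpy as np

def encode(S, V, W):
    """Bilinear pre-activations: a_i = v_i^T S w_i for every atom i.

    S: (d_k, d_v) state or write matrix
    V: (n_atoms, d_k) stacked v_i, W: (n_atoms, d_v) stacked w_i
    """
    return np.einsum("ik,kv,iv->i", V, S, W)

def decode(acts, V, W):
    """Reconstruction sum_i acts_i * v_i w_i^T, shape (d_k, d_v)."""
    return np.einsum("i,ik,iv->kv", acts, V, W)

# Toy dimensions (hypothetical, not the checkpoint's real sizes).
rng = np.random.default_rng(0)
d_k, d_v, n_atoms = 4, 6, 8
V = rng.standard_normal((n_atoms, d_k))
W = rng.standard_normal((n_atoms, d_v))

S = np.outer(V[3], W[3])                    # a state equal to atom 3
acts = encode(S, V, W)                      # one score per atom
atom_3 = decode(np.eye(n_atoms)[3], V, W)   # one-hot code -> single atom
```

Because each atom has the same `(d_k, d_v)` shape as the state, a one-hot code decodes to exactly one rank-1 matrix, which is what makes single-atom substitution at the write site possible.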
+## Base models covered
+- Qwen3.5-0.8B (primary)
+- Qwen3.5-4B (scale replication)
+- Qwen3.5-27B (scale replication)
+- Cross-architecture: DeltaNet 1.3B, GLA 1.3B, Mamba-2 2.8B, RWKV-7
 ## Quick start
 ```python
 from huggingface_hub import snapshot_download
+ckpt_dir = snapshot_download("JackYoung27/writesae-ckpts", local_dir="ckpts")
+# ckpts/manifest.json maps tags to SHA256 and metadata.
 ```
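Since the manifest records a SHA256 per checkpoint, a downloaded file can be hashed locally and compared by hand. The sketch below only computes the digest; the helper name `sha256_of_file` is hypothetical, and the lookup into `manifest.json` is left out because its schema is not described here.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter(callable, sentinel) stops when read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file` on a downloaded `best.pt` and comparing the result with the corresponding manifest entry would confirm the download is intact.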
+Load and run with the companion code:
+```bash
+git clone https://github.com/JackYoung27/writesae && cd writesae
+pip install -e .
+python -m experiments.analysis.analyze \
+  --sae_checkpoint ckpts/writesae/qwen3p5-0p8b/L9_H4/best.pt \
+  --data_dir states --layer 9 --head 4 --output_dir out
+```
+## Training details
+- Architecture: rank-1 decoder atoms $v_i w_i^\top$, bilinear encoder.
+- Dictionary size: 16384 features (configurable).
+- Sparsity: TopK activation; BatchTopK supported.
+- Training data: OpenWebText (`Skylion007/openwebtext`, streaming), tokenized with the Qwen3.5 tokenizer.
+- Training compute: ~180 H100-hours single-GPU total across variants (paper App. B.3).
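The TopK sparsity listed above can be sketched generically as follows. This is an illustration of the standard per-sample TopK rule, not the companion implementation: the function name `topk_activation` is hypothetical, no ReLU is applied, and BatchTopK (which selects across the whole batch) is not shown.

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    # Indices of the k largest entries per row (order within the k is arbitrary).
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    kept = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, kept, axis=-1)
    return out

pre = np.array([[0.1, 2.0, -1.0, 0.7],
                [3.0, 0.2, 0.5, -0.4]])
sparse = topk_activation(pre, k=2)
```

Each row of `sparse` has exactly `k` nonzero entries, so sparsity is fixed by construction instead of being tuned through a penalty weight.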
+## Intended use
+Interpretability research on matrix-recurrent and linear-attention model internals: decomposing register/bundle structure, validating cross-architecture transfer, and testing causal substitution experiments at the cache write site.
+## Out of scope
+Production model editing, safety interventions without independent validation, or claims about individual atom identity. Atoms reproduce class-level structure; the basis is SAE-run specific (paper section 6).
 ## Limitations
+- Single primary architecture (Gated DeltaNet); Mamba-2 and GLA are confirmed negative cases.
+- Small-model primary (0.8B); 4B and 27B replications supplement but do not replace the main evidence base.
+- Mechanism claims are class-granular, not per-atom.
+## License
+MIT.
 ## Citation
   url    = {https://github.com/JackYoung27/writesae}
 }
 ```