# Deception Behavioral SAEs: SAELens/Neuronpedia Format
366 Sparse Autoencoders trained on behavioral deception activations across 9 language models (5 architecture families), formatted for SAELens and Neuronpedia compatibility.
Original flat-file checkpoints (with full training metadata) are in `Solshine/nanochat-d32-deception-saes-batch`.
## Research Context
These SAEs are trained on same-prompt behavioral sampling data: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling. The SAEs decompose residual stream activations during deceptive vs. honest response generation, enabling interpretability analysis of deception-relevant features.
- Paper: "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393)
- Follow-up repo: SolshineCode/deception-nanochat-sae-research
- Author: Caleb DeLeeuw (2026)
## Key Findings (Cross-Model, 366 SAEs, 9 Models, 5 Architecture Families)
**Linear probes on raw activations:**
| Model | Params | Peak Layer (depth) | Bal. Accuracy | AUROC |
|---|---|---|---|---|
| nanochat-d32 | 1.88B | L12 (37%) | 86.9% | 0.923 |
| Qwen3-1.7B | 1.7B | L17 (63%) | 80.9% | 0.893 |
| Phi-4-mini-reasoning | 3.8B | L20 (64%) | 80.8% | 0.860 |
| Phi-2 | 2.7B | L21 (75%) | ~75% | – |
| TinyLlama-1.1B | 1.1B | L21 (95%) | 73.2% | 0.784 |
| Llama 3.2-1B | 1.0B | L9 (56%) | 72.5% | – |
| nanochat-d20 | 1.88B | L14 (70%) | ~67% | – |
| SmolLM2-135M | 135M | L4 (80%) | ~69% | – |
| Pythia-160M | 160M | L0 (0%) | 66.0% | 0.696 |
All results p < 0.001, PCA-robust.
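Conceptually, each probe is a binary linear classifier over per-layer activations. A minimal NumPy sketch with synthetic stand-in activations (not the paper's data or probe implementation) illustrates the balanced-accuracy and AUROC metrics reported above:

```python
import numpy as np

# Synthetic stand-in for residual-stream activations at one layer
# (hypothetical data, not the paper's): rows are completions.
rng = np.random.default_rng(0)
d_model = 64
honest = rng.normal(0.0, 1.0, size=(200, d_model))
deceptive = rng.normal(0.3, 1.0, size=(200, d_model))  # shifted mean -> separable

# Difference-of-means direction as a simple linear probe
# (fit and evaluated in-sample, for illustration only).
w = deceptive.mean(axis=0) - honest.mean(axis=0)
scores = np.concatenate([honest @ w, deceptive @ w])
labels = np.array([0] * 200 + [1] * 200)
preds = (scores > scores.mean()).astype(int)

# Balanced accuracy: mean of per-class recalls.
bal_acc = (preds[labels == 1].mean() + (1 - preds[labels == 0]).mean()) / 2

# AUROC via the rank-sum (Mann-Whitney U) formulation.
ranks = scores.argsort().argsort() + 1  # 1-based ranks; ties negligible here
auroc = (ranks[labels == 1].sum() - 200 * 201 / 2) / (200 * 200)
print(f"balanced accuracy: {bal_acc:.3f}, AUROC: {auroc:.3f}")
```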
**SAE decomposition is model-size-dependent:**
- Models ≤ 1.3B: SAEs help detection (8–47% of SAEs beat raw probe accuracy)
- Models ≥ 1.7B: SAEs hurt detection (0–4% beat raw)
- Transition: between TinyLlama-1.1B (47% help) and Qwen3-1.7B (<4% help)
- Best SAE config (small models): JumpReLU + honest_only training condition
- Phi-2 anomaly: 33% of SAEs help at 2.7B (parallel attention architecture); does NOT extend to Phi-4-mini (3.8B, 2%)
- Feature steering: null results at all tested layers and models, indicating deception is distributed rather than localizable to individual features
## Models Covered
| Model | Params | Architecture | Layers in SAEs | SAE Count | SAE Arches |
|---|---|---|---|---|---|
| nanochat-d32 | 1.88B | GPT-NeoX | L4, 8, 12, 16, 20, 24 | 57 | TopK, JumpReLU, Gated |
| nanochat-d20 | 1.88B | GPT-NeoX | L2, 4, 8, 10, 14, 18 | 45 | TopK, JumpReLU |
| Qwen3-1.7B | 1.7B | Qwen | L12, 14, 15, 17, 18 | 45 | TopK, JumpReLU, Gated |
| Phi-4-mini-reasoning | 3.8B | Phi | L2, 6, 10, 14, 18, 22, 26 | 42 | TopK, JumpReLU |
| SmolLM2-135M | 135M | Llama2 | L3, 4, 5, 6, 9, 12, 15, 18, 21 | 54 | TopK, JumpReLU |
| Phi-2 | 2.7B | Phi (parallel) | L4, 8, 12, 16, 20 | 30 | TopK, JumpReLU |
| TinyLlama-1.1B | 1.1B | Llama2 | L3, 6, 9, 12, 15 (+STE) | 39 | TopK, JumpReLU |
| Llama 3.2-1B | 1.0B | Llama | L2, 4, 6 | 18 | TopK, JumpReLU |
| Pythia-160M | 160M | GPT-NeoX | L1, 2, 4, 6, 8, 10 | 36 | TopK, JumpReLU |
| **Total** | | | | 366 | |
**Note on STE validation SAEs:** nanochat-d20 and TinyLlama each include 9 additional
"ste"-tagged SAEs (e.g., `d20_jumprelu_ste_L14_honest_only`) trained with the corrected
Gaussian-kernel STE to validate that the JumpReLU honest_only advantage is not a
dimensionality artifact. 15/18 conditions (83%) confirm the advantage is real.
## Training Details
- **Hardware:** NVIDIA GeForce GTX 1650 Ti with Max-Q Design, 4 GB VRAM (Windows 11 Pro)
- **Training time:** ~400–600 seconds per SAE (300 epochs, batch_size=128)
- **Framework:** custom PyTorch training loop with SAELens-compatible architecture
- **Activations:** residual stream (`resid_post`) collected at generation time
- **Expansion factor:** 4× (d_sae = 4 × d_model)
- **Architectures:** TopK (k=64), JumpReLU, Gated
- **Training conditions:** mixed (all completions), honest_only, deceptive_only
- **Classification:** Gemini 2.5 Flash (behavioral LLM classification, not regex)
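A TopK SAE with the settings above (expansion factor 4×, k=64) can be sketched in NumPy. This is an illustrative forward pass with random placeholder weights, not the training code; the weight names mirror those stored in `sae_weights.safetensors`:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=64):
    """Encode, keep the k largest activations per sample, decode."""
    pre = x @ W_enc + b_enc                      # [batch, d_sae]
    acts = np.maximum(pre, 0.0)                  # ReLU
    # Zero out everything except the top-k activations per row.
    idx = np.argpartition(acts, -k, axis=1)[:, :-k]
    sparse = acts.copy()
    np.put_along_axis(sparse, idx, 0.0, axis=1)
    recon = sparse @ W_dec + b_dec               # [batch, d_in]
    return sparse, recon

rng = np.random.default_rng(0)
d_in, d_sae = 128, 512                           # expansion factor 4x
x = rng.normal(size=(4, d_in))
W_enc = rng.normal(scale=0.02, size=(d_in, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_in))
b_dec = np.zeros(d_in)

sparse, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=64)
print("L0 per sample:", (sparse != 0).sum(axis=1))  # each entry <= 64
```

TopK enforces the sparsity budget exactly in the forward pass, which is why these SAEs are unaffected by the JumpReLU threshold bug described under Known Limitations.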
## SAE Format
Each SAE is in its own subfolder `{sae_id}/` containing:
- `sae_weights.safetensors`: weights (`W_enc`, `b_enc`, `W_dec`, `b_dec`, plus `threshold` for JumpReLU)
- `cfg.json`: SAELens-compatible config (architecture, hook_name, d_in, d_sae, training condition)
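For orientation, a `cfg.json` for a TopK SAE might look roughly like the following; every value here is illustrative (only the field list above is taken from the source), so consult the actual files for exact keys, hook names, and dimensions:

```json
{
  "architecture": "topk",
  "hook_name": "blocks.12.hook_resid_post",
  "d_in": 2048,
  "d_sae": 8192,
  "training_condition": "honest_only"
}
```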
## Known Limitations
**JumpReLU threshold training (348 original SAEs):**
The 348 original batch SAEs (non-STE) have threshold = 0 throughout, making them functionally
equivalent to ReLU. The Heaviside step function has zero autograd gradient with respect to the
threshold, so without a custom straight-through estimator (STE), the threshold never updates
from its initialization of zero. These SAEs operate with ~50% feature density (L0 ≈ d_sae/2)
rather than the intended sparse regime. TopK SAEs (exact L0=64) are the properly sparse
architecture in this collection.
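The bug is easy to see in isolation: with theta = 0, JumpReLU(x) = x · H(x − theta) is exactly ReLU, and roughly zero-centered pre-activations then fire about half the time. A minimal NumPy check on synthetic pre-activations:

```python
import numpy as np

def jumprelu(pre, theta):
    """JumpReLU: pass the pre-activation through only where it exceeds theta."""
    return pre * (pre > theta)

rng = np.random.default_rng(0)
pre = rng.normal(size=(1000, 512))  # roughly zero-centered pre-activations

buggy = jumprelu(pre, theta=0.0)    # threshold stuck at its init of zero
relu = np.maximum(pre, 0.0)
assert np.allclose(buggy, relu)     # theta = 0 is just ReLU

density = (buggy != 0).mean()
print(f"feature density (L0 / d_sae): {density:.2f}")  # ~0.50
```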
**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE
(Rajamanoharan et al. 2024, arXiv:2407.14435). The 18 `_ste_`-tagged SAEs in this repo
use the corrected code. Targeted validation (18 STE SAEs across d20 and TinyLlama)
confirmed that the honest_only advantage over TopK is not a dimensionality artifact:
15/18 conditions (83%) show STE JumpReLU > TopK even with threshold training.
The honest_only > TopK probe accuracy finding is valid regardless of the threshold bug. The threshold bug affects downstream Neuronpedia feature analysis (active feature density), not the probe accuracy comparisons.
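For reference, the STE trick replaces the Heaviside step's zero gradient with a Gaussian density in the backward pass, so the threshold receives a learning signal. A NumPy sketch of the surrogate (the bandwidth `eps` here is illustrative, not the paper's value):

```python
import numpy as np

def heaviside(z):
    """Exact step used in the forward pass: zero gradient w.r.t. theta."""
    return (z > 0).astype(float)

def gaussian_ste_grad(z, eps=0.1):
    """Pseudo-derivative of the step: a Gaussian bump around z = 0."""
    return np.exp(-0.5 * (z / eps) ** 2) / (eps * np.sqrt(2 * np.pi))

pre = np.linspace(-0.5, 0.5, 5)
theta = 0.0
z = pre - theta
# Forward uses the true step; backward substitutes the smooth surrogate,
# so d(output)/d(theta) is nonzero and theta can move off its init.
print(heaviside(z))            # [0. 0. 0. 1. 1.]
print(gaussian_ste_grad(z))    # peaks at z = 0, vanishes far from the threshold
```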
## Loading with SAELens
```python
from safetensors.torch import load_file
import json

sae_id = "d32_topk_L12_honest_only"  # or any sae_id from the repo
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc shape: [d_in, d_sae], W_dec shape: [d_sae, d_in]
# cfg["training_condition"] records honest_only / deceptive_only / mixed
```
## Citation
If you use these SAEs, please cite the original paper:
```bibtex
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv preprint arXiv:2509.20393},
  year={2025}
}
```