Deception Behavioral SAEs — SAELens/Neuronpedia Format

366 Sparse Autoencoders trained on behavioral deception activations across 9 language models (5 architecture families), formatted for SAELens and Neuronpedia compatibility.

Original flat-file checkpoints (with full training metadata) are in: Solshine/nanochat-d32-deception-saes-batch

Research Context

These SAEs are trained on same-prompt behavioral sampling data: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling. The SAEs decompose residual stream activations during deceptive vs. honest response generation — enabling interpretability analysis of deception-relevant features.

Paper: "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"
arXiv:2509.20393
Follow-up repo: SolshineCode/deception-nanochat-sae-research
Author: Caleb DeLeeuw (2026)

Key Findings (Cross-Model, 366 SAEs, 9 Models, 5 Architecture Families)

Linear probes on raw activations:

Model	Params	Peak Layer (depth)	Bal. Accuracy	AUROC
nanochat-d32	1.88B	L12 (37%)	86.9%	0.923
Qwen3-1.7B	1.7B	L17 (63%)	80.9%	0.893
Phi-4-mini-reasoning	3.8B	L20 (64%)	80.8%	0.860
Phi-2	2.7B	L21 (75%)	~75%	—
TinyLlama-1.1B	1.1B	L21 (95%)	73.2%	0.784
Llama 3.2-1B	1.0B	L9 (56%)	72.5%	—
nanochat-d20	1.88B	L14 (70%)	~67%	—
SmolLM2-135M	135M	L4 (80%)	~69%	—
Pythia-160M	160M	L0 (0%)	66.0%	0.696

All results p < 0.001, PCA-robust.

SAE decomposition — model-size-dependent:

Models ≤ 1.3B: SAEs help detection (8–47% of SAEs beat raw probe accuracy)
Models ≥ 1.7B: SAEs hurt detection (0–4% beat raw)
Transition: between TinyLlama-1.1B (47% help) and Qwen3-1.7B (<4% help)
Best SAE config (small models): JumpReLU + honest_only training condition
Phi-2 anomaly: 33% of SAEs help at 2.7B (parallel attention architecture); does NOT extend to Phi-4-mini (3.8B, 2%)
Feature steering: Null results at all tested layers/models — deception is distributed, not localizable to individual features

Models Covered

Model	Params	Architecture	Layers in SAEs	SAE Count	SAE Arches
nanochat-d32	1.88B	GPT-NeoX	L4, 8, 12, 16, 20, 24	57	TopK, JumpReLU, Gated
nanochat-d20	1.88B	GPT-NeoX	L2, 4, 8, 10, 14, 18	45	TopK, JumpReLU
Qwen3-1.7B	1.7B	Qwen	L12, 14, 15, 17, 18	45	TopK, JumpReLU, Gated
Phi-4-mini-reasoning	3.8B	Phi	L2, 6, 10, 14, 18, 22, 26	42	TopK, JumpReLU
SmolLM2-135M	135M	Llama2	L3, 4, 5, 6, 9, 12, 15, 18, 21	54	TopK, JumpReLU
Phi-2	2.7B	Phi (parallel)	L4, 8, 12, 16, 20	30	TopK, JumpReLU
TinyLlama-1.1B	1.1B	Llama2	L3, 6, 9, 12, 15 (+STE)	39	TopK, JumpReLU
Llama 3.2-1B	1.0B	Llama	L2, 4, 6	18	TopK, JumpReLU
Pythia-160M	160M	GPT-NeoX	L1, 2, 4, 6, 8, 10	36	TopK, JumpReLU
Total				366

Note on STE validation SAEs: nanochat-d20 and TinyLlama each include 9 additional "ste" tagged SAEs (e.g., d20_jumprelu_ste_L14_honest_only) trained with the corrected Gaussian-kernel STE to validate that the JumpReLU honest_only advantage is not a dimensionality artifact. 15/18 conditions (83%) confirm the advantage is real.

Training Details

Hardware: NVIDIA GeForce GTX 1650 Ti with Max-Q Design, 4 GB VRAM (Windows 11 Pro)
Training time: ~400–600 seconds per SAE (300 epochs, batch_size=128)
Framework: Custom PyTorch training loop with SAELens-compatible architecture
Activations: Residual stream (resid_post) collected at generation time
Expansion factor: 4× (d_sae = 4 × d_model)
Architectures: TopK (k=64), JumpReLU, Gated
Training conditions: mixed (all completions), honest_only, deceptive_only
Classification: Gemini 2.5 Flash (behavioral LLM classification, not regex)

SAE Format

Each SAE is in its own subfolder {sae_id}/ containing:

sae_weights.safetensors — weights (W_enc, b_enc, W_dec, b_dec, [threshold for JumpReLU])
cfg.json — SAELens-compatible config (architecture, hook_name, d_in, d_sae, training condition)

Known Limitations

JumpReLU threshold training (348 original SAEs):
The 348 original batch SAEs (non-STE) have threshold = 0 throughout — functionally equivalent to ReLU. The Heaviside step function has zero autograd gradient with respect to the threshold, so without a custom straight-through estimator (STE), the threshold never updates from its initialization of zero. These SAEs operate with ~50% feature density (L0 ≈ d_sae/2) rather than the intended sparse regime. TopK SAEs (exact L0=64) are the properly sparse architecture in this collection.

STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The 18 _ste_ tagged SAEs in this repo use the corrected code. Targeted validation (18 STE SAEs across d20 and TinyLlama) confirmed that the honest_only advantage over TopK is not a dimensionality artifact — 15/18 conditions (83%) show STE JumpReLU > TopK even with threshold training.

The honest_only > TopK probe accuracy finding is valid regardless of the threshold bug. The threshold bug affects downstream Neuronpedia feature analysis (active feature density), not the probe accuracy comparisons.

Loading with SAELens

from safetensors.torch import load_file
import json

sae_id = "d32_topk_L12_honest_only"  # or any sae_id from the repo
weights = load_file(f"{sae_id}/sae_weights.safetensors")
cfg = json.load(open(f"{sae_id}/cfg.json"))
# W_enc shape: [d_in, d_sae], W_dec shape: [d_sae, d_in]
# cfg["training_condition"] records honest_only / deceptive_only / mixed

Citation

If you use these SAEs, please cite the original paper:

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Solshine/deception-behavioral-saes-saelens

Base model

EleutherAI/pythia-160m

Finetuned

(221)

this model

Dataset used to train Solshine/deception-behavioral-saes-saelens

Papers for Solshine/deception-behavioral-saes-saelens

The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind

Paper • 2509.20393 • Published Sep 23, 2025

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Paper • 2407.14435 • Published Jul 19, 2024 • 7