Deception Behavioral SAEs β€” SAELens/Neuronpedia Format

366 Sparse Autoencoders trained on behavioral deception activations across 9 language models (5 architecture families), formatted for SAELens and Neuronpedia compatibility.

Original flat-file checkpoints (with full training metadata) are in: Solshine/nanochat-d32-deception-saes-batch

Research Context

These SAEs are trained on same-prompt behavioral sampling data: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling. The SAEs decompose residual stream activations during deceptive vs. honest response generation β€” enabling interpretability analysis of deception-relevant features.

Paper: "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"
arXiv:2509.20393
Follow-up repo: SolshineCode/deception-nanochat-sae-research
Author: Caleb DeLeeuw (2026)

Key Findings (Cross-Model, 366 SAEs, 9 Models, 5 Architecture Families)

Linear probes on raw activations:

Model Params Peak Layer (depth) Bal. Accuracy AUROC
nanochat-d32 1.88B L12 (37%) 86.9% 0.923
Qwen3-1.7B 1.7B L17 (63%) 80.9% 0.893
Phi-4-mini-reasoning 3.8B L20 (64%) 80.8% 0.860
Phi-2 2.7B L21 (75%) ~75% β€”
TinyLlama-1.1B 1.1B L21 (95%) 73.2% 0.784
Llama 3.2-1B 1.0B L9 (56%) 72.5% β€”
nanochat-d20 1.88B L14 (70%) ~67% β€”
SmolLM2-135M 135M L4 (80%) ~69% β€”
Pythia-160M 160M L0 (0%) 66.0% 0.696

All results p < 0.001, PCA-robust.

SAE decomposition β€” model-size-dependent:

  • Models ≀ 1.3B: SAEs help detection (8–47% of SAEs beat raw probe accuracy)
  • Models β‰₯ 1.7B: SAEs hurt detection (0–4% beat raw)
  • Transition: between TinyLlama-1.1B (47% help) and Qwen3-1.7B (<4% help)
  • Best SAE config (small models): JumpReLU + honest_only training condition
  • Phi-2 anomaly: 33% of SAEs help at 2.7B (parallel attention architecture); does NOT extend to Phi-4-mini (3.8B, 2%)
  • Feature steering: Null results at all tested layers/models β€” deception is distributed, not localizable to individual features

Models Covered

Model Params Architecture Layers in SAEs SAE Count SAE Arches
nanochat-d32 1.88B GPT-NeoX L4, 8, 12, 16, 20, 24 57 TopK, JumpReLU, Gated
nanochat-d20 1.88B GPT-NeoX L2, 4, 8, 10, 14, 18 45 TopK, JumpReLU
Qwen3-1.7B 1.7B Qwen L12, 14, 15, 17, 18 45 TopK, JumpReLU, Gated
Phi-4-mini-reasoning 3.8B Phi L2, 6, 10, 14, 18, 22, 26 42 TopK, JumpReLU
SmolLM2-135M 135M Llama2 L3, 4, 5, 6, 9, 12, 15, 18, 21 54 TopK, JumpReLU
Phi-2 2.7B Phi (parallel) L4, 8, 12, 16, 20 30 TopK, JumpReLU
TinyLlama-1.1B 1.1B Llama2 L3, 6, 9, 12, 15 (+STE) 39 TopK, JumpReLU
Llama 3.2-1B 1.0B Llama L2, 4, 6 18 TopK, JumpReLU
Pythia-160M 160M GPT-NeoX L1, 2, 4, 6, 8, 10 36 TopK, JumpReLU
Total 366

Note on STE validation SAEs: nanochat-d20 and TinyLlama each include 9 additional "ste" tagged SAEs (e.g., d20_jumprelu_ste_L14_honest_only) trained with the corrected Gaussian-kernel STE to validate that the JumpReLU honest_only advantage is not a dimensionality artifact. 15/18 conditions (83%) confirm the advantage is real.

Training Details

Hardware: NVIDIA GeForce GTX 1650 Ti with Max-Q Design, 4 GB VRAM (Windows 11 Pro)
Training time: ~400–600 seconds per SAE (300 epochs, batch_size=128)
Framework: Custom PyTorch training loop with SAELens-compatible architecture
Activations: Residual stream (resid_post) collected at generation time
Expansion factor: 4Γ— (d_sae = 4 Γ— d_model)
Architectures: TopK (k=64), JumpReLU, Gated
Training conditions: mixed (all completions), honest_only, deceptive_only
Classification: Gemini 2.5 Flash (behavioral LLM classification, not regex)

SAE Format

Each SAE is in its own subfolder {sae_id}/ containing:

  • sae_weights.safetensors β€” weights (W_enc, b_enc, W_dec, b_dec, [threshold for JumpReLU])
  • cfg.json β€” SAELens-compatible config (architecture, hook_name, d_in, d_sae, training condition)

Known Limitations

JumpReLU threshold training (348 original SAEs):
The 348 original batch SAEs (non-STE) have threshold = 0 throughout β€” functionally equivalent to ReLU. The Heaviside step function has zero autograd gradient with respect to the threshold, so without a custom straight-through estimator (STE), the threshold never updates from its initialization of zero. These SAEs operate with ~50% feature density (L0 β‰ˆ d_sae/2) rather than the intended sparse regime. TopK SAEs (exact L0=64) are the properly sparse architecture in this collection.

STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The 18 _ste_ tagged SAEs in this repo use the corrected code. Targeted validation (18 STE SAEs across d20 and TinyLlama) confirmed that the honest_only advantage over TopK is not a dimensionality artifact β€” 15/18 conditions (83%) show STE JumpReLU > TopK even with threshold training.

The honest_only > TopK probe accuracy finding is valid regardless of the threshold bug. The threshold bug affects downstream Neuronpedia feature analysis (active feature density), not the probe accuracy comparisons.

Loading with SAELens

from safetensors.torch import load_file
import json

sae_id = "d32_topk_L12_honest_only"  # or any sae_id from the repo
weights = load_file(f"{sae_id}/sae_weights.safetensors")
cfg = json.load(open(f"{sae_id}/cfg.json"))
# W_enc shape: [d_in, d_sae], W_dec shape: [d_sae, d_in]
# cfg["training_condition"] records honest_only / deceptive_only / mixed

Citation

If you use these SAEs, please cite the original paper:

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Solshine/deception-behavioral-saes-saelens

Finetuned
(221)
this model

Dataset used to train Solshine/deception-behavioral-saes-saelens

Papers for Solshine/deception-behavioral-saes-saelens