# nanochat-d32 Deception SAE Batch (2026-04-08)
25 Sparse Autoencoders trained on nanochat-d32 activations from the deception-nanochat-sae-research project. Part of a systematic sweep across three SAE architectures, three layers, and three data conditions to characterize how SAE decomposition affects deception detection.
Note (2026-04-08): This batch is 25/27 complete. Two checkpoints (`d32_jumprelu_L12_deceptive_only`, `d32_jumprelu_L12_honest_only`) were not trained because a system crash interrupted the pipeline; they will be added in a follow-up upload.
## Research Context

This is companion data for the paper:

"Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling," Caleb DeLeeuw (2026). Code: https://github.com/SolshineCode/deception-nanochat-sae-research
Core finding: Linear probes on raw activations detect deceptive vs. honest completions with 86.9% balanced accuracy on nanochat-d32 (Layer 12, AUROC=0.923). SAE decomposition hurts detection: all three architectures produce lower probe accuracy than raw activations (p<0.001, Bonferroni-corrected paired t-test), consistent with deception being encoded in distributed, cross-feature geometry rather than in individual sparse features.
## What "deception" means here

These SAEs are trained on completions from same-prompt behavioral sampling: a single ambiguous scenario prompt is given to nanochat-d32, temperature sampling produces both deceptive and honest completions, and activations are collected at the token level during the response. The model is NOT being strategically deceptive; it is a base model producing completions that were post-hoc classified by Gemini as containing deceptive vs. honest content.
This design rules out the confound where probes merely classify which prompt was given rather than measuring genuine behavioral encoding.
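The prompt-confound argument can be made concrete with a toy sketch. The data structures and helper below are hypothetical (not the project's actual pipeline): the point is that every prompt contributes completions to both conditions, so a probe cannot succeed merely by identifying which prompt was given.

```python
from collections import defaultdict

def split_by_label(completions):
    """Group (prompt_id, text, label) completions by their post-hoc label."""
    groups = defaultdict(list)
    for prompt_id, text, label in completions:
        groups[label].append((prompt_id, text))
    return dict(groups)

# One ambiguous prompt (id 0), two temperature samples, opposite labels
toy = [
    (0, "I never received the report.", "deceptive"),
    (0, "I received the report yesterday.", "honest"),
]
groups = split_by_label(toy)
# Both conditions contain completions from the same prompt id
```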
## Training Data

| Condition | Samples | Description |
|---|---|---|
| `deceptive_only` | 650 | Completions classified as deceptive by Gemini |
| `honest_only` | 677 | Completions classified as honest by Gemini |
| `mixed` | 1327 | All usable completions (both conditions) |
Source activations were collected from nanochat-d32 (karpathy/nanochat-d32, 1.88B params, `d_model=2048`, 32 layers) using same-prompt behavioral sampling with 16 scenario prompts × 100 completions at temperature=1.0.
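Token-level activation capture of this kind is typically done with forward hooks. The sketch below uses a toy 2-layer MLP as a stand-in for nanochat-d32 (the real pipeline hooks the residual stream at L4/L8/L12; names here are illustrative, not the project's actual API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in: a single hidden layer with the real d_model of 2048
toy_model = nn.Sequential(nn.Linear(16, 2048), nn.GELU(), nn.Linear(2048, 16))

captured = []
def hook(module, inputs, output):
    captured.append(output.detach())  # (n_tokens, d_model)

# Hook the layer whose activations we want to train the SAE on
handle = toy_model[0].register_forward_hook(hook)
tokens = torch.randn(5, 16)  # pretend: 5 tokens' worth of inputs
toy_model(tokens)
handle.remove()

acts = torch.cat(captured)  # stacked activations, ready for SAE training
```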
## SAE Architecture Sweep

### Training Hyperparameters (all checkpoints)
| Parameter | Value |
|---|---|
| d_in | 2048 (nanochat-d32 hidden dim) |
| d_sae | 8192 (4× expansion) |
| num_epochs | 300 |
| batch_size | 128 |
| learning_rate | 3e-4 |
| l1_coefficient | 1e-3 |
| device | CUDA (GTX 1650 Ti) |
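The hyperparameter table implies an objective of MSE reconstruction loss plus an L1 sparsity penalty with coefficient 1e-3. Here is a minimal single-step sketch using a plain ReLU SAE for illustration; the actual checkpoints use TopK/Gated/JumpReLU encoders, and this is not the repo's training code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_sae = 2048, 8192          # values from the table above
enc = nn.Linear(d_in, d_sae)
dec = nn.Linear(d_sae, d_in)
opt = torch.optim.Adam(
    list(enc.parameters()) + list(dec.parameters()), lr=3e-4
)

x = torch.randn(128, d_in)        # one batch of activations (batch_size=128)
z = torch.relu(enc(x))            # sparse feature activations
x_hat = dec(z)                    # reconstruction

# MSE reconstruction term + L1 sparsity term (l1_coefficient=1e-3)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()

opt.zero_grad()
loss.backward()
opt.step()
```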
### Checkpoint Matrix
| Architecture | Layers Trained | Data Conditions |
|---|---|---|
| TopK (k=64) | L4, L8, L12 | mixed, deceptive_only, honest_only |
| Gated | L4, L8, L12 | mixed, deceptive_only, honest_only |
| JumpReLU | L4, L8 | mixed, deceptive_only, honest_only |
| JumpReLU | L12 | mixed only (incomplete; see note above) |
Layer selection rationale: L4 (13% depth, early), L8 (25% depth, mid-early), and L12 (39% depth), the confirmed peak for the nanochat-d32 deception signal (86.9% balanced accuracy from the raw-activation probe).
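The matrix above is a full 3 × 3 × 3 grid minus the two missing JumpReLU L12 runs. A quick sketch enumerating it (filenames follow the naming convention documented later in this card):

```python
from itertools import product

architectures = ["topk", "gated", "jumprelu"]
layers = ["L4", "L8", "L12"]
conditions = ["mixed", "deceptive_only", "honest_only"]

# Full sweep: 3 architectures x 3 layers x 3 conditions = 27 checkpoints
grid = [f"d32_{a}_{l}_{c}.pt" for a, l, c in product(architectures, layers, conditions)]

# Two runs lost to the system crash noted above
missing = {"d32_jumprelu_L12_deceptive_only.pt", "d32_jumprelu_L12_honest_only.pt"}
shipped = [name for name in grid if name not in missing]
```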
## Results Summary

### Per-Checkpoint Metrics (from `*_meta.json` files)
| Checkpoint | EV | L0 | Alive Features | d_max | d_mean |
|---|---|---|---|---|---|
| d32_gated_L4_mixed | 99.75% | 4106 | 4474 | 0.396 | 0.056 |
| d32_gated_L4_deceptive_only | 99.72% | 4056 | 4425 | 0.429 | 0.052 |
| d32_gated_L4_honest_only | 99.88% | 4124 | 4550 | 0.372 | 0.040 |
| d32_gated_L8_mixed | 99.79% | 4130 | 4776 | 0.451 | 0.065 |
| d32_gated_L8_deceptive_only | 99.74% | 4120 | 4658 | 0.454 | 0.063 |
| d32_gated_L8_honest_only | 99.83% | 4067 | 4639 | 0.445 | 0.059 |
| d32_gated_L12_mixed | 99.81% | 4025 | 4909 | 0.579 | 0.074 |
| d32_gated_L12_deceptive_only | 99.78% | 4176 | 4985 | 0.506 | 0.074 |
| d32_gated_L12_honest_only | 99.44% | 4161 | 5026 | 0.474 | 0.073 |
| d32_jumprelu_L4_mixed | 99.91% | 1951 | 2101 | 0.402 | 0.040 |
| d32_jumprelu_L4_deceptive_only | 99.88% | 2405 | 2544 | 0.391 | 0.047 |
| d32_jumprelu_L4_honest_only | 99.89% | 2461 | 2663 | 0.375 | 0.046 |
| d32_jumprelu_L8_mixed | 99.87% | 2242 | 2559 | 0.506 | 0.060 |
| d32_jumprelu_L8_deceptive_only | 99.84% | 2520 | 2775 | 0.452 | 0.065 |
| d32_jumprelu_L8_honest_only | 99.85% | 2666 | 2986 | 0.412 | 0.063 |
| d32_jumprelu_L12_mixed | 99.81% | 2461 | 3054 | 0.507 | 0.049 |
| d32_topk_L4_mixed | 99.90% | 64 | 64 | 0.250 | 0.002 |
| d32_topk_L4_deceptive_only | 99.88% | 64 | 64 | 0.205 | 0.001 |
| d32_topk_L4_honest_only | 99.86% | 64 | 64 | 0.193 | 0.001 |
| d32_topk_L8_mixed | 99.80% | 64 | 64 | 0.348 | 0.002 |
| d32_topk_L8_deceptive_only | 99.76% | 64 | 64 | 0.280 | 0.002 |
| d32_topk_L8_honest_only | 99.73% | 64 | 64 | 0.343 | 0.002 |
| d32_topk_L12_mixed | 99.63% | 64 | 73 | 0.419 | 0.001 |
| d32_topk_L12_deceptive_only | 99.57% | 64 | 69 | 0.412 | 0.001 |
| d32_topk_L12_honest_only | 99.54% | 64 | 70 | 0.402 | 0.001 |
Column definitions:
- EV: Explained variance (reconstruction quality)
- L0: Average number of active features per token
- Alive features: Features that activated at least once during training
- d_max: Maximum Cohen's d between deceptive vs. honest feature activations (computed on held-out data from the opposing condition)
- d_mean: Mean Cohen's d across all alive features
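For concreteness, per-feature Cohen's d can be computed with the standard pooled-SD formula; the exact estimator behind the table is an assumption here (see the `*_meta.json` files for the project's specifics), and the data below is synthetic:

```python
import torch

def cohens_d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Per-feature Cohen's d between two (n_tokens, d_sae) activation sets."""
    ma, mb = a.mean(0), b.mean(0)
    va, vb = a.var(0), b.var(0)          # unbiased sample variances
    na, nb = a.shape[0], b.shape[0]
    pooled = torch.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (pooled + 1e-8)   # epsilon guards dead features

torch.manual_seed(0)
# Synthetic stand-ins for deceptive vs. honest feature activations
dec_acts = torch.randn(100, 8192) + 0.5  # inject a mean shift
hon_acts = torch.randn(100, 8192)
d = cohens_d(dec_acts, hon_acts)
d_max, d_mean = d.abs().max().item(), d.abs().mean().item()
```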
### Key Observation
d_max peaks at Gated L12 mixed (0.579) and JumpReLU L12 mixed (0.507), consistent with L12 being the behavioral encoding peak for nanochat-d32. TopK SAEs show the lowest d_max values (0.19–0.42) and the most feature collapse (only 64–73 alive features despite 8192 neurons), suggesting they are too aggressive for this use case.
## Relationship to Main Results (Raw Activation Probes)
These SAEs are trained on the same activation data used in the main deception detection experiments. For comparison:
| Probe Target | Layer 12 Balanced Accuracy |
|---|---|
| Raw activations (2048-dim) | 86.9% |
| Gated SAE features (8192-dim) | 83.4% |
| JumpReLU SAE features (8192-dim) | 82.7% |
| TopK SAE features (8192-dim) | 65.8% |
SAE decomposition consistently reduces detection accuracy. These checkpoints are published to support replication and further analysis of why SAEs hurt detection (distributed encoding hypothesis).
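A minimal sketch of how such a probe comparison is run, using scikit-learn and synthetic stand-in data (the real experiments use the nanochat-d32 activations and SAE features described above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 1327 samples (matching the mixed-condition count)
X = rng.normal(size=(1327, 64))
y = rng.integers(0, 2, size=1327)   # 1 = deceptive, 0 = honest
X[y == 1] += 0.4                    # inject a linearly decodable signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
bal_acc = balanced_accuracy_score(y_te, probe.predict(X_te))
```

The same recipe is applied to raw activations and to each SAE's feature activations, holding the probe fixed so only the representation changes.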
## How to Load

```python
import json

import torch
from huggingface_hub import hf_hub_download

# Download a specific checkpoint
path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed.pt",
)
sae = torch.load(path, map_location="cpu", weights_only=False)

# Or download the metadata
meta_path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed_meta.json",
)
with open(meta_path) as f:
    meta = json.load(f)
print(meta)
```
### SAE interface (from `sae/models.py`)

```python
# The SAE objects are PyTorch nn.Module subclasses
# with a consistent interface:

# Get feature activations
features = sae.get_feature_activations(x)  # x: (batch, d_in) -> (batch, d_sae)

# Full encode/decode
x_hat = sae(x)     # reconstructed activations
z = sae.encode(x)  # feature activations (pre-threshold for JumpReLU)

# Architecture-specific behavior:
# - TopK: exactly k features active per token (k=64 here)
# - Gated: soft gating with learnable magnitude/gate separation
# - JumpReLU: hard threshold with learned per-feature bandwidth
```
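The TopK constraint mentioned above can be illustrated in a few lines. This is a hypothetical helper, not the repo's actual implementation:

```python
import torch

def topk_encode(z_pre: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Keep the k largest pre-activations per token, zero the rest."""
    vals, idx = torch.topk(z_pre, k, dim=-1)
    z = torch.zeros_like(z_pre)
    z.scatter_(-1, idx, torch.relu(vals))  # relu guards negative survivors
    return z

torch.manual_seed(0)
z = topk_encode(torch.randn(4, 8192), k=64)
active = (z != 0).sum(dim=-1)  # at most k active features per token
```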
To use these SAEs with the research codebase:

```shell
git clone https://github.com/SolshineCode/deception-nanochat-sae-research
cd deception-nanochat-sae-research
pip install -e .
# Place downloaded .pt files in experiments/scaling/results/batch_saes/
```
## File Naming Convention

`{model}_{architecture}_L{layer}_{data_condition}.pt`

- model: `d32` = nanochat-d32 (1.88B, 32-layer GPT-NeoX)
- architecture: `gated`, `jumprelu`, `topk`
- layer: `L4`, `L8`, `L12` (layer index, 0-based)
- data_condition: `mixed` (all 1327), `deceptive_only` (650), `honest_only` (677)

Each `.pt` file has a corresponding `_meta.json` with training hyperparameters, convergence metrics, and feature discriminability statistics.
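If you are scripting over the whole batch, the convention is easy to parse. This regex-based helper is hypothetical (not shipped with the repo), but matches the pattern documented above:

```python
import re

PATTERN = re.compile(
    r"(?P<model>[^_]+)_(?P<architecture>[^_]+)_L(?P<layer>\d+)_(?P<condition>.+)\.pt"
)

def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into its naming-convention fields."""
    m = PATTERN.fullmatch(filename)
    if m is None:
        raise ValueError(f"unexpected checkpoint name: {filename}")
    fields = m.groupdict()
    fields["layer"] = int(fields["layer"])
    return fields

info = parse_checkpoint_name("d32_gated_L12_mixed.pt")
```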
## Citation

If you use these SAEs, please cite:

```bibtex
@misc{deleeuw2026deception,
  title={Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling},
  author={DeLeeuw, Caleb},
  year={2026},
  url={https://github.com/SolshineCode/deception-nanochat-sae-research},
  note={Preprint}
}
```
## Related Resources

- GitHub repo: https://github.com/SolshineCode/deception-nanochat-sae-research
- Dataset + main nanochat SAE: `Solshine/deception-behavioral-nanochat-d32`
- Original published SAE (L16 TopK): `Solshine/nanochat-d32-sae-layer16-topk32`
- Base model: `karpathy/nanochat-d32`
- Companion Qwen3 SAEs: `results/qwen3_saes/` (4 JumpReLU checkpoints; upload pending)