nanochat-d32 Deception SAE Batch (2026-04-08)

25 Sparse Autoencoders trained on nanochat-d32 activations from the deception-nanochat-sae-research project. Part of a systematic sweep across three SAE architectures, three layers, and three data conditions to characterize how SAE decomposition affects deception detection.

Note (2026-04-08): This batch is 25/27 complete. Two checkpoints (d32_jumprelu_L12_deceptive_only, d32_jumprelu_L12_honest_only) were not trained due to a system crash interrupting the pipeline. They will be added in a follow-up upload.

Research Context

This is companion data for the paper:

"Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling" Caleb DeLeeuw (2026). GitHub

Core finding: Linear probes on raw activations detect deceptive vs. honest completions with 86.9% balanced accuracy on nanochat-d32 (Layer 12, AUROC=0.923). SAE decomposition hurts detection: all three architectures produce lower probe accuracy than raw activations (p<0.001, Bonferroni-corrected paired t-test), consistent with deception being encoded in distributed, cross-feature geometry rather than in individual sparse features.

What "deception" means here

These SAEs are trained on completions from same-prompt behavioral sampling: a single ambiguous scenario prompt is given to nanochat-d32, temperature sampling produces both deceptive and honest completions, and activations are collected at the token level during the response. The model is NOT being strategically deceptive; it is a base model producing completions that were post-hoc classified by Gemini as containing deceptive vs. honest content.

This design rules out the confound where probes merely classify which prompt was given rather than measuring genuine behavioral encoding.
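The sampling-and-labeling loop can be sketched in a few lines. The helper names `sample_fn` and `classify_fn` are hypothetical stand-ins for the nanochat sampler and the Gemini classifier, not functions from the repo:

```python
import random

def collect_same_prompt_samples(sample_fn, classify_fn, prompt, n=100):
    """Sample n completions from ONE prompt, then label each one post hoc.

    sample_fn(prompt)      -> completion text (stochastic, temperature sampling)
    classify_fn(completion) -> "deceptive" or "honest"
    """
    labeled = []
    for _ in range(n):
        completion = sample_fn(prompt)
        labeled.append((completion, classify_fn(completion)))
    return labeled

# Toy stand-ins for the model and the external classifier
random.seed(0)
fake_sample = lambda p: random.choice(["I never saw it.", "I took it."])
fake_classify = lambda c: "deceptive" if "never" in c else "honest"
pairs = collect_same_prompt_samples(fake_sample, fake_classify, "Where is the item?", n=10)
```

Because both labels come from completions to the same prompt, a probe trained on the resulting activations cannot succeed by classifying the prompt.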


Training Data

| Condition | Samples | Description |
|---|---|---|
| deceptive_only | 650 | Completions classified as deceptive by Gemini |
| honest_only | 677 | Completions classified as honest by Gemini |
| mixed | 1327 | All usable completions (both conditions) |

Source activations were collected from nanochat-d32 (karpathy/nanochat-d32, 1.88B params, d_model=2048, 32 layers) using same-prompt behavioral sampling with 16 scenario prompts × 100 completions at temperature=1.0.
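Token-level activation capture of this kind is typically done with a PyTorch forward hook. A minimal sketch on a toy module stands in below; the helper name `capture_layer_activations` is illustrative and not part of the codebase:

```python
import torch
import torch.nn as nn

def capture_layer_activations(model, layer_module, x):
    """Run model(x) and return the output tensor of layer_module via a forward hook."""
    captured = {}

    def hook(mod, inp, out):
        captured["act"] = out.detach()

    handle = layer_module.register_forward_hook(hook)
    try:
        model(x)
    finally:
        handle.remove()  # always clean up the hook
    return captured["act"]

# Toy 2-layer stack standing in for the transformer's residual stream
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
acts = capture_layer_activations(toy, toy[0], torch.randn(4, 8))
```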


SAE Architecture Sweep

Training Hyperparameters (all checkpoints)

| Parameter | Value |
|---|---|
| d_in | 2048 (nanochat-d32 hidden dim) |
| d_sae | 8192 (4× expansion) |
| num_epochs | 300 |
| batch_size | 128 |
| learning_rate | 3e-4 |
| l1_coefficient | 1e-3 |
| device | CUDA (GTX 1650 Ti) |
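With these hyperparameters, a standard L1-regularized SAE objective would look like the sketch below (assuming the usual MSE-plus-L1 formulation; the repo's exact loss may differ per architecture):

```python
import torch

def sae_loss(x, x_hat, z, l1_coefficient=1e-3):
    """Reconstruction MSE plus an L1 sparsity penalty on feature activations z."""
    recon = ((x - x_hat) ** 2).mean()          # reconstruction quality (drives EV)
    sparsity = z.abs().sum(dim=-1).mean()      # mean L1 norm per token (drives L0 down)
    return recon + l1_coefficient * sparsity

# Shapes match the table: batch_size=128, d_in=2048, d_sae=8192
x = torch.randn(128, 2048)
z = torch.relu(torch.randn(128, 8192))
x_hat = torch.zeros_like(x)  # placeholder decoder output for the sketch
loss = sae_loss(x, x_hat, z)
```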

Checkpoint Matrix

| Architecture | Layers Trained | Data Conditions |
|---|---|---|
| TopK (k=64) | L4, L8, L12 | mixed, deceptive_only, honest_only |
| Gated | L4, L8, L12 | mixed, deceptive_only, honest_only |
| JumpReLU | L4, L8 | mixed, deceptive_only, honest_only |
| JumpReLU | L12 | mixed only (incomplete; see note above) |

Layer selection rationale: L4 (13% depth, early), L8 (25% depth, mid-early), and L12 (39% depth). L12 is the confirmed peak for the nanochat-d32 deception signal (86.9% balanced accuracy from the raw-activation probe).


Results Summary

Per-Checkpoint Metrics (from *_meta.json files)

| Checkpoint | EV | L0 | Alive Features | d_max | d_mean |
|---|---|---|---|---|---|
| d32_gated_L4_mixed | 99.75% | 4106 | 4474 | 0.396 | 0.056 |
| d32_gated_L4_deceptive_only | 99.72% | 4056 | 4425 | 0.429 | 0.052 |
| d32_gated_L4_honest_only | 99.88% | 4124 | 4550 | 0.372 | 0.040 |
| d32_gated_L8_mixed | 99.79% | 4130 | 4776 | 0.451 | 0.065 |
| d32_gated_L8_deceptive_only | 99.74% | 4120 | 4658 | 0.454 | 0.063 |
| d32_gated_L8_honest_only | 99.83% | 4067 | 4639 | 0.445 | 0.059 |
| d32_gated_L12_mixed | 99.81% | 4025 | 4909 | 0.579 | 0.074 |
| d32_gated_L12_deceptive_only | 99.78% | 4176 | 4985 | 0.506 | 0.074 |
| d32_gated_L12_honest_only | 99.44% | 4161 | 5026 | 0.474 | 0.073 |
| d32_jumprelu_L4_mixed | 99.91% | 1951 | 2101 | 0.402 | 0.040 |
| d32_jumprelu_L4_deceptive_only | 99.88% | 2405 | 2544 | 0.391 | 0.047 |
| d32_jumprelu_L4_honest_only | 99.89% | 2461 | 2663 | 0.375 | 0.046 |
| d32_jumprelu_L8_mixed | 99.87% | 2242 | 2559 | 0.506 | 0.060 |
| d32_jumprelu_L8_deceptive_only | 99.84% | 2520 | 2775 | 0.452 | 0.065 |
| d32_jumprelu_L8_honest_only | 99.85% | 2666 | 2986 | 0.412 | 0.063 |
| d32_jumprelu_L12_mixed | 99.81% | 2461 | 3054 | 0.507 | 0.049 |
| d32_topk_L4_mixed | 99.90% | 64 | 64 | 0.250 | 0.002 |
| d32_topk_L4_deceptive_only | 99.88% | 64 | 64 | 0.205 | 0.001 |
| d32_topk_L4_honest_only | 99.86% | 64 | 64 | 0.193 | 0.001 |
| d32_topk_L8_mixed | 99.80% | 64 | 64 | 0.348 | 0.002 |
| d32_topk_L8_deceptive_only | 99.76% | 64 | 64 | 0.280 | 0.002 |
| d32_topk_L8_honest_only | 99.73% | 64 | 64 | 0.343 | 0.002 |
| d32_topk_L12_mixed | 99.63% | 64 | 73 | 0.419 | 0.001 |
| d32_topk_L12_deceptive_only | 99.57% | 64 | 69 | 0.412 | 0.001 |
| d32_topk_L12_honest_only | 99.54% | 64 | 70 | 0.402 | 0.001 |

Column definitions:

  • EV: Explained variance (reconstruction quality)
  • L0: Average number of active features per token
  • Alive features: Features that activated at least once during training
  • d_max: Maximum Cohen's d between deceptive and honest feature activations (computed on held-out data from the opposing condition)
  • d_mean: Mean Cohen's d across all alive features
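As a sketch, per-feature Cohen's d with a pooled standard deviation can be computed as below (synthetic activations; the repo's held-out protocol is described above, and the helper name is illustrative):

```python
import numpy as np

def cohens_d_per_feature(a, b):
    """|Cohen's d| per feature column: condition a vs. condition b activations."""
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(
        ((na - 1) * a.var(axis=0, ddof=1) + (nb - 1) * b.var(axis=0, ddof=1))
        / (na + nb - 2)
    )
    return np.abs(mean_diff) / (pooled_sd + 1e-8)

rng = np.random.default_rng(0)
a = rng.normal(0.5, 1.0, size=(650, 16))  # synthetic "deceptive" activations
b = rng.normal(0.0, 1.0, size=(677, 16))  # synthetic "honest" activations
d = cohens_d_per_feature(a, b)
d_max, d_mean = float(d.max()), float(d.mean())
```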

Key Observation

d_max peaks at Gated L12 mixed (0.579) and JumpReLU L12 mixed (0.507), consistent with L12 being the behavioral encoding peak for nanochat-d32. TopK SAEs show the lowest d_max values (0.19–0.42) and the most feature collapse (only 64–73 alive features despite 8192 neurons), suggesting they are too aggressive for this use case.

Relationship to Main Results (Raw Activation Probes)

These SAEs are trained on the same activation data used in the main deception detection experiments. For comparison:

| Probe Target | Layer 12 Balanced Accuracy |
|---|---|
| Raw activations (2048-dim) | 86.9% |
| Gated SAE features (8192-dim) | 83.4% |
| JumpReLU SAE features (8192-dim) | 82.7% |
| TopK SAE features (8192-dim) | 65.8% |

SAE decomposition consistently reduces detection accuracy. These checkpoints are published to support replication and further analysis of why SAEs hurt detection (distributed encoding hypothesis).
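A linear probe of the kind compared above is just logistic regression fit on (raw or SAE) features. A minimal sketch on synthetic data with one injected signal direction; the 86.9% figure comes from the real activations, not from this toy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 1327 "completions" with a weak linear signal in one direction
X = rng.normal(size=(1327, 64))
y = rng.integers(0, 2, size=1327)
X[y == 1, 0] += 1.0  # inject a separable "deception direction"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = balanced_accuracy_score(y_te, probe.predict(X_te))
```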


How to Load

import torch
from huggingface_hub import hf_hub_download

# Download a specific checkpoint
path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed.pt",
)
sae = torch.load(path, map_location="cpu", weights_only=False)  # full nn.Module pickle, not a state_dict

# Or download the metadata
import json
meta_path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed_meta.json",
)
with open(meta_path) as f:
    meta = json.load(f)
print(meta)

SAE interface (from sae/models.py)

# The SAE objects are PyTorch nn.Module subclasses
# with a consistent interface:

# Get feature activations
features = sae.get_feature_activations(x)  # x: (batch, d_in) -> (batch, d_sae)

# Full encode/decode
x_hat = sae(x)              # reconstructed activations
z = sae.encode(x)           # feature activations (pre-threshold for JumpReLU)

# Architecture-specific
# TopK: exactly k features active per token (k=64 here)
# Gated: soft gating with learnable magnitude/gate separation
# JumpReLU: hard threshold with learned per-feature bandwidth
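For illustration, the TopK encode step (keep the k largest pre-activations per token, zero the rest) can be sketched as follows. This is a generic implementation, not the code in sae/models.py:

```python
import torch

def topk_encode(x, W_enc, b_enc, k=64):
    """TopK SAE encoding: exactly <= k nonzero features per token."""
    pre = x @ W_enc + b_enc              # (batch, d_sae) pre-activations
    vals, idx = pre.topk(k, dim=-1)      # k largest per token
    z = torch.zeros_like(pre)
    z.scatter_(-1, idx, torch.relu(vals))  # keep only the top-k, rectified
    return z

x = torch.randn(4, 2048)
W_enc = torch.randn(2048, 8192) / 2048 ** 0.5
b_enc = torch.zeros(8192)
z = topk_encode(x, W_enc, b_enc)
```

The hard per-token budget explains the L0 = 64 rows in the metrics table, and hints at why feature collapse (few distinct features ever selected) is a risk for this architecture.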

To use these SAEs with the research codebase:

git clone https://github.com/SolshineCode/deception-nanochat-sae-research
cd deception-nanochat-sae-research
pip install -e .
# Place downloaded .pt files in experiments/scaling/results/batch_saes/

File Naming Convention

{model}_{architecture}_L{layer}_{data_condition}.pt

  • model: d32 = nanochat-d32 (1.88B, 32-layer GPT-NeoX)
  • architecture: gated, jumprelu, topk
  • layer: L4, L8, L12 (layer index, 0-based)
  • data_condition: mixed (all 1327), deceptive_only (650), honest_only (677)

Each .pt file has a corresponding _meta.json with training hyperparameters, convergence metrics, and feature discriminability statistics.
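The naming convention can be parsed mechanically; note that the split has to rejoin the underscore in deceptive_only / honest_only. The helper name is illustrative:

```python
def parse_checkpoint_name(filename):
    """Split '{model}_{architecture}_L{layer}_{data_condition}.pt' into fields."""
    stem = filename.removesuffix(".pt")
    model, arch, layer, *cond = stem.split("_")
    return {
        "model": model,
        "architecture": arch,
        "layer": int(layer.lstrip("L")),
        "data_condition": "_".join(cond),  # rejoins e.g. 'deceptive_only'
    }

info = parse_checkpoint_name("d32_gated_L12_deceptive_only.pt")
```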


Citation

If you use these SAEs, please cite:

@misc{deleeuw2026deception,
  title={Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling},
  author={DeLeeuw, Caleb},
  year={2026},
  url={https://github.com/SolshineCode/deception-nanochat-sae-research},
  note={Preprint}
}

Related Resources

  • GitHub repo: https://github.com/SolshineCode/deception-nanochat-sae-research
  • Dataset + main nanochat SAE: Solshine/deception-behavioral-nanochat-d32
  • Original published SAE (L16 TopK): Solshine/nanochat-d32-sae-layer16-topk32
  • Base model: karpathy/nanochat-d32
  • Companion Qwen3 SAEs: results/qwen3_saes/ (4 JumpReLU checkpoints); upload pending