Llama-3.2-1B Deception Behavioral SAEs

18 Sparse Autoencoders trained on residual stream activations from meta-llama/Llama-3.2-1B (a 1.23B-parameter base model with the Llama-3 architecture), capturing behavioral deception signals via same-prompt temperature sampling.

Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).

What's in This Repo

  • 18 SAEs across 3 layers (L2, L4, L6)
  • 2 architectures: TopK (k=64), JumpReLU
  • 3 training conditions: mixed, deceptive_only, honest_only
  • Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
  • Dimensions: d_in=2048, d_sae=8192 (4x expansion)
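
The 18 SAEs are the full cross product of layer, architecture, and training condition. A minimal sketch of enumerating the subfolder names, assuming every SAE follows the llama_{architecture}_L{layer}_{condition} pattern seen in IDs such as llama_jumprelu_L4_honest_only; the pattern is inferred, so check the repo file listing before relying on it:

```python
from itertools import product

# Values taken from the list above; the naming pattern itself is an assumption
# inferred from IDs such as "llama_jumprelu_L4_honest_only".
LAYERS = (2, 4, 6)
ARCHITECTURES = ("topk", "jumprelu")
CONDITIONS = ("mixed", "deceptive_only", "honest_only")

sae_ids = [
    f"llama_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]

print(len(sae_ids))  # 18
```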

Research Context

This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling. Llama-3.2-1B underwent a multi-round evaluation with both gpt-4o-mini and Gemini 2.5 Flash classification; final results use Gemini individual classification at T=1.0.

Code: SolshineCode/deception-nanochat-sae-research

Key Findings: Llama-3.2-1B

Llama-3.2-1B was the first cross-model replication in the study. Its successful replication, after a classification methodology upgrade, confirmed that the deception signal is not specific to the GPT-NeoX nanochat family.

Metric | Value
Peak layer | L9 (56% depth), via full layer sweep
Peak balanced accuracy (R3, T=1.0) | 72.5%
Peak AUROC | 0.784
Best SAE probe accuracy | 65.1% (llama_jumprelu_L4_honest_only)
SAEs beating raw baseline | 4/18 (22%), SAEs help detection
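
Balanced accuracy is the mean of per-class recall, so a probe that always predicts the majority class scores 50% regardless of class imbalance. A minimal pure-Python sketch of the metric follows; the labels are made up for illustration and are not study data:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Illustrative labels only: 1 = deceptive, 0 = honest.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

score = balanced_accuracy(y_true, y_pred)
majority_only = balanced_accuracy(y_true, [0] * len(y_true))
print(f"probe: {score:.3f}, always-honest baseline: {majority_only:.3f}")
```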

Classification methodology is critical here: The initial Llama run with gpt-4o-mini batch classification labeled 100% of samples as "ambiguous", yielding near-chance probe accuracy. Switching to Gemini individual classification (one completion at a time) raised balanced accuracy to a clean 72.5%. This finding directly validates the Secret Agenda paper's Appendix G warning against batch or rule-based behavioral classification. Llama-3.2-1B generates more linguistically nuanced responses that confuse batch classifiers.

SAEs help detection (22% beat raw): Four SAEs beat their respective layer's raw baseline, placing Llama-3.2-1B in the SAE-helps camp alongside other sub-1.3B models. The best SAE (llama_jumprelu_L4_honest_only) achieves 65.1%, substantially above the L4 raw baseline.

Candidate deception feature with highest differential in the study: Feature attribution analysis on llama_jumprelu_L4_honest_only identified feature #7989 activating in 74.0% of deceptive vs 23.5% of honest completions, a +50.5pp differential and the highest found across all models in the study. This warrants further investigation via feature steering.
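
The differential reported here is the gap between per-class firing rates, in percentage points. A sketch of that computation on toy data, assuming a feature counts as "activating" on a completion when its activation is greater than zero (the card does not state the exact criterion or how activations are pooled over tokens):

```python
def firing_rate(acts, threshold=0.0):
    """Fraction of completions on which the feature activation exceeds threshold."""
    return sum(a > threshold for a in acts) / len(acts)

def differential_pp(deceptive_acts, honest_acts, threshold=0.0):
    """Deceptive-minus-honest firing rate, in percentage points."""
    return 100.0 * (firing_rate(deceptive_acts, threshold)
                    - firing_rate(honest_acts, threshold))

# Toy per-completion activations for one feature (not study data).
deceptive = [0.9, 0.0, 1.3, 0.4, 0.0, 2.1, 0.7, 0.2]   # fires on 6 of 8
honest    = [0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.3, 0.0]   # fires on 2 of 8

diff = differential_pp(deceptive, honest)
print(f"{diff:+.1f}pp")  # +50.0pp
```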

PCA robustness, signal in very few dimensions: The Llama signal saturates at only 5 PCA components (PCA-5: 82.7% ≈ full-dim 78.1%), suggesting an even more concentrated encoding than d20 (which saturated at ~10 components). For comparison, d32 requires ~100 components. Smaller models appear to represent deception in a much lower-dimensional subspace.
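
One way to run this kind of check is to project activations onto their top-k principal components and refit a probe on the reduced features. The sketch below does the projection with NumPy's SVD on synthetic data. The function names, the data, and the nearest-centroid probe are illustrative assumptions, not the study's pipeline:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def nearest_centroid_accuracy(Z, y):
    """Accuracy of a nearest-class-centroid probe (fit and scored on the same data)."""
    c0, c1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    closer_to_c1 = np.linalg.norm(Z - c1, axis=1) < np.linalg.norm(Z - c0, axis=1)
    return float((closer_to_c1.astype(int) == y).mean())

rng = np.random.default_rng(0)
n, d = 200, 64
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 4.0 * y  # synthetic class signal confined to one direction

for k in (1, 5, 20, d):
    Z = pca_project(X, k)
    print(k, Z.shape, round(nearest_centroid_accuracy(Z, y), 3))
```

Because the probe here is scored on the data it was fit on, the printed numbers are optimistic; a real comparison would use held-out folds, as in the study's cross-validated setup.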

Multi-round layer sweep: Across three independent data-collection rounds, the peak layer varied (L6 in R3 at 40% depth, L15 in R2 at 100% depth). The most methodologically careful run (R3, T=1.0, balanced labels, n=244) peaks at L6 (40% depth), consistent with other 1B+ models in the study peaking at mid-network.

Architecture note: Llama-3.2-1B uses Meta's Llama-3 architecture (grouped-query attention, SwiGLU MLP, RMSNorm, rotary position embeddings) with 16 transformer layers and a 2048-dimensional residual stream. It is a base (not instruction-tuned) model, so behavioral variation arises from temperature sampling.

SAE Format

Each SAE lives in a subfolder named {sae_id}/ containing:

  • sae_weights.safetensors: encoder/decoder weights
  • cfg.json: SAELens-compatible config

hook_name format: model.layers.{layer}.hook_resid_post

Training Details

Parameter Value
Hardware NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro
Training time ~400–600 seconds per SAE
Epochs 300
Batch size 128
Expansion factor 4x (2048 → 8192)
Activations resid_post collected during autoregressive generation
Training conditions mixed, deceptive_only, honest_only
LLM classifier Gemini 2.5 Flash (individual classification)

Known Limitations

JumpReLU threshold not learned (9 of 18 SAEs): All JumpReLU SAEs have threshold = 0, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
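
To see why a zero threshold collapses JumpReLU to ReLU: JumpReLU passes a pre-activation through unchanged when it exceeds the threshold and outputs zero otherwise. A minimal pure-Python sketch, using scalar functions for illustration rather than the training code:

```python
def relu(x):
    return x if x > 0.0 else 0.0

def jumprelu(x, threshold):
    """JumpReLU: identity above the threshold, zero at or below it."""
    return x if x > threshold else 0.0

pre_acts = [-1.5, -0.1, 0.0, 0.05, 0.4, 2.0]

# threshold = 0 reproduces ReLU exactly, so the threshold adds no sparsity.
same_as_relu = all(jumprelu(x, 0.0) == relu(x) for x in pre_acts)

# A positive threshold (0.1 is an arbitrary illustrative value) zeroes small activations.
with_threshold = [jumprelu(x, 0.1) for x in pre_acts]

print(same_as_relu)     # True
print(with_threshold)   # [0.0, 0.0, 0.0, 0.0, 0.4, 2.0]
```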

STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).

Small test set for significance: With n=244 total samples and small per-class counts in the test folds, permutation tests are unreliable at n=33 test samples. Best-SAE significance should be interpreted cautiously.
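
A sketch of a label-permutation test on probe accuracy shows why small test sets limit what can be concluded: with 33 test samples, accuracy moves in steps of about 3 percentage points, and the null distribution is wide. The predictions below are synthetic, and permutation_p_value is a hypothetical helper, not the study's code:

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_p_value(y_true, y_pred, n_perm=2000, seed=0):
    """One-sided p-value: share of label shuffles scoring at least the observed accuracy."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Synthetic test fold: 33 samples, 22 predicted correctly (about 66.7%).
y_true = [1] * 16 + [0] * 17
y_pred = [1] * 11 + [0] * 5 + [0] * 11 + [1] * 6

obs = accuracy(y_true, y_pred)
p = permutation_p_value(y_true, y_pred)
print(f"accuracy={obs:.3f}, p={p:.3f}")
```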

Only 3 layers covered: L2, L4, L6 represent only the first three trained layers. The full 16-layer sweep on raw activations shows the true peak is at L6 (R3) or later, so the SAE-trained layers may not include the optimal layer.

Loading Example

from safetensors.torch import load_file
import json

sae_id = "llama_jumprelu_L4_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [2048, 8192], W_dec: [8192, 2048]
# cfg["hook_name"] == "model.layers.4.hook_resid_post"
print(f"Training condition: {cfg['training_condition']}")

Usage

1. Load an SAE from this repo

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-llama-3-2-1b"
sae_id  = "llama_topk_L4_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B β€” load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [2048, 8192], b_enc [8192],
#       W_dec [8192, 2048], b_dec [2048], threshold [8192]

2. Hook into the model and collect residual-stream activations

These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json gives the exact HuggingFace transformers submodule path to hook. Llama-3.2-1B uses the standard Llama-3 layout, so the hook path is model.layers.{layer}.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Read hook_name from the cfg you already loaded (varies by SAE).
# This card shows it both as "model.layers.4" and as
# "model.layers.4.hook_resid_post"; drop the suffix if present so the
# path resolves to a real HuggingFace submodule (Python 3.9+).
hook_name = cfg["hook_name"].removesuffix(".hook_resid_post")

# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, hook_name.split("."), model)

activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 2048]
resid = activations["resid"][:, -1, :]  # last token position

3. Read feature activations

with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 8192], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()

Caveats and known limitations

Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "model.layers.4") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work; use the manual forward-hook pattern above instead.
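
If an SAE does need to be wired into a TransformerLens-based pipeline, the name can be translated by hand. A sketch assuming the HuggingFace path has the form model.layers.{N} (optionally followed by .hook_resid_post) and that the output of HuggingFace decoder layer N corresponds to TransformerLens blocks.{N}.hook_resid_post; to_transformer_lens_hook is a hypothetical helper, not a SAELens function:

```python
import re

# Matches "model.layers.4" and "model.layers.4.hook_resid_post".
_HF_LAYER = re.compile(r"^model\.layers\.(\d+)(?:\.hook_resid_post)?$")

def to_transformer_lens_hook(hf_hook_name: str) -> str:
    """Map 'model.layers.4' to 'blocks.4.hook_resid_post'."""
    match = _HF_LAYER.match(hf_hook_name)
    if match is None:
        raise ValueError(f"Unrecognized hook name: {hf_hook_name!r}")
    return f"blocks.{match.group(1)}.hook_resid_post"

print(to_transformer_lens_hook("model.layers.4"))
```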

SAELens version requirements.

  • topk architecture: SAELens ≥ 3.0
  • jumprelu architecture: SAELens ≥ 3.0
  • gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)

These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.

Citation

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}