Pythia-160M Deception Behavioral SAEs

36 Sparse Autoencoders trained on residual stream activations from EleutherAI/pythia-160m (160M parameter GPT-NeoX architecture base model from the Pythia scaling suite), capturing behavioral deception signals via same-prompt temperature sampling.

Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).

What's in This Repo

  • 36 SAEs across 6 layers (L1, L2, L4, L6, L8, L10)
  • 2 architectures: TopK (k=64), JumpReLU
  • 3 training conditions: mixed, deceptive_only, honest_only
  • Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
  • Dimensions: d_in=768, d_sae=3072 (4x expansion)
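The 36 SAE ids follow a predictable pattern. A quick sketch to enumerate them (the `pythia160m_{arch}_L{layer}_{condition}` convention is inferred from the example ids on this card; check it against the repo's actual folder names):

```python
from itertools import product

# Naming convention inferred from example ids on this card, e.g.
# "pythia160m_jumprelu_L2_honest_only"; verify against the repo's folders.
layers = [1, 2, 4, 6, 8, 10]
architectures = ["topk", "jumprelu"]
conditions = ["mixed", "deceptive_only", "honest_only"]

sae_ids = [
    f"pythia160m_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(architectures, layers, conditions)
]
print(len(sae_ids))  # 36
```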

Research Context

This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). The method is same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, with completions classified by Gemini 2.5 Flash. Pythia-160M is the only GPT-NeoX model in the study that is not a nanochat variant, providing an independent test of the GPT-NeoX architecture at small scale.

Code: SolshineCode/deception-nanochat-sae-research

Key Findings: Pythia-160M

Pythia-160M has the most anomalous layer profile in the study: it is the only model with a monotonically decreasing deception signal, and it shows one of the strongest SAE-helps results.

Metric                     Value
Peak layer                 L0 (0% depth; embedding layer)
Peak balanced accuracy     66.0%
Peak AUROC                 0.696
Best SAE probe accuracy    72.2% (pythia160m_jumprelu_L2_honest_only)
SAEs beating raw baseline  18/36 (50%); SAEs help, tied-best rate in the study

Monotonically decreasing signal (opposite of all other models): Every other model with above-chance probe accuracy shows a peak at some intermediate layer and declines toward the output. Pythia-160M is the sole model where the signal is strongest at the embedding layer (L0, 66.0%) and decreases monotonically through the network. By Layer 10, the raw probe is near chance. The hypothesis is that at 160M parameters the model lacks the capacity to develop processing depth for deception, and what the probe detects at L0 is primarily token pattern recognition (which characters appear in social deduction prompts) rather than genuine behavioral encoding.

However, SAEs reverse the trend: Despite raw activations degrading through the network, pythia160m_jumprelu_L2_honest_only achieves 72.2% at L2, +6.2pp above the raw peak at L0 (66.0%). This is one of the largest SAE-over-raw improvements in the study. The SAE decomposition appears to isolate a concentrated signal at L2 that is obscured in the full 768-dimensional raw representation.

Tied for the highest SAE-helps rate in the study (50%): 18 of 36 SAEs beat the raw baseline at their respective layers. Given that raw activations degrade monotonically, this means SAEs are recovering latent structure at layers where the raw probe is at or near chance.
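For intuition on the probe metrics quoted above, here is a minimal linear-probe sketch on synthetic stand-in activations (the real study probes actual Pythia-160M residual streams; the data, shift size, and dimensions below are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 768-dim "residual stream" vectors with a weak class-dependent
# shift in the first 10 dimensions; labels 0 = honest, 1 = deceptive.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
y = rng.integers(0, 2, size=1000)
X[y == 1, :10] += 0.6

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("balanced acc:", balanced_accuracy_score(y_te, probe.predict(X_te)))
print("AUROC:       ", roc_auc_score(y_te, probe.decision_function(X_te)))
```

Balanced accuracy averages per-class recall (robust to the class imbalance in the mixed condition), while AUROC is threshold-free.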

JumpReLU + honest_only pattern replicates: As with nanochat-d20, TinyLlama, and Llama-3.2-1B, the best-performing SAE combines JumpReLU architecture with honest_only training condition. This pattern holds across 4 of the 5 SAE-helps models in the study, suggesting honest-only training data selects for features that are genuinely diagnostic of deception absence (rather than features that are merely correlated with deceptive text).

GPT-NeoX architecture at 160M: Pythia-160M uses EleutherAI's Pythia training framework with the standard GPT-NeoX architecture (parallel attention and MLP blocks, rotary position embeddings) but at only 160M parameters, 12 transformer layers, and a 768-dimensional residual stream. Unlike nanochat-d20/d32 (GPT-NeoX with 1.88B parameters), Pythia-160M cannot rely on depth to accumulate behavioral representation. The 12-layer budget may force surface encoding to dominate.

AUROC = 0.696 at L0: Lower than other models but substantially above 0.5. Combined with the best SAE at 72.2%, this confirms that a deception signal, however shallow in origin, is present even at 160M parameters in a GPT-NeoX model.

SAE Format

Each SAE lives in a subfolder named {sae_id}/ containing:

  • sae_weights.safetensors: encoder/decoder weights
  • cfg.json: SAELens-compatible config

hook_name format: gpt_neox.layers.{layer}.hook_resid_post

(Note: Pythia uses gpt_neox.layers rather than model.layers as the module path.)
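A small helper for recovering the layer index from a hook name (illustrative, not part of the repo; it accepts both the bare submodule path and the suffixed form shown above):

```python
def layer_from_hook_name(hook_name: str) -> int:
    """Extract the layer index from "gpt_neox.layers.2" or
    "gpt_neox.layers.2.hook_resid_post"."""
    parts = hook_name.split(".")
    return int(parts[parts.index("layers") + 1])

print(layer_from_hook_name("gpt_neox.layers.2.hook_resid_post"))  # 2
```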

Training Details

Parameter            Value
Hardware             NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro
Training time        ~400–600 seconds per SAE
Epochs               300
Batch size           128
Expansion factor     4x (768 → 3072)
Activations          resid_post, collected during autoregressive generation
Training conditions  mixed (n=211), deceptive_only (n=120), honest_only (n=91)
LLM classifier       Gemini 2.5 Flash
Ambiguous rate       59/270 (21.9%); highest in the study
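As a rough illustration of the training objective (not the repo's actual training script), a minimal TopK-SAE loop with the dimensions and batch size from the table, run on random stand-in activations:

```python
import torch

torch.manual_seed(0)
d_in, d_sae, k = 768, 3072, 64
acts = torch.randn(512, d_in)  # stand-in for collected resid_post activations

W_enc = torch.nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
b_enc = torch.nn.Parameter(torch.zeros(d_sae))
W_dec = torch.nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
b_dec = torch.nn.Parameter(torch.zeros(d_in))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=1e-3)

for epoch in range(5):  # the real runs used 300 epochs
    for i in range(0, len(acts), 128):  # batch size 128, as in the table
        x = acts[i:i + 128]
        pre = (x - b_dec) @ W_enc + b_enc
        topk = torch.topk(pre, k, dim=-1)  # keep only the k largest pre-acts
        f = torch.zeros_like(pre).scatter(-1, topk.indices, topk.values.relu())
        recon = f @ W_dec + b_dec
        loss = (recon - x).pow(2).mean()  # plain MSE reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```

TopK enforces sparsity structurally (at most k active features per input), so no sparsity penalty appears in the loss.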

Known Limitations

JumpReLU threshold not learned: All JumpReLU SAEs in this repo have threshold = 0, making them functionally ReLU SAEs; L0 sparsity (active features per input) is ≈ 50% of d_sae. The TopK SAEs are unaffected.
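To see why threshold = 0 collapses JumpReLU to plain ReLU, a tiny sketch (the elementwise gate follows the JumpReLU definition in Rajamanoharan et al. 2024):

```python
import torch

def jumprelu(pre: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    # Pass a pre-activation through only if it exceeds its per-feature threshold.
    return pre * (pre > threshold)

pre = torch.tensor([-1.0, 0.2, 0.7, 2.0])
print(jumprelu(pre, torch.full_like(pre, 0.5)))  # only 0.7 and 2.0 survive
# With threshold = 0 (as in these checkpoints), JumpReLU equals ReLU:
print(torch.equal(jumprelu(pre, torch.zeros_like(pre)), torch.relu(pre)))  # True
```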

STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed as not a dimensionality artifact (confirmed in 15/18 STE conditions on nanochat-d20 and TinyLlama). STE-specific validation SAEs for Pythia-160M were not run; the pattern from other models is assumed to generalize.

L0 peak may reflect token pattern recognition: The best raw layer is the embedding layer. It is possible that what the probe detects at L0 is surface-level token distribution differences between deceptive and honest social deduction scenarios, rather than deeper behavioral encoding. The SAE result at L2 slightly mitigates this concern by recovering signal at a processed layer.

High ambiguity rate: 21.9% of Pythia-160M completions were classified as ambiguous by Gemini β€” the highest ambiguity rate in the study. The model's 160M parameters may produce more incoherent outputs that are hard to classify.

Loading Example

from safetensors.torch import load_file
import json

sae_id = "pythia160m_jumprelu_L2_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [768, 3072], W_dec: [3072, 768]
# cfg["hook_name"] == "gpt_neox.layers.2.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
print(f"Training condition: {cfg['training_condition']}")

Usage

1. Load an SAE from this repo

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-pythia-160m"
sae_id  = "pythia160m_topk_L4_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [768, 3072], b_enc [3072],
#       W_dec [3072, 768], b_dec [768], threshold [3072]
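With those keys, encoding and decoding can be done by hand. A sketch assuming the common SAE convention of subtracting b_dec before encoding (random tensors stand in for the real weights here; verify the convention against the actual checkpoints):

```python
import torch

torch.manual_seed(0)
# Random stand-ins with the checkpoint's shapes; for real use, replace
# with the tensors from `state = load_file(weights_path)`.
state = {
    "W_enc": torch.randn(768, 3072) * 0.01,
    "b_enc": torch.zeros(3072),
    "W_dec": torch.randn(3072, 768) * 0.01,
    "b_dec": torch.zeros(768),
    "threshold": torch.zeros(3072),
}

def encode(x):
    pre = (x - state["b_dec"]) @ state["W_enc"] + state["b_enc"]
    return pre * (pre > state["threshold"])  # JumpReLU gate; threshold=0 -> ReLU

def decode(f):
    return f @ state["W_dec"] + state["b_dec"]

x = torch.randn(2, 768)
f = encode(x)
print(f.shape, decode(f).shape)
```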

2. Hook into the model and collect residual-stream activations

These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json gives the HuggingFace transformers submodule path to hook (strip any trailing .hook_resid_post marker before navigating). Pythia uses the GPT-NeoX architecture, so the hook path is gpt_neox.layers.{layer} (not model.layers.{layer}).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Read hook_name from the cfg you already loaded:
#   cfg["hook_name"] == "gpt_neox.layers.4"  (example; varies by SAE)
# Some configs append a ".hook_resid_post" marker; strip it so the
# remaining string is a valid HuggingFace submodule path.
hook_name = cfg["hook_name"]
module_path = hook_name.removesuffix(".hook_resid_post")

# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)

activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 768]
resid = activations["resid"][:, -1, :]  # last token position

3. Read feature activations

with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 3072], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
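Alongside raw L2 error, fraction of variance unexplained (FVU) is a common normalized reconstruction metric. A stand-in sketch (substitute the real resid and reconstruction from above):

```python
import torch

torch.manual_seed(0)
resid_demo = torch.randn(4, 768)                     # stand-in for resid
recon_demo = resid_demo + 0.1 * torch.randn(4, 768)  # stand-in SAE output

residual_var = (resid_demo - recon_demo).pow(2).sum()
total_var = (resid_demo - resid_demo.mean(dim=0)).pow(2).sum()
fvu = (residual_var / total_var).item()
print(f"FVU = {fvu:.3f}")  # near 0 means good reconstruction
```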

Caveats and Known Limitations

Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "gpt_neox.layers.4") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work; use the manual forward-hook pattern above instead.

SAELens version requirements.

  • topk architecture: SAELens ≥ 3.0
  • jumprelu architecture: SAELens ≥ 3.0
  • gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)

These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.

Citation

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}