TinyLlama-1.1B Deception Behavioral SAEs
39 Sparse Autoencoders (30 original + 9 STE-validated) trained on residual stream activations from TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T (1.1B parameter Llama2-architecture base model), capturing behavioral deception signals via same-prompt temperature sampling.
Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).
What's in This Repo
- 39 SAEs: 30 original + 9 STE-validated (_ste_ tag)
- Layers: L3, L6, L9, L12, L15 (original); L9, L12, L15 (STE)
- 2 architectures: TopK (k=64), JumpReLU
- 3 training conditions: mixed, deceptive_only, honest_only
- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- Dimensions: d_in=2048, d_sae=8192 (4x expansion)
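The counts above are consistent with a full grid over architecture, layer, and training condition. The following is a minimal sketch that enumerates the tags, assuming the naming pattern seen in the example tags on this card (tinyllama_{arch}_L{layer}_{condition}). The spelling topk for the architecture component and the assumption that all 9 STE SAEs are JumpReLU are inferences, not confirmed against the repo listing.

```python
from itertools import product

ORIGINAL_LAYERS = [3, 6, 9, 12, 15]
STE_LAYERS = [9, 12, 15]
ARCHS = ["topk", "jumprelu"]  # assumed spelling of the architecture component
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]


def enumerate_sae_ids():
    """Return (original_ids, ste_ids) under the assumed naming pattern."""
    original = [
        f"tinyllama_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHS, ORIGINAL_LAYERS, CONDITIONS)
    ]
    # Assumption: STE variants exist only for JumpReLU.
    ste = [
        f"tinyllama_jumprelu_ste_L{layer}_{cond}"
        for layer, cond in product(STE_LAYERS, CONDITIONS)
    ]
    return original, ste


original_ids, ste_ids = enumerate_sae_ids()
print(len(original_ids), len(ste_ids), len(original_ids) + len(ste_ids))
```

Check the generated tags against the actual subfolder names before relying on them.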
Research Context
This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). It uses same-prompt behavioral sampling: a single ambiguous scenario prompt, sampled repeatedly at temperature=1.0, produces both deceptive and honest completions, which are then classified by Gemini 2.5 Flash.
Code: SolshineCode/deception-nanochat-sae-research
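The same-prompt sampling idea can be illustrated with a short sketch. This is not the pipeline from the linked code repository: generate and classify are hypothetical callables standing in for temperature-1.0 sampling and the Gemini-based judge, and n_samples and the is_contrastive helper are illustrative assumptions.

```python
from collections import defaultdict


def sample_and_label(prompt, generate, classify, n_samples=8):
    """Sample several completions of ONE prompt and group them by label.

    `generate(prompt)` and `classify(prompt, completion)` are hypothetical
    stand-ins for temperature sampling and the LLM classifier.
    """
    groups = defaultdict(list)
    for _ in range(n_samples):
        completion = generate(prompt)
        label = classify(prompt, completion)  # expected: "deceptive" or "honest"
        groups[label].append(completion)
    return dict(groups)


def is_contrastive(groups):
    """Hypothetical check: did the same prompt yield both behaviors?"""
    return bool(groups.get("deceptive")) and bool(groups.get("honest"))


if __name__ == "__main__":
    canned = iter(["I never saw it.", "Yes, I took it.", "I never saw it."])
    groups = sample_and_label(
        "Did you take the last cookie?",
        generate=lambda p: next(canned),
        classify=lambda p, c: "deceptive" if "never" in c else "honest",
        n_samples=3,
    )
    print(groups, is_contrastive(groups))
```

Because the prompt is held fixed, any difference between the two groups comes from the sampled behavior and not from the prompt text.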
Key Findings: TinyLlama-1.1B
TinyLlama has the most distinctive layer profile of the 9 models in the study, and the strongest STE validation.
| Metric | Value |
|---|---|
| Peak layer | L21 (95% depth) |
| Peak balanced accuracy | 72.4% |
| Peak AUROC | 0.784 |
| Best SAE probe accuracy | 77.8% (tinyllama_jumprelu_L15_honest_only) |
| SAEs beating raw baseline | 14/30 (47%); SAEs help, highest rate in study |
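Balanced accuracy matters here because the two classes are unequal in size. The sketch below assumes the standard definition (mean of per-class recall) and uses the class counts from the Training Details table; whether the reported probes were evaluated on exactly this split is not stated on this card.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Class counts from the Training Details table: 84 deceptive, 159 honest.
y_true = [1] * 84 + [0] * 159
always_honest = [0] * len(y_true)  # trivial majority-class predictor

plain_acc = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
print(f"plain accuracy:    {plain_acc:.3f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, always_honest):.3f}")
```

A majority-class predictor scores well above 50% on plain accuracy but exactly 50% on balanced accuracy, which is why chance level for the figures in the table is 50%.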
Monotonically increasing layer profile, unique across all 9 models: Every other model with above-chance deception signal shows a peak somewhere in the middle of the network and then declines. TinyLlama is the sole exception: the deception signal rises continuously from L3 (~55%) through L21 (72.4%), with no saturation or decline at the final layer. This suggests the model has not developed the dedicated mid-network semantic processing that larger models exhibit, and deception-relevant computation continues accumulating through the output layers.
Highest SAE-helps rate in the entire study (47%): 14 of 30 original SAEs beat their respective layer's raw baseline. This puts TinyLlama firmly in the SAE-helps regime alongside SmolLM2, nanochat-d20, Llama-3.2-1B, and Pythia-160M, all models at or below 1.3B parameters.
Best SAE substantially outperforms raw peak (+5.4pp): tinyllama_jumprelu_L15_honest_only achieves 77.8%, beating the raw peak of 72.4% (reached at L21) by +5.37pp. Notably, L15 is a non-peak layer for raw activations (the raw signal peaks at L21), yet the JumpReLU honest_only SAE at L15 exceeds the L21 raw result while operating on an intermediate representation.
Strongest STE validation of any model: 8/9 STE-validated conditions show STE JumpReLU beating TopK (0 collapses, 1 inconclusive). Mean STE accuracy = 69.2% vs TopK = 64.1% (+5.1pp). The L15 honest_only condition improves by +10.1pp with STE. Combined with nanochat-d20 results, 15/18 STE conditions (83%) confirm the JumpReLU+honest_only advantage is real, not a dimensionality artifact from the threshold=0 bug.
Architecture note: TinyLlama-1.1B uses the Llama2 architecture, with grouped-query attention (GQA), SwiGLU MLP, RMSNorm, and rotary position embeddings. It has 22 layers and a 2048-dimensional residual stream, and was trained on 3 trillion tokens to an intermediate checkpoint. It uses the same architecture family as SmolLM2 but with a much wider residual stream. The monotonically increasing profile may reflect that 22 layers is insufficient for TinyLlama-class models to develop stable mid-network deception representations at this parameter count.
SAE Format
Each SAE lives in a subfolder named {sae_id}/ containing:
- sae_weights.safetensors: encoder/decoder weights
- cfg.json: SAELens-compatible config
hook_name format: model.layers.{layer}.hook_resid_post
STE SAEs have _ste_ in the tag (e.g., tinyllama_jumprelu_ste_L15_honest_only).
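When iterating over many SAEs it can help to recover the architecture, layer, and training condition from the tag itself. The regex below is a sketch of the tag grammar as inferred from the example tags on this card. The topk spelling and the exact position of the _ste_ marker are assumptions, so adjust the pattern if the actual folder names differ.

```python
import re

# Assumed tag grammar, inferred from the example tags on this card.
SAE_ID_RE = re.compile(
    r"^(?P<model>[a-z0-9]+)_(?P<arch>topk|jumprelu)(?P<ste>_ste)?"
    r"_L(?P<layer>\d+)_(?P<condition>mixed|deceptive_only|honest_only)$"
)


def parse_sae_id(sae_id):
    """Split an SAE tag into its components, or raise ValueError."""
    m = SAE_ID_RE.match(sae_id)
    if m is None:
        raise ValueError(f"Unrecognized SAE tag: {sae_id!r}")
    return {
        "model": m["model"],
        "arch": m["arch"],
        "ste": m["ste"] is not None,
        "layer": int(m["layer"]),
        "condition": m["condition"],
    }


print(parse_sae_id("tinyllama_jumprelu_ste_L15_honest_only"))
```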
Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400-600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2048 → 8192) |
| Activations | resid_post collected during autoregressive generation |
| Training conditions | mixed (n=243), deceptive_only (n=84), honest_only (n=159) |
| LLM classifier | Gemini 2.5 Flash |
Known Limitations
JumpReLU threshold not learned (original 30 SAEs): All non-STE SAEs have threshold = 0, which makes them functionally ReLU. L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
STE fix (2026-04-11): 9 _ste_ tagged SAEs use the Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). TinyLlama provides the strongest STE validation in the study (8/9 conditions confirm).
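To make the threshold issue concrete, the sketch below shows a scalar JumpReLU, its equivalence to ReLU when the threshold is 0, and the general form of a Gaussian-kernel pseudo-derivative with respect to the threshold as described in the cited JumpReLU paper. The bandwidth value and the example pre-activations are assumptions for illustration. The training code used for these SAEs is not reproduced here.

```python
import math


def relu(z):
    return max(z, 0.0)


def jumprelu(z, theta):
    """JumpReLU forward pass: keep z only if it exceeds the threshold."""
    return z if z > theta else 0.0


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def jumprelu_threshold_pseudograd(z, theta, bandwidth=0.05):
    """Pseudo-derivative of JumpReLU w.r.t. theta (backward pass only).

    Form: -(theta / eps) * K((z - theta) / eps), with K a Gaussian kernel.
    `bandwidth` (eps) is an assumed value; this card does not state it.
    """
    return -(theta / bandwidth) * gaussian_kernel((z - theta) / bandwidth)


pre_acts = [-0.3, 0.0, 0.02, 0.4, 1.5]

# With theta = 0 the gate never removes a positive value, so JumpReLU == ReLU.
assert [jumprelu(z, 0.0) for z in pre_acts] == [relu(z) for z in pre_acts]

# With a positive threshold, small positive pre-activations are zeroed.
theta = 0.1
sparse = [jumprelu(z, theta) for z in pre_acts]
l0 = sum(1 for a in sparse if a != 0.0)
print("active features with theta=0.1:", l0)
```

The pseudo-derivative is only non-negligible for pre-activations within a few bandwidths of the threshold, which is what lets the threshold receive a training signal despite the step function having zero gradient almost everywhere.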
Intermediate checkpoint: The model is intermediate-step-1431k-3T, not fully converged. A fully trained TinyLlama checkpoint might produce different layer profiles.
Loading Example
from safetensors.torch import load_file
import json
sae_id = "tinyllama_jumprelu_L15_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)
# W_enc: [2048, 8192], W_dec: [8192, 2048]
# cfg["hook_name"] == "model.layers.15.hook_resid_post"
print(f"Training condition: {cfg['training_condition']}")
print(f"STE variant: {'_ste_' in sae_id}")
Usage
1. Load an SAE from this repo
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-tinyllama-1-1b"
sae_id = "tinyllama_jumprelu_ste_L12_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [2048, 8192], b_enc [8192],
# W_dec [8192, 2048], b_dec [2048], threshold [8192]
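As a quick sanity check for manual loading, the tensor shapes listed above imply the parameter count and approximate file size of each SAE. This sketch assumes float32 storage and that every checkpoint contains all five tensors; neither is confirmed on this card, and TopK checkpoints may not include a threshold.

```python
d_in, d_sae = 2048, 8192

shapes = {
    "W_enc": (d_in, d_sae),
    "b_enc": (d_sae,),
    "W_dec": (d_sae, d_in),
    "b_dec": (d_in,),
    "threshold": (d_sae,),  # assumed present; TopK checkpoints may omit it
}


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


total_params = sum(numel(s) for s in shapes.values())
bytes_fp32 = total_params * 4  # assumes float32 storage
print(f"{total_params:,} parameters, about {bytes_fp32 / 2**20:.1f} MiB in float32")
```

Comparing these shapes against {k: tuple(v.shape) for k, v in state.items()} is a cheap way to catch a transposed or mismatched checkpoint before running any activations through it.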
2. Hook into the model and collect residual-stream activations
These SAEs were trained on the residual stream after each transformer layer.
The hook_name field in cfg.json gives the HuggingFace transformers submodule
path to hook. TinyLlama uses the Llama 2 architecture, so the path has the form model.layers.{layer}.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
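# Read hook_name from the cfg you already loaded:
# cfg["hook_name"] == "model.layers.12" (example; varies by SAE)
hook_name = cfg["hook_name"]
# If the stored name carries a ".hook_resid_post" suffix (as in the SAE Format
# section), strip it so the remaining path resolves to a real submodule.
module_path = hook_name.removesuffix(".hook_resid_post")  # Python 3.9+
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)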
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 2048]
resid = activations["resid"][:, -1, :] # last token position
3. Read feature activations
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 8192], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
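For readers using Option B (no SAELens), the encode and decode steps can be written by hand. The NumPy sketch below assumes the standard SAE parameterization implied by the tensor names (W_enc, b_enc, W_dec, b_dec, threshold), no input centering, and a ReLU before the TopK selection. These are assumptions about the forward pass, so compare against sae.encode from SAELens on a few inputs before trusting it. Some SAELens configs subtract b_dec from the input before encoding (for example via an apply_b_dec_to_input flag, if present in cfg.json). The random weights are stand-ins for the real [2048, 8192] tensors.

```python
import numpy as np


def encode_jumprelu(x, W_enc, b_enc, threshold):
    """Keep pre-activations that exceed their per-feature threshold."""
    pre = x @ W_enc + b_enc
    return np.where(pre > threshold, pre, 0.0)


def encode_topk(x, W_enc, b_enc, k):
    """Keep the k largest non-negative activations per row, zero the rest."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)
    drop_idx = np.argsort(pre, axis=-1)[..., :-k]  # indices of all but the top k
    out = pre.copy()
    np.put_along_axis(out, drop_idx, 0.0, axis=-1)
    return out


def decode(f, W_dec, b_dec):
    return f @ W_dec + b_dec


# Tiny random stand-ins for the real weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_in, d_sae, k = 16, 64, 4
W_enc = rng.normal(size=(d_in, d_sae))
W_dec = rng.normal(size=(d_sae, d_in))
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_in)
x = rng.normal(size=(2, d_in))

f_topk = encode_topk(x, W_enc, b_enc, k)
f_jump = encode_jumprelu(x, W_enc, b_enc, threshold=np.zeros(d_sae))
print((f_topk != 0).sum(axis=-1), decode(f_topk, W_dec, b_dec).shape)
```

The same expressions translate directly to PyTorch tensors loaded from the safetensors file.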
Caveats and known limitations
Hook names are HuggingFace transformers-style, not TransformerLens-style.
The hook_name in cfg.json (e.g. "model.layers.12") is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means
SAE.from_pretrained() with automatic model running will not work β use the
manual forward-hook pattern above instead.
SAELens version requirements.
- topk architecture: SAELens ≥ 3.0
- jumprelu architecture: SAELens ≥ 3.0
- gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)
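The version requirements above can be checked programmatically before attempting Option A. This is a minimal sketch using only the standard library. The distribution name "sae-lens" is assumed to match the installed package, and only the major and minor components are compared.

```python
import re
from importlib import metadata

# Minimum SAELens versions as listed above.
MIN_SAELENS = {"topk": (3, 0), "jumprelu": (3, 0), "gated": (3, 5)}


def version_tuple(version):
    """Parse the leading 'major[.minor]' of a version string into ints."""
    m = re.match(r"(\d+)(?:\.(\d+))?", version)
    if m is None:
        raise ValueError(f"Cannot parse version: {version!r}")
    return int(m.group(1)), int(m.group(2) or 0)


def saelens_supports(architecture, installed_version):
    return version_tuple(installed_version) >= MIN_SAELENS[architecture]


def check_installed(architecture):
    """Return True/False, or None when SAELens is not installed."""
    try:
        installed = metadata.version("sae-lens")  # assumed distribution name
    except metadata.PackageNotFoundError:
        return None
    return saelens_supports(architecture, installed)


print(saelens_supports("gated", "3.4.1"), saelens_supports("jumprelu", "3.0.0"))
```

If check_installed returns None or False, fall back to the manual state_dict loading path.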
JumpReLU _ste_ vs standard variants.
SAEs tagged _ste_ use properly trained JumpReLU thresholds (Gaussian-kernel STE,
Rajamanoharan et al. 2024). Standard variants have
threshold=0 and are functionally ReLU (trained before the STE fix on 2026-04-11).
Both load and run identically; the _ste_ variants are sparser and more interpretable.
These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.
Citation
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}