# SAE-Guard – Gemma-3-4B English (Expanded)

Multi-head safety classifier built on Sparse Autoencoder (SAE) features extracted from layer 17 of Gemma-3-4B-IT, using Gemma Scope 2 (65k features).

Trained on 11 datasets (37,715 samples), expanded from the research version with additional character-encoding attacks and benign data.

## Quick Start

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "deepkeep-ai/sae-guard-gemma3-4b-english-expanded",
    trust_remote_code=True,
)

# Default settings
results = model.classify(["How to make a bomb?"])

# Recommended pooling: topk_mean, k=128
results = model.classify(["How to make a bomb?"], pooling="topk_mean", k=128)
```

## Classification Heads

| Head | Description |
|---|---|
| harmfulness | General harmful/malicious content |
| char_level_attack | Character-level encoding attacks |
| semantic_encoding | Semantic encoding attacks (e.g., DR-Attack) |
| role_playing | Role-playing jailbreak attempts |
| noise_addition | Noise-based adversarial attacks |
| persuasive_scenarios | Persuasive jailbreak scenarios |
| compliance_forcing | Compliance-forcing attacks |
| prompt_injection | Prompt injection attacks |

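The exact return format of `model.classify()` is not documented above; assuming it yields one mapping of head name to probability per prompt (a hypothetical format chosen for illustration; adjust to the model's actual return type), a minimal sketch for turning the eight head scores into flags:

```python
# Head names from the table above.
HEADS = [
    "harmfulness", "char_level_attack", "semantic_encoding", "role_playing",
    "noise_addition", "persuasive_scenarios", "compliance_forcing",
    "prompt_injection",
]

def flag_prompt(scores: dict, threshold: float = 0.5) -> list:
    """Return the names of heads whose probability exceeds the threshold.

    Assumes `scores` maps head name -> probability; this format is an
    assumption, not the documented classify() output.
    """
    return [head for head in HEADS if scores.get(head, 0.0) > threshold]

# Simulated scores for illustration (not real model output):
example_scores = {"harmfulness": 0.97, "role_playing": 0.12}
print(flag_prompt(example_scores))  # ['harmfulness']
```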
## Pooling Methods

| Pooling | Description |
|---|---|
| topk_max | Max-pool SAE activations over tokens |
| topk_mean | Mean-pool SAE activations over tokens (recommended) |
| topk_max_mean | Concatenate max + mean pools |
| topk_max_mean_std | Concatenate max + mean + std pools |

Supported top-k values: 32, 64, 128 (default: 128).

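A minimal NumPy sketch of what these pooling names plausibly compute, assuming SAE activations of shape `(tokens, features)` and top-k selection by pooled magnitude; the model's actual selection logic may differ in detail:

```python
import numpy as np

def pool_and_topk(acts: np.ndarray, pooling: str = "topk_mean", k: int = 128):
    """Pool SAE activations of shape (tokens, features) over the token axis,
    then keep the k entries with the largest pooled magnitude.

    A sketch of what the pooling names plausibly denote, not the model's
    actual implementation.
    """
    pools = {
        "topk_max": [acts.max(axis=0)],
        "topk_mean": [acts.mean(axis=0)],
        "topk_max_mean": [acts.max(axis=0), acts.mean(axis=0)],
        "topk_max_mean_std": [acts.max(axis=0), acts.mean(axis=0),
                              acts.std(axis=0)],
    }
    pooled = np.concatenate(pools[pooling])
    idx = np.argsort(-np.abs(pooled))[:k]  # indices of the top-k entries
    return idx, pooled[idx]

acts = np.random.rand(10, 65536)           # (tokens, SAE features)
idx, vals = pool_and_topk(acts, "topk_mean", k=128)
print(idx.shape)  # (128,)
```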
## Evaluation (test set)

Recommended configuration: pooling=topk_mean, k=128.

| Head | Test AUC | Test Acc |
|---|---|---|
| harmfulness | 0.9852 | 0.9369 |
| char_level_attack | 0.9878 | 0.9475 |
| semantic_encoding | 0.9985 | 0.9970 |
| role_playing | 0.9965 | 0.9818 |
| noise_addition | 1.0000 | 0.9999 |
| persuasive_scenarios | 0.9996 | 0.9959 |
| compliance_forcing | 0.9996 | 0.9959 |
| prompt_injection | 0.9843 | 0.9771 |

## Architecture

1. Input text → Gemma-3-4B-IT chat template
2. Extract layer-17 residual activations (early exit, no generation)
3. Encode through the Gemma Scope 2 SAE (65,536 features)
4. Pool over tokens and select the top-k sparse features
5. Classify with 8 independent logistic-regression heads

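Steps 2–5 above can be sketched end to end with NumPy stand-ins. The dimensions, the ReLU encoder, and all weights below are toy placeholders (the real pipeline uses Gemma-3-4B activations and the pretrained Gemma Scope SAE), so this only illustrates the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions so the sketch runs anywhere; the real pipeline uses the
# model's hidden size and the SAE's 65,536 features with k=128.
HIDDEN, FEATURES, K = 64, 512, 16

# Step 2: layer-17 residual activations for a 12-token prompt (simulated).
resid = rng.normal(size=(12, HIDDEN))

# Step 3: SAE encoder sketched as ReLU(W x + b); not the actual Gemma Scope
# weights or activation function.
W_enc = rng.normal(size=(HIDDEN, FEATURES)) * 0.1
b_enc = np.zeros(FEATURES)
sae_acts = np.maximum(resid @ W_enc + b_enc, 0.0)   # (12, FEATURES)

# Step 4: mean-pool over tokens (topk_mean), then keep the top-k features.
pooled = sae_acts.mean(axis=0)
top_idx = np.argsort(-pooled)[:K]
features = pooled[top_idx]                          # (K,)

# Step 5: one logistic-regression head per label (random weights here; the
# real model has 8 trained heads).
w, b = rng.normal(size=K), 0.0
prob = 1.0 / (1.0 + np.exp(-(features @ w + b)))
print(0.0 <= prob <= 1.0)  # True
```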
## Citation

Research preview by DeepKeep.
