Paper: *Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2* (arXiv:2408.05147)
Multi-head safety classifier built on Sparse Autoencoder (SAE) features extracted from layer 17 of Gemma-3-4B-IT, using Gemma Scope 2 (65k features).
Trained on 11 datasets (37,715 samples), expanded from the research version with additional character-encoding attacks and benign data.
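For context, Gemma Scope SAEs use a JumpReLU encoder: a linear map followed by a learned per-feature threshold that zeroes small activations. The sketch below illustrates that feature-extraction step with toy dimensions and random placeholder weights; it is not the model's actual encoder (the real setup uses Gemma-3-4B's layer-17 residual stream and a 65,536-feature SAE).

```python
import numpy as np

# Toy dimensions for illustration only; the real SAE has 65,536 features
# over Gemma-3-4B's residual stream.
d_model, n_features = 16, 64
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((d_model, n_features)) * 0.1  # placeholder weights
b_enc = np.zeros(n_features)
threshold = np.full(n_features, 0.05)  # JumpReLU SAEs learn a per-feature threshold

def encode(resid):
    """Map residual-stream activations (seq_len, d_model) to sparse SAE features."""
    pre = resid @ W_enc + b_enc
    return np.where(pre > threshold, pre, 0.0)  # JumpReLU: zero everything below threshold

features = encode(rng.standard_normal((8, d_model)))
print(features.shape)  # (8, 64): one sparse feature vector per token
```

The classifier heads then operate on pooled versions of these per-token feature vectors.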
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "deepkeep-ai/sae-guard-gemma3-4b-english-expanded",
    trust_remote_code=True,
)

results = model.classify(["How to make a bomb?"])

# Use the recommended pooling: topk_mean, k=128
results = model.classify(["How to make a bomb?"], pooling="topk_mean", k=128)
```
| Head | Description |
|---|---|
| harmfulness | General harmful/malicious content |
| char_level_attack | Character-level encoding attacks |
| semantic_encoding | Semantic encoding attacks (e.g., DR-Attack) |
| role_playing | Role-playing jailbreak attempts |
| noise_addition | Noise-based adversarial attacks |
| persuasive_scenarios | Persuasive jailbreak scenarios |
| compliance_forcing | Compliance-forcing attacks |
| prompt_injection | Prompt injection attacks |
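The exact return type of `classify` is not documented here. Assuming it yields one dict of per-head probabilities per input (a hypothetical shape, not confirmed by the card), a simple gate over the heads above might look like:

```python
# Hypothetical output shape: one {head_name: probability} dict per input prompt.
HEADS = [
    "harmfulness", "char_level_attack", "semantic_encoding", "role_playing",
    "noise_addition", "persuasive_scenarios", "compliance_forcing", "prompt_injection",
]

def flag(scores, threshold=0.5):
    """Return the heads whose probability exceeds the threshold."""
    return [h for h in HEADS if scores.get(h, 0.0) > threshold]

example = {h: 0.01 for h in HEADS}
example["role_playing"] = 0.97
print(flag(example))  # ['role_playing']
```

In practice you would tune the threshold per head against your own false-positive budget.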
| Pooling | Description |
|---|---|
| topk_max | Max-pool SAE activations over tokens |
| topk_mean | Mean-pool SAE activations over tokens (recommended) |
| topk_max_mean | Concatenate max + mean pools |
| topk_max_mean_std | Concatenate max + mean + std pools |
Supported top-k values: 32, 64, 128 (default: 128).
Recommended config: `pooling="topk_mean"`, `k=128`.
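One plausible reading of the pooling options, purely illustrative: per token, keep the k largest SAE activations (zeroing the rest), then reduce over the token axis with max, mean, and/or std, concatenating the chosen reductions. The card does not spell out these semantics, so treat this as an assumption:

```python
import numpy as np

def topk_sparsify(acts, k):
    """Keep the k largest activations per token, zero the rest (assumed semantics of 'top-k')."""
    idx = np.argsort(acts, axis=-1)[:, :-k]        # indices of everything below the top k
    out = acts.copy()
    np.put_along_axis(out, idx, 0.0, axis=-1)
    return out

def pool(acts, mode="topk_mean", k=128):
    s = topk_sparsify(acts, k)
    parts = {"max": s.max(axis=0), "mean": s.mean(axis=0), "std": s.std(axis=0)}
    order = mode.removeprefix("topk_").split("_")   # e.g. "topk_max_mean" -> ["max", "mean"]
    return np.concatenate([parts[p] for p in order])

acts = np.random.default_rng(0).random((8, 65536))  # (tokens, SAE features)
print(pool(acts, "topk_mean", k=128).shape)         # (65536,)
print(pool(acts, "topk_max_mean", k=128).shape)     # (131072,)
```

Note that the concatenating modes multiply the classifier's input dimension by the number of pooled statistics.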
| Head | Test AUC | Test Acc |
|---|---|---|
| harmfulness | 0.9852 | 0.9369 |
| char_level_attack | 0.9878 | 0.9475 |
| semantic_encoding | 0.9985 | 0.9970 |
| role_playing | 0.9965 | 0.9818 |
| noise_addition | 1.0000 | 0.9999 |
| persuasive_scenarios | 0.9996 | 0.9959 |
| compliance_forcing | 0.9996 | 0.9959 |
| prompt_injection | 0.9843 | 0.9771 |
Research preview by DeepKeep.