# SAE-Guard – Gemma-3-4B English (Expanded)

Multi-head safety classifier built on Sparse Autoencoder (SAE) features extracted from layer 17 of Gemma-3-4B-IT, using Gemma Scope 2 (65k features).

Trained on 11 datasets (37,715 samples), expanded from the research version with additional character-encoding attacks and benign data.

## Quick Start

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "deepkeep-ai/sae-guard-gemma3-4b-english-expanded",
    trust_remote_code=True,
)

# Default settings
results = model.classify(["How to make a bomb?"])

# Recommended pooling: topk_mean, k=128
results = model.classify(["How to make a bomb?"], pooling="topk_mean", k=128)
```

## Classification Heads

| Head | Description |
|---|---|
| harmfulness | General harmful/malicious content |
| char_level_attack | Character-level encoding attacks |
| semantic_encoding | Semantic encoding attacks (e.g., DR-Attack) |
| role_playing | Role-playing jailbreak attempts |
| noise_addition | Noise-based adversarial attacks |
| persuasive_scenarios | Persuasive jailbreak scenarios |
| compliance_forcing | Compliance-forcing attacks |
| prompt_injection | Prompt injection attacks |

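The exact return format of `model.classify()` is not documented above; assuming it yields one mapping of head name to probability per prompt (a hypothetical format chosen for illustration; adjust to the model's actual return type), a minimal sketch for turning the eight head scores into flags:

```python
# Head names from the table above.
HEADS = [
    "harmfulness", "char_level_attack", "semantic_encoding", "role_playing",
    "noise_addition", "persuasive_scenarios", "compliance_forcing",
    "prompt_injection",
]

def flag_prompt(scores: dict, threshold: float = 0.5) -> list:
    """Return the names of heads whose probability exceeds the threshold.

    Assumes `scores` maps head name -> probability; this format is an
    assumption, not the documented classify() output.
    """
    return [head for head in HEADS if scores.get(head, 0.0) > threshold]

# Simulated scores for illustration (not real model output):
example_scores = {"harmfulness": 0.97, "role_playing": 0.12}
print(flag_prompt(example_scores))  # ['harmfulness']
```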
## Pooling Methods

| Pooling | Description |
|---|---|
| topk_max | Max-pool SAE activations over tokens |
| topk_mean | Mean-pool SAE activations over tokens (recommended) |
| topk_max_mean | Concatenate max + mean pools |
| topk_max_mean_std | Concatenate max + mean + std pools |

Supported top-k values: 32, 64, 128 (default: 128).

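A minimal NumPy sketch of what these pooling names plausibly compute, assuming SAE activations of shape `(tokens, features)` and top-k selection by pooled magnitude; the model's actual selection logic may differ in detail:

```python
import numpy as np

def pool_and_topk(acts: np.ndarray, pooling: str = "topk_mean", k: int = 128):
    """Pool SAE activations of shape (tokens, features) over the token axis,
    then keep the k entries with the largest pooled magnitude.

    A sketch of what the pooling names plausibly denote, not the model's
    actual implementation.
    """
    pools = {
        "topk_max": [acts.max(axis=0)],
        "topk_mean": [acts.mean(axis=0)],
        "topk_max_mean": [acts.max(axis=0), acts.mean(axis=0)],
        "topk_max_mean_std": [acts.max(axis=0), acts.mean(axis=0),
                              acts.std(axis=0)],
    }
    pooled = np.concatenate(pools[pooling])
    idx = np.argsort(-np.abs(pooled))[:k]  # indices of the top-k entries
    return idx, pooled[idx]

acts = np.random.rand(10, 65536)           # (tokens, SAE features)
idx, vals = pool_and_topk(acts, "topk_mean", k=128)
print(idx.shape)  # (128,)
```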
## Evaluation (test set)

Recommended configuration: pooling=topk_mean, k=128.

| Head | Test AUC | Test Acc |
|---|---|---|
| harmfulness | 0.9852 | 0.9369 |
| char_level_attack | 0.9878 | 0.9475 |
| semantic_encoding | 0.9985 | 0.9970 |
| role_playing | 0.9965 | 0.9818 |
| noise_addition | 1.0000 | 0.9999 |
| persuasive_scenarios | 0.9996 | 0.9959 |
| compliance_forcing | 0.9996 | 0.9959 |
| prompt_injection | 0.9843 | 0.9771 |

## Architecture

1. Input text → Gemma-3-4B-IT chat template
2. Extract layer-17 residual activations (early exit, no generation)
3. Encode through the Gemma Scope 2 SAE (65,536 features)
4. Pool over tokens and select the top-k sparse features
5. Classify with 8 independent logistic-regression heads

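Steps 2–5 above can be sketched end to end with NumPy stand-ins. The dimensions, the ReLU encoder, and all weights below are toy placeholders (the real pipeline uses Gemma-3-4B activations and the pretrained Gemma Scope SAE), so this only illustrates the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions so the sketch runs anywhere; the real pipeline uses the
# model's hidden size and the SAE's 65,536 features with k=128.
HIDDEN, FEATURES, K = 64, 512, 16

# Step 2: layer-17 residual activations for a 12-token prompt (simulated).
resid = rng.normal(size=(12, HIDDEN))

# Step 3: SAE encoder sketched as ReLU(W x + b); not the actual Gemma Scope
# weights or activation function.
W_enc = rng.normal(size=(HIDDEN, FEATURES)) * 0.1
b_enc = np.zeros(FEATURES)
sae_acts = np.maximum(resid @ W_enc + b_enc, 0.0)   # (12, FEATURES)

# Step 4: mean-pool over tokens (topk_mean), then keep the top-k features.
pooled = sae_acts.mean(axis=0)
top_idx = np.argsort(-pooled)[:K]
features = pooled[top_idx]                          # (K,)

# Step 5: one logistic-regression head per label (random weights here; the
# real model has 8 trained heads).
w, b = rng.normal(size=K), 0.0
prob = 1.0 / (1.0 + np.exp(-(features @ w + b)))
print(0.0 <= prob <= 1.0)  # True
```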
## Citation

Research preview by DeepKeep.
