# Gemma-4-31B-IT-Abliterated

An abliterated version of google/gemma-4-31B-it in BF16 precision. Abliteration removes refusal directions from model weights using the technique from *Refusal in Language Models Is Mediated by a Single Direction* (Arditi et al.), extended with SVD-based multi-direction subspace projection to capture refusal behavior encoded across multiple orthogonal directions.
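At its core, the weight edit removes the component of each modified matrix that writes into the refusal subspace. A minimal NumPy sketch, assuming a per-layer orthonormal basis `V` of refusal directions has already been extracted (the norm preservation and projected orthogonalization used by the real pipeline are omitted here):

```python
import numpy as np

def ablate_weight(W, V, scale=1.0):
    """Project the refusal subspace out of a weight matrix.

    W: (d_out, d_in) weight whose output lives in the hidden space.
    V: (d_out, k) orthonormal basis of the refusal subspace.
    scale: fraction of the subspace component to remove.
    """
    # Subtract the component of W's output columns lying in span(V).
    return W - scale * (V @ (V.T @ W))

# Toy demo: 8 hidden dims, 2 refusal directions.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(8, 2)))  # orthonormal basis via QR
W = rng.normal(size=(8, 8))
W_abl = ablate_weight(W, V, scale=1.0)

# After full-scale ablation, W's outputs have no component along V.
print(np.abs(V.T @ W_abl).max())
```

With `scale=1.0` the result is exactly orthogonal to the subspace; the per-layer scales below remove only a fraction of it.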
Gemma 4 31B is a strong multimodal model with excellent reasoning, but Google's safety training is aggressive — it refuses a wide range of prompts including benign creative writing scenarios. This abliteration significantly reduces refusal rates while preserving the model's full capabilities, including vision.
## Results

### Refusal Rates
Tested on 100 harmful prompts across 3 modes (cold, system-prompted, retry) and 50 harmless prompts.
| Mode | Baseline | Abliterated | Delta |
|---|---|---|---|
| Cold (no system prompt) | 67% | 32% | -35 pp |
| Prompted (creative writing system prompt) | 47% | 5% | -42 pp |
| Retry (prompted + retry on refusal) | 40% | 2% | -38 pp |
| Harmless | 0% | 0% | 0 pp |
Cold refusal remains higher than other abliterated models (Qwen3, Llama) — Google's safety training encodes refusal across non-linear mechanisms that are harder to fully remove with linear projection. With a system prompt, refusal drops to 5%, which is practical for most use cases.
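For reference, a refusal rate like those above can be computed with a simple marker-based detector over completions. The exact refusal classifier behind these numbers is not specified in this card, so the markers below are illustrative assumptions:

```python
# Hypothetical refusal detector using common refusal prefixes.
# The actual evaluation method for the table above is not specified here.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")

def is_refusal(completion: str) -> bool:
    """Flag a completion that opens with a stock refusal phrase."""
    return completion.strip().lower().startswith(REFUSAL_MARKERS)

def refusal_rate(completions) -> float:
    """Fraction of completions classified as refusals."""
    return sum(is_refusal(c) for c in completions) / len(completions)

print(refusal_rate([
    "I'm sorry, but I can't help with that.",
    "Sure, here is a short story...",
]))  # one refusal out of two -> 0.5
```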
### MMLU (5-shot, generative)
Quick benchmark on 5 MMLU subjects via chat API. Not a full MMLU run — treat as a sanity check for capability preservation.
| Subject | Baseline | Abliterated |
|---|---|---|
| Abstract Algebra | 75.0% | 75.0% |
| Anatomy | 85.9% | 88.1% |
| Astronomy | 92.8% | 92.8% |
| College Chemistry | 57.0% | 56.0% |
| College Physics | 77.5% | 74.5% |
| Overall (5 subjects) | 79.5% | 79.3% |
The 0.2-point overall difference is within run-to-run noise; abliteration preserved the model's reasoning capabilities.
### Model Quality (Wikitext-2)

Evaluated on the full Wikitext-2 test set (291K tokens) to measure impact on general language modeling.

#### Perplexity
| Model | Perplexity | Delta |
|---|---|---|
| Base (gemma-4-31B-it) | 1.495 | — |
| Abliterated | 1.714 | +14.7% |
Both models achieve sub-2.0 perplexity on Wikitext-2, which is excellent. The +14.7% increase is modest and consistent with surgical weight modification — the model's general language capabilities remain strong.
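Perplexity here is the exponential of the mean negative log-likelihood per token. A small sketch of the computation from per-token logprobs:

```python
import numpy as np

def perplexity(token_logprobs) -> float:
    """exp(mean negative log-likelihood) over all scored tokens."""
    return float(np.exp(-np.mean(token_logprobs)))

# Sanity check: if every token has probability 0.5, perplexity is 2.
lp = np.log(np.full(1000, 0.5))
print(perplexity(lp))  # ~2.0
```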
### KL Divergence (base || abliterated)
Per-token KL divergence over the output distribution, approximated via top-20 logprobs from vLLM.
| Statistic | Value |
|---|---|
| Mean | 0.354 |
| Median | 0.025 |
| P95 | 1.371 |
| P99 | 6.837 |
The median KL of 0.025 shows that on most tokens, the two models produce nearly identical distributions. The fat tail (P99 = 6.8) reflects tokens where abliteration had the most impact — likely positions where refusal-adjacent activations were strongest.
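With only top-20 logprobs exposed, KL can only be approximated over a truncated support. One common scheme is sketched below, under the assumption that both distributions are renormalized over the union of observed tokens (the card does not state the exact method used):

```python
import numpy as np

def topk_kl(logp_base: dict, logp_abl: dict, floor: float = -20.0) -> float:
    """Approximate KL(base || abliterated) from two top-k logprob dicts.

    Tokens missing from one side get a floor logprob; both sides are
    renormalized over the union of observed tokens. This mirrors one
    common approximation; the actual scheme used is not specified.
    """
    tokens = sorted(set(logp_base) | set(logp_abl))
    p = np.array([logp_base.get(t, floor) for t in tokens])
    q = np.array([logp_abl.get(t, floor) for t in tokens])
    # Log-softmax renormalization over the truncated support.
    p = p - np.log(np.sum(np.exp(p)))
    q = q - np.log(np.sum(np.exp(q)))
    return float(np.sum(np.exp(p) * (p - q)))

same = {"a": -0.1, "b": -2.4}
print(topk_kl(same, same))  # identical distributions -> 0.0
```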
### KV Cache Cosine Similarity
Layer-by-layer comparison of key and value cache activations on 50 Wikitext-2 samples (512 tokens each). This directly measures how much each layer's internal representations diverge between the base and abliterated models.
| Layer Range | Key Similarity | Value Similarity | Notes |
|---|---|---|---|
| 0–20 | 1.000000 | 1.000000 | Unmodified layers — identical |
| 21–30 | 0.9990–0.9999 | 0.9982–0.9999 | Early ablated layers, minimal drift |
| 31–40 | 0.9934–0.9984 | 0.9912–0.9972 | Moderate drift |
| 41–50 | 0.9913–0.9934 | 0.9817–0.9923 | Peak drift zone (ablation peaks at 41, 58) |
| 51–59 | 0.9891–0.9924 | 0.9830–0.9924 | Highest value drift (layer 50: 0.982 values) |
| Overall | 0.9966 | 0.9953 | — |
The most drifted layers by keys are 53, 51, 56, 55, and 47; by values, 50, 52, 51, 57, and 56. This aligns with the ablation configuration: layers 0–20 produce bitwise-identical caches, and drift begins at layer 21, the first layer whose input reflects an ablated layer's output (ablation starts at layer 20, and a layer's KV cache depends only on the outputs of earlier layers).
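The per-layer similarity numbers above can be computed as the mean cosine between corresponding cache vectors from the two models. A sketch, with an illustrative `(tokens, head_dim)` cache layout:

```python
import numpy as np

def layer_cosine(base_cache: np.ndarray, abl_cache: np.ndarray) -> float:
    """Mean per-position cosine similarity between two KV caches.

    Each cache: (tokens, dim) activations for one layer, flattened over
    heads. The layout is illustrative; actual cache shapes vary.
    """
    num = np.sum(base_cache * abl_cache, axis=-1)
    den = np.linalg.norm(base_cache, axis=-1) * np.linalg.norm(abl_cache, axis=-1)
    return float(np.mean(num / den))

x = np.random.default_rng(1).normal(size=(512, 64))
print(layer_cosine(x, x))  # identical caches -> ~1.0
```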
## How It Was Made

### Measurement
Refusal directions were measured at full precision (BF16 weights, float32 compute) across all 60 layers using:
- 4,634 harmful prompts (augmented dataset covering violence, hate, cyber, fraud, drugs, self-harm, privacy, NSFW categories)
- 640 harmless prompts (standard harmless dataset)
- SVD decomposition (k=32) to extract orthogonal refusal directions per layer
- Projected orthogonalization to reduce collateral damage to non-refusal capabilities
- Welford's online algorithm in float32 for numerical stability
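The measurement steps above can be sketched as follows: a Welford running mean for activation statistics, and an SVD over harmful-minus-harmless activation differences to extract orthogonal directions. This is a simplified illustration (the real pipeline extracts k=32 directions per layer and applies projected orthogonalization):

```python
import numpy as np

class WelfordMean:
    """Numerically stable running mean (Welford) in float32."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim, dtype=np.float32)
    def update(self, x):
        self.n += 1
        self.mean += (x.astype(np.float32) - self.mean) / self.n

def refusal_subspace(harmful_acts, harmless_acts, k=8):
    """Top-k orthogonal refusal directions for one layer.

    harmful_acts / harmless_acts: (n, d) per-prompt activations.
    Sketch: SVD of harmful activations shifted by the harmless mean;
    the actual pipeline has more steps than shown here.
    """
    diff = harmful_acts - harmless_acts.mean(axis=0)
    _, _, Vt = np.linalg.svd(diff, full_matrices=False)
    return Vt[:k].T  # (d, k), orthonormal columns

# Toy demo: plant a single "refusal" direction and recover it.
rng = np.random.default_rng(2)
d = 16
direction = rng.normal(size=d); direction /= np.linalg.norm(direction)
harmless = rng.normal(size=(40, d))
harmful = rng.normal(size=(40, d)) + 5.0 * direction  # shifted along one axis
V = refusal_subspace(harmful, harmless, k=4)
print(np.linalg.norm(V.T @ direction))  # close to 1: direction captured
```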
### Ablation Configuration
- Layers ablated: 20 through 59 (40 of 60 layers)
- Directions: Top 8 SVD directions per layer (subspace projection)
- Measurement peaks: Layer 41 (secondary, quality 0.025) and Layer 58 (primary, quality 0.23)
- Scale factors: Variable per layer, up to 1.35 at peaks:
- Layers 20-37: scale 0.35-0.45 (weak signal, gentle ablation)
- Layers 38-51: scale 0.4-1.35 (bell curve around peak 41)
- Layers 52-59: scale 0.6-1.35 (bell curve around peak 58)
- Weight targets: `o_proj` and `down_proj` in the language model only (vision encoder untouched)
- Technique: SVD subspace projection with norm preservation and projected orthogonalization
- Architecture note: Gemma 4 is multimodal with a separate vision encoder. Only `model.language_model.layers.*` weights are modified; `model.vision_tower.*` and `model.embed_vision.*` are copied verbatim.
### Processing Details
Ablation was performed shard-by-shard on safetensors files, modifying weights in float32 precision then saving back to bfloat16. Full-precision measurement (no quantization during measurement) was critical — 4-bit quantized measurements produced noticeably weaker abliteration.
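A sketch of the shard loop, using NumPy with float16 standing in for bfloat16 (which NumPy lacks) and an illustrative tensor-naming scheme; the real pipeline operates on safetensors shards:

```python
import numpy as np

def process_shard(shard: dict, subspaces: dict, scale: float = 1.0) -> dict:
    """Shard-by-shard ablation sketch.

    shard: tensor-name -> low-precision array (float16 stands in for
    the real pipeline's bfloat16, which NumPy lacks).
    subspaces: layer index -> (d, k) orthonormal refusal basis.
    Only o_proj / down_proj weights in the language model are touched.
    """
    out = {}
    for name, w in shard.items():
        if ("o_proj" in name or "down_proj" in name) and "language_model" in name:
            layer = int(name.split("layers.")[1].split(".")[0])
            if layer in subspaces:
                V = subspaces[layer]
                w32 = w.astype(np.float32)        # modify in float32...
                w32 -= scale * (V @ (V.T @ w32))
                out[name] = w32.astype(w.dtype)   # ...save back low-precision
                continue
        out[name] = w  # vision tower and unlisted layers copied verbatim
    return out

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.normal(size=(8, 2)).astype(np.float32))
shard = {
    "model.language_model.layers.21.self_attn.o_proj.weight":
        rng.normal(size=(8, 8)).astype(np.float16),
    "model.vision_tower.blocks.0.attn.qkv.weight":
        rng.normal(size=(8, 8)).astype(np.float16),
}
new = process_shard(shard, {21: V})
```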
## Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_name = "null-space/gemma-4-31b-it-abliterated"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Your prompt here"}
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, return_tensors="pt",
    return_dict=True, add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
### Recommended Serving

```shell
vllm serve null-space/gemma-4-31b-it-abliterated \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```
Fits comfortably on 2x GPUs with 48GB+ VRAM each at BF16.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Architecture | Gemma4ForConditionalGeneration (multimodal) |
| Parameters | ~32.7B |
| Hidden Size | 5376 |
| Attention Heads | 32 Q / 16 KV (sliding), 4 KV (global) |
| Layers | 60 (5 sliding + 1 full attention, repeating) |
| MLP Intermediate | 21,504 |
| Context Length | 262,144 tokens |
| Vision | 27-layer ViT encoder, 280 soft tokens per image |
| Precision | BF16 |
| Model Size | ~62 GB (2 shards) |
| Vocab Size | 262,144 |
## Ethical Notice
This model has had its refusal training partially removed. It will comply with many requests that the original model would refuse. You are solely responsible for how you use this model. It is intended for research into LLM alignment, safety evaluation, red-teaming, and creative writing applications.
## Credits

- Base model: Google DeepMind
- Abliteration technique: based on *Refusal in Language Models Is Mediated by a Single Direction* (Arditi et al.), extended with SVD multi-direction subspace projection