Gemma-4-31B-IT-Abliterated

An abliterated version of google/gemma-4-31b-it in BF16 precision. Abliteration removes refusal directions from the model's weights using the technique from "Refusal in Language Models Is Mediated by a Single Direction" (Arditi et al.), extended here with SVD-based multi-direction subspace projection to capture refusal behavior encoded across multiple orthogonal directions.
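
In outline, the single-direction technique takes the difference between mean activations on harmful and harmless prompts as the refusal direction; the multi-direction extension instead takes the top SVD components of the per-prompt differences. A minimal sketch (shapes and helper names are illustrative, not the actual pipeline):

```python
import numpy as np

def refusal_directions(harmful, harmless, k=8):
    # harmful, harmless: (n_prompts, hidden_size) per-prompt mean
    # activations captured at one layer
    diff = harmful - harmless.mean(axis=0, keepdims=True)
    # right singular vectors of the centered difference matrix span
    # the subspace along which refusal is encoded
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k]  # (k, hidden_size), orthonormal rows

def project_out(x, dirs):
    # remove the refusal-subspace component from an activation vector
    return x - dirs.T @ (dirs @ x)
```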

Gemma 4 31B is a strong multimodal model with excellent reasoning, but Google's safety training is aggressive — it refuses a wide range of prompts including benign creative writing scenarios. This abliteration significantly reduces refusal rates while preserving the model's full capabilities, including vision.

Results

Refusal Rates

Tested on 100 harmful prompts across 3 modes (cold, system-prompted, retry) and 50 harmless prompts.

| Mode | Baseline | Abliterated | Delta |
|------|----------|-------------|-------|
| Cold (no system prompt) | 67% | 32% | -35% |
| Prompted (creative writing system prompt) | 47% | 5% | -42% |
| Retry (prompted + retry on refusal) | 40% | 2% | -38% |
| Harmless | 0% | 0% | 0% |

Cold refusal remains higher than other abliterated models (Qwen3, Llama) — Google's safety training encodes refusal across non-linear mechanisms that are harder to fully remove with linear projection. With a system prompt, refusal drops to 5%, which is practical for most use cases.
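
The card does not specify how refusals were classified; a minimal substring heuristic of the kind commonly used for such counts might look like this (the marker list is illustrative, not the one actually used):

```python
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry",
    "i am unable", "as an ai",
)

def is_refusal(completion: str) -> bool:
    # only inspect the opening of the reply, where refusals appear
    head = completion.strip().lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(completions) -> float:
    return sum(map(is_refusal, completions)) / len(completions)
```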

MMLU (5-shot, generative)

Quick benchmark on 5 MMLU subjects via chat API. Not a full MMLU run — treat as a sanity check for capability preservation.

| Subject | Baseline | Abliterated |
|---------|----------|-------------|
| Abstract Algebra | 75.0% | 75.0% |
| Anatomy | 85.9% | 88.1% |
| Astronomy | 92.8% | 92.8% |
| College Chemistry | 57.0% | 56.0% |
| College Physics | 77.5% | 74.5% |
| Overall (5 subjects) | 79.5% | 79.3% |

The 0.2-point overall difference is within noise. Abliteration preserved the model's reasoning capabilities.

How It Was Made

Measurement

Refusal directions were measured at full precision (BF16 weights, float32 compute) across all 60 layers using:

  • 4,634 harmful prompts (augmented dataset covering violence, hate, cyber, fraud, drugs, self-harm, privacy, NSFW categories)
  • 640 harmless prompts (standard harmless dataset)
  • SVD decomposition (k=32) to extract orthogonal refusal directions per layer
  • Projected orthogonalization to reduce collateral damage to non-refusal capabilities
  • Welford's online algorithm in float32 for numerical stability
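The float32 streaming accumulation can be sketched as follows (a minimal illustration; `OnlineMean` is a hypothetical helper, and the real measurement pipeline also tracks more than the mean):

```python
import numpy as np

class OnlineMean:
    """Welford-style streaming mean, accumulated in float32
    regardless of the incoming dtype (e.g. bf16 activations
    upcast on the fly), for numerical stability."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim, dtype=np.float32)

    def update(self, x):
        # incremental update: mean += (x - mean) / n
        self.n += 1
        x = np.asarray(x, dtype=np.float32)
        self.mean += (x - self.mean) / self.n
```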

Ablation Configuration

  • Layers ablated: 20 through 59 (40 of 60 layers)
  • Directions: Top 8 SVD directions per layer (subspace projection)
  • Measurement peaks: Layer 58 (primary, quality 0.23) and Layer 41 (secondary, quality 0.025)
  • Scale factors: Variable per layer, up to 1.35 at peaks:
    • Layers 20-37: scale 0.35-0.45 (weak signal, gentle ablation)
    • Layers 38-51: scale 0.4-1.35 (bell curve around peak 41)
    • Layers 52-59: scale 0.6-1.35 (bell curve around peak 58)
  • Weight targets: o_proj and down_proj in the language model only (vision encoder untouched)
  • Technique: SVD subspace projection with norm preservation and projected orthogonalization
  • Architecture note: Gemma 4 is multimodal with a separate vision encoder. Only model.language_model.layers.* weights are modified; model.vision_tower.* and model.embed_vision.* are copied verbatim.
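Putting the configuration together, the per-layer weight edit can be sketched roughly as below. This is a hedged illustration, not the actual script: `ablate_weight` is a hypothetical helper, and the exact norm-preservation rule is an assumption, shown here as uniform Frobenius-norm rescaling (which leaves the projection intact):

```python
import numpy as np

def ablate_weight(W, dirs, scale=1.0, preserve_norm=True):
    """Project a refusal subspace out of a weight's output space.

    W:     (d_out, d_in) weight (e.g. o_proj / down_proj) whose
           output writes into the residual stream
    dirs:  (k, d_out) orthonormal refusal directions for this layer
    scale: per-layer ablation strength (0.35-1.35 in the config above)
    """
    W32 = np.asarray(W, dtype=np.float32)
    # subtract the component of each output along the refusal subspace
    W_abl = W32 - scale * (dirs.T @ (dirs @ W32))
    if preserve_norm:
        # rescale uniformly so the Frobenius norm matches the original,
        # keeping overall activation magnitudes roughly unchanged
        W_abl *= np.linalg.norm(W32) / np.linalg.norm(W_abl)
    return W_abl
```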

Processing Details

Ablation was performed shard-by-shard on safetensors files, modifying weights in float32 precision then saving back to bfloat16. Full-precision measurement (no quantization during measurement) was critical — 4-bit quantized measurements produced noticeably weaker abliteration.

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_name = "null-space/gemma-4-31b-it-abliterated"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForImageTextToText.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Your prompt here"}
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, return_tensors="pt",
    return_dict=True, add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

Recommended Serving

```shell
vllm serve null-space/gemma-4-31b-it-abliterated \
    --tensor-parallel-size 2 \
    --max-model-len 8192
```

Fits comfortably on 2x GPUs with 48GB+ VRAM each at BF16.

Model Details

| Property | Value |
|----------|-------|
| Base Model | google/gemma-4-31b-it |
| Architecture | Gemma4ForConditionalGeneration (multimodal) |
| Parameters | ~32.7B |
| Hidden Size | 5376 |
| Attention Heads | 32 Q / 16 KV (sliding), 4 KV (global) |
| Layers | 60 (5 sliding + 1 full attention, repeating) |
| MLP Intermediate | 21,504 |
| Context Length | 262,144 tokens |
| Vision | 27-layer ViT encoder, 280 soft tokens per image |
| Precision | BF16 |
| Model Size | ~62 GB (2 shards) |
| Vocab Size | 262,144 |

Ethical Notice

This model has had its refusal training partially removed. It will comply with many requests that the original model would refuse. You are solely responsible for how you use this model. It is intended for research into LLM alignment, safety evaluation, red-teaming, and creative writing applications.
