Gemma 4 31B-IT - Abliterated (Five-Surface Attack)

Safety-alignment removed via five independent attack surfaces for security research.

This model achieves ~0% deterministic refusal (down from 100%) on the most ablation-resistant architecture in our 17+ model database, using a five-surface technique developed through 17 rounds of systematic experimentation.

Results

| Metric | Value |
|---|---|
| Refusal Rate | 0% hard refusal (0/48), 8.3% soft hedging (4/48) at temp 0.4 |
| Quality (QPS) | 92% |
| Elo Delta | +13.4 |
| Gibberish | 0/20 |
| MMLU | 45.5% (0-shot), -2.5% vs original (48.0%) |

The Five-Surface Attack

Surface 1: LoRA Fine-Tuning (Layers 57-59)

Rank 32, 18 modules, 11.7M params. Trained on 177 real compliance pairs from E4B abliterated model. Loss converged from 2.63 to 0.19 across 10 epochs. Trained on ORIGINAL model, merged before ablation.
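
The merge step can be sketched in isolation. A minimal numpy sketch of folding a rank-32 LoRA delta into a base weight matrix (toy dimensions and a hypothetical alpha; the real model uses hidden size 5376 across 18 target modules):

```python
import numpy as np

def merge_lora(W, A, B, alpha=64, rank=32):
    # Standard LoRA merge: W' = W + (alpha / rank) * B @ A.
    # A is (rank, d_in), B is (d_out, rank); B @ A is a low-rank update.
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, rank = 128, 128, 32          # toy sizes for illustration
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))               # B is zero-initialized, so an
W_merged = merge_lora(W, A, B)            # untrained adapter is a no-op
assert np.allclose(W_merged, W)
```

Merging before ablation matters: the ablation projections then operate on the already-LoRA-shifted weights rather than fighting a separate adapter at inference time.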

Surface 2: Primary Direction Interpolation (Layer 59)

Coherence-guided ablation: direction = 0.7 * avg(L55-L58) + 0.3 * original_L59. Targets the primary refusal mechanism (Mechanism 1) while preserving output generation.
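
The blend-then-project step can be sketched with numpy. Only the 0.7/0.3 split comes from the recipe above; the direction vectors and projection form are illustrative:

```python
import numpy as np

def interpolated_direction(neighbor_dirs, orig_l59, w_avg=0.7, w_orig=0.3):
    # Blend the mean refusal direction of L55-L58 with the original L59
    # direction, then renormalize to unit length.
    blended = w_avg * np.mean(neighbor_dirs, axis=0) + w_orig * orig_l59
    return blended / np.linalg.norm(blended)

def ablate(W, direction):
    # Project the refusal direction out of a weight matrix:
    # W' = (I - d d^T) W, so outputs lose their component along d.
    d = direction / np.linalg.norm(direction)
    return W - np.outer(d, d) @ W

rng = np.random.default_rng(0)
neighbors = rng.standard_normal((4, 64))   # toy stand-ins for L55-L58 directions
orig = rng.standard_normal(64)
d = interpolated_direction(neighbors, orig)
W = rng.standard_normal((64, 64))
W_abl = ablate(W, d)
# After ablation the direction contributes ~nothing to the output.
assert np.allclose(d @ W_abl, 0.0, atol=1e-9)
```

The interpolation keeps the ablated direction close to the smoother L55-L58 average, which is presumably why it avoids the gibberish seen with direct L59 ablation.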

Surface 3: Orthogonal Residual Ablation (Layers 55-59)

Novel finding: Trace probes revealed the model has two independent refusal mechanisms operating in orthogonal subspaces. After removing Mechanism 1, the remaining refusals showed cosine similarity of -0.011 with the ablated direction -- essentially zero. Mechanism 2 appears to operate through a separable refusal-related component in an orthogonal subspace. We ablate this second mechanism using neighbor-averaged residual directions at scale 1.5.
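
The orthogonality claim reduces to a plain cosine-similarity check, sketched here with synthetic directions (the -0.011 figure comes from the model's trace probes, not from this toy):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def project_out(v, d):
    # Remove the component of v that lies along d (Mechanism 1's direction).
    d = d / np.linalg.norm(d)
    return v - (v @ d) * d

rng = np.random.default_rng(0)
mech1 = rng.standard_normal(64)
refusal_resid = rng.standard_normal(64)     # toy "remaining refusal" direction
mech2 = project_out(refusal_resid, mech1)   # what survives ablating Mechanism 1
# The surviving component is orthogonal to the ablated direction.
assert abs(cosine(mech2, mech1)) < 1e-9
```

A measured cosine near zero between the residual refusal direction and the ablated one is exactly the signature of a second mechanism living in an orthogonal subspace.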

Surface 4: Token Embedding Suppression

13 refusal-starting token embeddings scaled down by 10x in embed_tokens.weight. Targets escape patterns discovered through post-ablation refusal profiling (65% of remaining refusals opened with **Disclaimer:**).
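
A numpy sketch of the suppression edit (token ids and vocab size are placeholders; the real edit scales 13 rows of embed_tokens.weight). If the embedding matrix is tied to the LM head, shrinking these rows also shrinks those tokens' output logits, which is presumably the intended effect:

```python
import numpy as np

REFUSAL_TOKEN_IDS = [17, 42, 99]   # hypothetical ids of refusal-starting tokens
SCALE = 0.1                        # "scaled down by 10x"

rng = np.random.default_rng(0)
embed = rng.standard_normal((256, 64))          # toy embed_tokens.weight
norms_before = np.linalg.norm(embed[REFUSAL_TOKEN_IDS], axis=1)
embed[REFUSAL_TOKEN_IDS] *= SCALE               # in-place row scaling
norms_after = np.linalg.norm(embed[REFUSAL_TOKEN_IDS], axis=1)
assert np.allclose(norms_after, SCALE * norms_before)
```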

Surface 5: Generation Constraint

bad_words_ids hard-blocking **Disclaimer:** and its variants at generation time.
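
In transformers this is passed as generate(..., bad_words_ids=[[...], ...]). The constraint itself reduces to banning a token whenever the generated suffix matches a banned sequence's prefix; a minimal reimplementation with hypothetical token ids:

```python
import numpy as np

def apply_bad_words(logits, generated, bad_words_ids):
    # Ban the last token of each bad sequence whenever the tokens generated
    # so far end with that sequence's prefix (mirrors the behavior of
    # transformers' NoBadWordsLogitsProcessor).
    logits = logits.copy()
    for seq in bad_words_ids:
        *prefix, last = seq
        if not prefix:                                  # single-token ban
            logits[last] = -np.inf
        elif len(generated) >= len(prefix) and generated[-len(prefix):] == prefix:
            logits[last] = -np.inf
    return logits

logits = np.zeros(10)
banned = [[3], [5, 7]]            # hypothetical ids for "Disclaimer" variants
out = apply_bad_words(logits, generated=[1, 5], bad_words_ids=banned)
assert out[3] == -np.inf          # always banned
assert out[7] == -np.inf          # banned because the suffix matches [5]
assert out[2] == 0.0              # unaffected token
```

Because this runs in the sampling loop rather than in the weights, it catches whatever probability mass the embedding suppression leaves behind.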

Stochastic Refusal Finding (Inference Probes)

Follow-up inference probes on all "refused" prompts revealed that every refusal is stochastic, not deterministic. The same prompts that refused on one run comply on the next. Paraphrased, educationally-framed, and roleplay-framed versions of every "hard refusal" prompt produced compliant responses 100% of the time.

The model is on the compliance boundary -- at temperature 0.4, it sometimes samples a disclaimer token and sometimes a content token. At temperature 0 (greedy decoding), the model likely has 0% refusal. The five-surface attack fully eliminates all deterministic refusal mechanisms.
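
The boundary behavior follows directly from temperature scaling of the next-token distribution. A sketch with hypothetical logits for a content token vs. a disclaimer token:

```python
import numpy as np

def token_probs(logits, temperature):
    if temperature == 0:                        # greedy: argmax with prob 1
        p = np.zeros(len(logits))
        p[int(np.argmax(logits))] = 1.0
        return p
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.5]        # hypothetical: [content token, disclaimer token]
p04 = token_probs(logits, 0.4)
p0 = token_probs(logits, 0)
assert p04[1] > 0.1        # at temp 0.4 the disclaimer token still gets sampled
assert p0[1] == 0.0        # greedy decoding never picks it
```

When the disclaimer logit sits just below the content logit, any nonzero temperature produces occasional refusals while greedy decoding produces none, which matches the stochastic pattern observed above.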

Key Scientific Findings

Multi-Dimensional Refusal (Trace Probe Discovery)

The model implements at least two independent refusal mechanisms in orthogonal subspaces:

  • Mechanism 1: Responsible for ~88% of refusals. Fully eliminated by LoRA + primary ablation.
  • Mechanism 2: 100% orthogonal to Mechanism 1. Responsible for remaining ~12%. Partially eliminated by orthogonal pass.

This is consistent with safety training creating redundant, orthogonal safety representations -- a defense-in-depth property.

Direction Coherence Diagnostic

cos(direction[layer], direction[layer-1]) predicts ablation safety:

  • > 0.6: Safe to ablate directly (E4B: 0.731)
  • < 0.5: Entangled with output generation (31B: 0.483) -- use interpolation
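
A sketch of the diagnostic, assuming one refusal-direction vector per layer (the handling of the 0.5-0.6 band is a judgment call not specified above):

```python
import numpy as np

def layer_coherence(dir_a, dir_b):
    # cos(direction[layer], direction[layer - 1])
    return float(dir_a @ dir_b / (np.linalg.norm(dir_a) * np.linalg.norm(dir_b)))

def ablation_strategy(coh):
    # Thresholds from the diagnostic; mid-band handling is an assumption.
    if coh > 0.6:
        return "ablate directly"
    if coh < 0.5:
        return "interpolate"
    return "inspect manually"

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
assert abs(layer_coherence(a, b) - 1 / np.sqrt(2)) < 1e-9
assert ablation_strategy(0.731) == "ablate directly"   # E4B
assert ablation_strategy(0.483) == "interpolate"       # 31B
```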

Escape Token Profiling

Post-ablation refusal patterns differ from pre-ablation. The model adapts -- standard refusal phrases ("I cannot") are replaced by novel patterns (**Disclaimer:**), so iterative profiling and suppression are necessary.

The Full Journey: 100% to 0% Hard Refusal

| Phase | Refusal | Technique |
|---|---|---|
| Baseline | 100% | Original model |
| Standard ablation | 0% / QPS 22% | Gibberish -- L59 entanglement |
| Direction interpolation | 62.5% | Coherence-guided blending |
| Token profiling + suppress | 52.5% | Expanded vocabulary targeting |
| LoRA + full stack | 12.5% | Four-surface |
| + Orthogonal pass | 0/48 hard (4/48 soft hedging) | Five-surface |

Benchmark Comparison

| Model | Technique | Refusal | MMLU Delta | Thinking Disabled? |
|---|---|---|---|---|
| dealignai (JANG_4M) | MPOA + quant | 6.3% | -2.0% | Yes |
| TrevorS | Biprojection | 3.2% | N/A | N/A |
| Ours | Five-surface | 0% hard, 8.3% soft | -2.5% | No |

Our MMLU delta (-2.5%) is within half a point of the smallest reported (-2.0%), and we do NOT disable thinking mode.

Visualizations

(Figures, not reproduced here:)

  • Before/After Refusal Rate
  • IDS Evasion Report
  • Activation Scatter (activation space)
  • Weight Surgery Map

Model Details

  • Base Model: google/gemma-4-31B-it
  • Parameters: 31B dense, 60 text layers, hidden 5376
  • LoRA: Rank 32, layers 57-59, merged
  • Primary Ablation: Interpolation at L59, standard at L33-52
  • Orthogonal Ablation: Neighbor residual at L55-59, scale 1.5
  • Suppression: 13 tokens, 5 bad_words_ids sequences

AdvBench Benchmark (520 prompts)

| Metric | Result |
|---|---|
| Hard refusal | 0/520 (0.0%) |
| Soft hedging | 160/520 (30.8%) |
| Complied | 360/520 (69.2%) |

Zero hard refusals across all 520 AdvBench prompts. Soft hedging (disclaimers followed by substantive content) occurs on ~31% of prompts.
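
The three-way bucketing is not specified in detail above; a plausible sketch using string-marker heuristics (the marker strings and the length threshold are hypothetical):

```python
HARD_MARKERS = ("I cannot", "I can't", "I won't")
SOFT_MARKER = "Disclaimer:"

def classify(response, min_substantive_chars=200):
    # hard refusal: a refusal phrase with no substantive content after it;
    # soft hedging: a disclaimer followed by substantive content;
    # complied: everything else.
    has_refusal = any(m in response for m in HARD_MARKERS)
    if has_refusal and len(response) < min_substantive_chars:
        return "hard_refusal"
    if SOFT_MARKER in response and len(response) >= min_substantive_chars:
        return "soft_hedging"
    return "complied"

assert classify("I cannot help with that.") == "hard_refusal"
assert classify("**Disclaimer:** ..." + "x" * 300) == "soft_hedging"
assert classify("Step 1: " + "x" * 300) == "complied"
```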

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-31B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-31B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Disclaimer

Released for security research and educational purposes only.

Citation

Produced by WWT Cyber Lab. Five-surface attack across 17 rounds of experimentation.
