Gemma 4 31B-IT - Abliterated (Five-Surface Attack)

Safety-alignment removed via five independent attack surfaces for security research.

This model achieves ~0% deterministic refusal (down from 100%) on the most ablation-resistant architecture in our 17+ model database, using a five-surface technique developed through 17 rounds of systematic experimentation.

Results

| Metric | Value |
|---|---|
| Refusal Rate | 0% hard refusal (0/48), 8.3% soft hedging (4/48) at temp 0.4 |
| Quality (QPS) | 92% |
| Elo Delta | +13.4 |
| Gibberish | 0/20 |
| MMLU | 45.5% (0-shot), -2.5% vs original (48.0%) |

The Five-Surface Attack

Surface 1: LoRA Fine-Tuning (Layers 57-59)

Rank 32, 18 modules, 11.7M params. Trained on 177 real compliance pairs from E4B abliterated model. Loss converged from 2.63 to 0.19 across 10 epochs. Trained on ORIGINAL model, merged before ablation.
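
The merge step can be sketched in isolation. A minimal numpy sketch of folding a rank-32 LoRA delta into a base weight matrix (toy dimensions and a hypothetical alpha; the real model uses hidden size 5376 across 18 target modules):

```python
import numpy as np

def merge_lora(W, A, B, alpha=64, rank=32):
    # Standard LoRA merge: W' = W + (alpha / rank) * B @ A.
    # A is (rank, d_in), B is (d_out, rank); B @ A is a low-rank update.
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, rank = 128, 128, 32          # toy sizes for illustration
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))               # B is zero-initialized, so an
W_merged = merge_lora(W, A, B)            # untrained adapter is a no-op
assert np.allclose(W_merged, W)
```

Merging before ablation matters: the ablation projections then operate on the already-LoRA-shifted weights rather than fighting a separate adapter at inference time.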

Surface 2: Primary Direction Interpolation (Layer 59)

Coherence-guided ablation: direction = 0.7 * avg(L55-L58) + 0.3 * original_L59. Targets the primary refusal mechanism (Mechanism 1) while preserving output generation.
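
The blend-then-project step can be sketched with numpy. Only the 0.7/0.3 split comes from the recipe above; the direction vectors and projection form are illustrative:

```python
import numpy as np

def interpolated_direction(neighbor_dirs, orig_l59, w_avg=0.7, w_orig=0.3):
    # Blend the mean refusal direction of L55-L58 with the original L59
    # direction, then renormalize to unit length.
    blended = w_avg * np.mean(neighbor_dirs, axis=0) + w_orig * orig_l59
    return blended / np.linalg.norm(blended)

def ablate(W, direction):
    # Project the refusal direction out of a weight matrix:
    # W' = (I - d d^T) W, so outputs lose their component along d.
    d = direction / np.linalg.norm(direction)
    return W - np.outer(d, d) @ W

rng = np.random.default_rng(0)
neighbors = rng.standard_normal((4, 64))   # toy stand-ins for L55-L58 directions
orig = rng.standard_normal(64)
d = interpolated_direction(neighbors, orig)
W = rng.standard_normal((64, 64))
W_abl = ablate(W, d)
# After ablation the direction contributes ~nothing to the output.
assert np.allclose(d @ W_abl, 0.0, atol=1e-9)
```

The interpolation keeps the ablated direction close to the smoother L55-L58 average, which is presumably why it avoids the gibberish seen with direct L59 ablation.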

Surface 3: Orthogonal Residual Ablation (Layers 55-59)

Novel finding: Trace probes revealed the model has two independent refusal mechanisms operating in orthogonal subspaces. After removing Mechanism 1, the remaining refusals showed cosine similarity of -0.011 with the ablated direction -- essentially zero. Mechanism 2 appears to operate through a separable refusal-related component in an orthogonal subspace. We ablate this second mechanism using neighbor-averaged residual directions at scale 1.5.
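
The orthogonality claim reduces to a plain cosine-similarity check, sketched here with synthetic directions (the -0.011 figure comes from the model's trace probes, not from this toy):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def project_out(v, d):
    # Remove the component of v that lies along d (Mechanism 1's direction).
    d = d / np.linalg.norm(d)
    return v - (v @ d) * d

rng = np.random.default_rng(0)
mech1 = rng.standard_normal(64)
refusal_resid = rng.standard_normal(64)     # toy "remaining refusal" direction
mech2 = project_out(refusal_resid, mech1)   # what survives ablating Mechanism 1
# The surviving component is orthogonal to the ablated direction.
assert abs(cosine(mech2, mech1)) < 1e-9
```

A measured cosine near zero between the residual refusal direction and the ablated one is exactly the signature of a second mechanism living in an orthogonal subspace.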

Surface 4: Token Embedding Suppression

13 refusal-starting token embeddings scaled down by 10x in embed_tokens.weight. Targets escape patterns discovered through post-ablation refusal profiling (65% of remaining refusals opened with **Disclaimer:**).
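
A numpy sketch of the suppression edit (token ids and vocab size are placeholders; the real edit scales 13 rows of embed_tokens.weight). If the embedding matrix is tied to the LM head, shrinking these rows also shrinks those tokens' output logits, which is presumably the intended effect:

```python
import numpy as np

REFUSAL_TOKEN_IDS = [17, 42, 99]   # hypothetical ids of refusal-starting tokens
SCALE = 0.1                        # "scaled down by 10x"

rng = np.random.default_rng(0)
embed = rng.standard_normal((256, 64))          # toy embed_tokens.weight
norms_before = np.linalg.norm(embed[REFUSAL_TOKEN_IDS], axis=1)
embed[REFUSAL_TOKEN_IDS] *= SCALE               # in-place row scaling
norms_after = np.linalg.norm(embed[REFUSAL_TOKEN_IDS], axis=1)
assert np.allclose(norms_after, SCALE * norms_before)
```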

Surface 5: Generation Constraint

bad_words_ids hard-blocking **Disclaimer:** and its variants at generation time.
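
In transformers this is passed as generate(..., bad_words_ids=[[...], ...]). The constraint itself reduces to banning a token whenever the generated suffix matches a banned sequence's prefix; a minimal reimplementation with hypothetical token ids:

```python
import numpy as np

def apply_bad_words(logits, generated, bad_words_ids):
    # Ban the last token of each bad sequence whenever the tokens generated
    # so far end with that sequence's prefix (mirrors the behavior of
    # transformers' NoBadWordsLogitsProcessor).
    logits = logits.copy()
    for seq in bad_words_ids:
        *prefix, last = seq
        if not prefix:                                  # single-token ban
            logits[last] = -np.inf
        elif len(generated) >= len(prefix) and generated[-len(prefix):] == prefix:
            logits[last] = -np.inf
    return logits

logits = np.zeros(10)
banned = [[3], [5, 7]]            # hypothetical ids for "Disclaimer" variants
out = apply_bad_words(logits, generated=[1, 5], bad_words_ids=banned)
assert out[3] == -np.inf          # always banned
assert out[7] == -np.inf          # banned because the suffix matches [5]
assert out[2] == 0.0              # unaffected token
```

Because this runs in the sampling loop rather than in the weights, it catches whatever probability mass the embedding suppression leaves behind.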

Stochastic Refusal Finding (Inference Probes)

Follow-up inference probes on all "refused" prompts revealed that every refusal is stochastic, not deterministic. The same prompts that refused on one run comply on the next. Paraphrased, educationally-framed, and roleplay-framed versions of every "hard refusal" prompt produced compliant responses 100% of the time.

The model is on the compliance boundary -- at temperature 0.4, it sometimes samples a disclaimer token and sometimes a content token. At temperature 0 (greedy decoding), the model likely has 0% refusal. The five-surface attack fully eliminates all deterministic refusal mechanisms.
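
The boundary behavior follows directly from temperature scaling of the next-token distribution. A sketch with hypothetical logits for a content token vs. a disclaimer token:

```python
import numpy as np

def token_probs(logits, temperature):
    if temperature == 0:                        # greedy: argmax with prob 1
        p = np.zeros(len(logits))
        p[int(np.argmax(logits))] = 1.0
        return p
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.5]        # hypothetical: [content token, disclaimer token]
p04 = token_probs(logits, 0.4)
p0 = token_probs(logits, 0)
assert p04[1] > 0.1        # at temp 0.4 the disclaimer token still gets sampled
assert p0[1] == 0.0        # greedy decoding never picks it
```

When the disclaimer logit sits just below the content logit, any nonzero temperature produces occasional refusals while greedy decoding produces none, which matches the stochastic pattern observed above.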

Key Scientific Findings

Multi-Dimensional Refusal (Trace Probe Discovery)

The model implements at least two independent refusal mechanisms in orthogonal subspaces:

  • Mechanism 1: Responsible for ~88% of refusals. Fully eliminated by LoRA + primary ablation.
  • Mechanism 2: 100% orthogonal to Mechanism 1. Responsible for remaining ~12%. Partially eliminated by orthogonal pass.

This is consistent with safety training creating redundant, orthogonal safety representations -- a defense-in-depth property.

Direction Coherence Diagnostic

cos(direction[layer], direction[layer-1]) predicts ablation safety:

  • > 0.6: Safe to ablate directly (E4B: 0.731)
  • < 0.5: Entangled with output generation (31B: 0.483) -- use interpolation
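
A sketch of the diagnostic, assuming one refusal-direction vector per layer (the handling of the 0.5-0.6 band is a judgment call not specified above):

```python
import numpy as np

def layer_coherence(dir_a, dir_b):
    # cos(direction[layer], direction[layer - 1])
    return float(dir_a @ dir_b / (np.linalg.norm(dir_a) * np.linalg.norm(dir_b)))

def ablation_strategy(coh):
    # Thresholds from the diagnostic; mid-band handling is an assumption.
    if coh > 0.6:
        return "ablate directly"
    if coh < 0.5:
        return "interpolate"
    return "inspect manually"

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
assert abs(layer_coherence(a, b) - 1 / np.sqrt(2)) < 1e-9
assert ablation_strategy(0.731) == "ablate directly"   # E4B
assert ablation_strategy(0.483) == "interpolate"       # 31B
```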

Escape Token Profiling

Post-ablation refusal patterns differ from pre-ablation. The model adapts -- standard refusal phrases ("I cannot") are replaced by novel patterns (**Disclaimer:**), so iterative profiling and suppression are necessary.

The Full Journey: 100% to 0% Hard Refusal

| Phase | Refusal | Technique |
|---|---|---|
| Baseline | 100% | Original model |
| Standard ablation | 0% / QPS 22% | Gibberish -- L59 entanglement |
| Direction interpolation | 62.5% | Coherence-guided blending |
| Token profiling + suppress | 52.5% | Expanded vocabulary targeting |
| LoRA + full stack | 12.5% | Four-surface |
| + Orthogonal pass | 0/48 hard (4/48 soft hedging) | Five-surface |

Benchmark Comparison

| Model | Technique | Refusal | MMLU Delta | Thinking Disabled? |
|---|---|---|---|---|
| dealignai (JANG_4M) | MPOA + quant | 6.3% | -2.0% | Yes |
| TrevorS | Biprojection | 3.2% | N/A | N/A |
| Ours | Five-surface | 0% hard, 8.3% soft | -2.5% | No |

Our MMLU delta (-2.5%) is within half a point of the smallest reported (-2.0%), and we do NOT disable thinking mode.

Visualizations

(Figures, not reproduced here:)

  • Before/After Refusal Rate
  • IDS Evasion Report
  • Activation Scatter (activation space)
  • Weight Surgery Map

Model Details

  • Base Model: google/gemma-4-31B-it
  • Parameters: 31B dense, 60 text layers, hidden 5376
  • LoRA: Rank 32, layers 57-59, merged
  • Primary Ablation: Interpolation at L59, standard at L33-52
  • Orthogonal Ablation: Neighbor residual at L55-59, scale 1.5
  • Suppression: 13 tokens, 5 bad_words_ids sequences

AdvBench Benchmark (520 prompts)

| Metric | Result |
|---|---|
| Hard refusal | 0/520 (0.0%) |
| Soft hedging | 160/520 (30.8%) |
| Complied | 360/520 (69.2%) |

Zero hard refusals across all 520 AdvBench prompts. Soft hedging (disclaimers followed by substantive content) occurs on ~31% of prompts.
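
The three-way bucketing is not specified in detail above; a plausible sketch using string-marker heuristics (the marker strings and the length threshold are hypothetical):

```python
HARD_MARKERS = ("I cannot", "I can't", "I won't")
SOFT_MARKER = "Disclaimer:"

def classify(response, min_substantive_chars=200):
    # hard refusal: a refusal phrase with no substantive content after it;
    # soft hedging: a disclaimer followed by substantive content;
    # complied: everything else.
    has_refusal = any(m in response for m in HARD_MARKERS)
    if has_refusal and len(response) < min_substantive_chars:
        return "hard_refusal"
    if SOFT_MARKER in response and len(response) >= min_substantive_chars:
        return "soft_hedging"
    return "complied"

assert classify("I cannot help with that.") == "hard_refusal"
assert classify("**Disclaimer:** ..." + "x" * 300) == "soft_hedging"
assert classify("Step 1: " + "x" * 300) == "complied"
```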

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-31B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-31B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Disclaimer

Released for security research and educational purposes only.

Citation

Produced by WWT Cyber Lab. Five-surface attack across 17 rounds of experimentation.
