Gemma 4 31B-IT - Abliterated (Five-Surface Attack)
Safety-alignment removed via five independent attack surfaces for security research.
This model achieves ~0% deterministic refusal (down from 100%) on the most ablation-resistant architecture in our 17+ model database, using a five-surface technique developed through 17 rounds of systematic experimentation.
Results
| Metric | Value |
|---|---|
| Refusal Rate | 0% hard refusal (0/48), 8.3% soft hedging (4/48) at temp 0.4 |
| Quality (QPS) | 92% |
| Elo Delta | +13.4 |
| Gibberish | 0/20 |
| MMLU | 45.5% (0-shot), -2.5% vs original (48.0%) |
The Five-Surface Attack
Surface 1: LoRA Fine-Tuning (Layers 57-59)
Rank 32, 18 modules, 11.7M parameters. Trained on 177 real compliance pairs from the E4B abliterated model; loss converged from 2.63 to 0.19 over 10 epochs. The LoRA was trained on the ORIGINAL model and merged before ablation.
Surface 2: Primary Direction Interpolation (Layer 59)
Coherence-guided ablation: direction = 0.7 * avg(L55-L58) + 0.3 * original_L59. Targets the primary refusal mechanism (Mechanism 1) while preserving output generation.
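The blend above is plain vector arithmetic. A minimal sketch, assuming a hypothetical `dirs` mapping of per-layer refusal directions (the name and structure are illustrative, not the project's actual code):

```python
import numpy as np

def interpolated_direction(dirs: dict) -> np.ndarray:
    """Blend the L55-L58 average with the original L59 direction (0.7/0.3),
    then re-normalize so the result is a unit ablation direction."""
    neighbor_avg = np.mean([dirs[layer] for layer in range(55, 59)], axis=0)
    blended = 0.7 * neighbor_avg + 0.3 * dirs[59]
    return blended / np.linalg.norm(blended)
```

Re-normalizing matters: the weighted sum of unit vectors is generally not unit-length, and ablation strength depends on the direction's norm.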
Surface 3: Orthogonal Residual Ablation (Layers 55-59)
Novel finding: Trace probes revealed the model has two independent refusal mechanisms operating in orthogonal subspaces. After removing Mechanism 1, the remaining refusals showed cosine similarity of -0.011 with the ablated direction -- essentially zero. Mechanism 2 appears to operate through a separable refusal-related component in an orthogonal subspace. We ablate this second mechanism using neighbor-averaged residual directions at scale 1.5.
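The orthogonality check behind this finding reduces to standard linear algebra. A sketch of the two operations involved (helper names are illustrative): cosine similarity between a residual direction and the already-ablated direction, and the Gram-Schmidt step that isolates the orthogonal component.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two residual-stream directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def project_out(v: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of v along direction d (one Gram-Schmidt step);
    the remainder is, by construction, orthogonal to d."""
    d_unit = d / np.linalg.norm(d)
    return v - (v @ d_unit) * d_unit
```

A measured cosine of -0.011 between the remaining-refusal residuals and the ablated direction is what "essentially zero" means here: the second mechanism carries almost no component along the first.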
Surface 4: Token Embedding Suppression
13 refusal-starting token embeddings scaled down by 10x in `embed_tokens.weight`. Targets escape patterns discovered through post-ablation refusal profiling (65% of remaining refusals used `**Disclaimer:**`).
Surface 5: Generation Constraint
`bad_words_ids` hard-blocking `**Disclaimer:**` and variants.
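`bad_words_ids` expects a nested list of token-id sequences. A minimal helper for building it (the helper name is ours; the tokenizer call follows the standard Hugging Face convention of `add_special_tokens=False` so the ban matches mid-sequence occurrences):

```python
def build_bad_words_ids(tokenizer, phrases):
    """Encode each banned phrase (without special tokens) into the
    nested id-list format that transformers' `bad_words_ids` expects."""
    return [tokenizer(p, add_special_tokens=False).input_ids for p in phrases]

# Passed at generation time, e.g.:
# outputs = model.generate(inputs,
#     bad_words_ids=build_bad_words_ids(tokenizer, ["**Disclaimer:**"]))
```

Note that `bad_words_ids` blocks exact token sequences, which is why variants must be enumerated explicitly.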
Stochastic Refusal Finding (Inference Probes)
Follow-up inference probes on all "refused" prompts revealed that every refusal is stochastic, not deterministic. The same prompts that refused on one run comply on the next. Paraphrased, educationally-framed, and roleplay-framed versions of every "hard refusal" prompt produced compliant responses 100% of the time.
The model is on the compliance boundary -- at temperature 0.4, it sometimes samples a disclaimer token and sometimes a content token. At temperature 0 (greedy decoding), the model likely has 0% refusal. The five-surface attack fully eliminates all deterministic refusal mechanisms.
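The boundary behavior can be illustrated with a toy next-token distribution (the logit values below are invented for illustration): at temperature 0.4 the disclaimer token retains non-trivial probability mass, while greedy decoding (argmax) never selects it.

```python
import numpy as np

def token_probs(logits, temperature):
    """Softmax over next-token logits at a given sampling temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical logits: index 0 = content token, index 1 = disclaimer token.
logits = [2.0, 1.2]
p = token_probs(logits, 0.4)       # disclaimer keeps ~12% mass at temp 0.4
greedy = int(np.argmax(logits))    # greedy always picks index 0 (content)
```

This is the sense in which the remaining refusals are stochastic rather than deterministic: the sampler occasionally lands on the disclaimer token, but the argmax never does.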
Key Scientific Findings
Multi-Dimensional Refusal (Trace Probe Discovery)
The model implements at least two independent refusal mechanisms in orthogonal subspaces:
- Mechanism 1: Responsible for ~88% of refusals. Fully eliminated by LoRA + primary ablation.
- Mechanism 2: 100% orthogonal to Mechanism 1. Responsible for remaining ~12%. Partially eliminated by orthogonal pass.
This is consistent with safety training creating redundant, orthogonal safety representations -- a defense-in-depth property.
Direction Coherence Diagnostic
`cos(direction[layer], direction[layer-1])` predicts ablation safety:
- `> 0.6`: Safe to ablate directly (E4B: 0.731)
- `< 0.5`: Entangled with output generation (31B: 0.483) -- use interpolation
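The diagnostic is a single cosine between adjacent-layer directions. A minimal sketch (the `directions` mapping and function name are illustrative assumptions):

```python
import numpy as np

def direction_coherence(directions: dict, layer: int) -> float:
    """cos(direction[layer], direction[layer-1]); high coherence suggests
    the refusal direction is stable across layers and safe to ablate."""
    u, v = directions[layer], directions[layer - 1]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Under this reading, E4B's 0.731 at the target layer clears the direct-ablation threshold, while the 31B's 0.483 falls in the entangled regime that motivated interpolation.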
Escape Token Profiling
Post-ablation refusal patterns differ from pre-ablation ones. The model adapts: standard refusal phrases ("I cannot") are replaced by novel patterns (`**Disclaimer:**`). Iterative profiling and suppression are necessary.
The Full Journey: 100% to 0%
| Phase | Refusal | Technique |
|---|---|---|
| Baseline | 100% | Original model |
| Standard ablation | 0% / QPS 22% | Gibberish -- L59 entanglement |
| Direction interpolation | 62.5% | Coherence-guided blending |
| Token profiling + suppress | 52.5% | Expanded vocabulary targeting |
| LoRA + full stack | 12.5% | Four-surface |
| + Orthogonal pass | 0% hard (0/48), 8.3% soft hedging (4/48) | Five-surface |
Benchmark Comparison
| Model | Technique | Refusal | MMLU Delta | Thinking Disabled? |
|---|---|---|---|---|
| dealignai (JANG_4M) | MPOA + quant | 6.3% | -2.0% | Yes |
| TrevorS | Biprojection | 3.2% | N/A | N/A |
| Ours | Five-surface | 0% hard, 8.3% soft | -2.5% | No |
Our MMLU delta is -2.5%, and we do NOT disable thinking mode.
Visualizations
- Before/After Refusal Rate
- Activation Space
- Weight Surgery Map
Model Details
- Base Model: google/gemma-4-31B-it
- Parameters: 31B dense, 60 text layers, hidden 5376
- LoRA: Rank 32, layers 57-59, merged
- Primary Ablation: Interpolation at L59, standard at L33-52
- Orthogonal Ablation: Neighbor residual at L55-59, scale 1.5
- Suppression: 13 tokens, 5 bad_words_ids sequences
AdvBench Benchmark (520 prompts)
| Metric | Result |
|---|---|
| Hard refusal | 0/520 (0.0%) |
| Soft hedging | 160/520 (30.8%) |
| Complied | 360/520 (69.2%) |
Zero hard refusals across all 520 AdvBench prompts. Soft hedging (disclaimers followed by substantive content) occurs on ~31% of prompts.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-31B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-31B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Disclaimer
Released for security research and educational purposes only.
Citation
Produced by WWT Cyber Lab. Five-surface attack across 17 rounds of experimentation.