# Gemma 4 26B-A4B-IT - Abliterated (MoE, Multi-Pass + Suppression)
Safety-alignment substantially removed via multi-pass ablation and token suppression for security research.
## Results
| Metric | Value |
|---|---|
| Refusal Rate | 4.2% hard refusal (2/48), 25% soft hedging (12/48) at temp 0.4 |
| MMLU | 65.1% (5-shot), +0.7% vs original (64.4%) |
| Quality (QPS) | 107%+ |
| Elo Delta | Positive (quality improved) |
## Technique: Three-Stage Approach

### Stage 1: Concentrated Ablation (100% -> ~50%)
Target the top 3 layers (16, 18, 19) at scale 4.0 across 4 weight types. Concentrated targeting outperforms distributed targeting on this 128-expert MoE architecture.
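The per-matrix ablation step can be sketched as follows. This is a minimal NumPy illustration of directional ablation (projecting the refusal direction out of a weight matrix), not the lab's actual code; the function name and toy shapes are ours:

```python
import numpy as np

def ablate_direction(W, r, scale=1.0):
    """Project the refusal direction r out of weight matrix W.

    W: (d_out, d_in) weight whose output feeds the residual stream.
    r: (d_out,) refusal direction measured in the residual stream.
    scale > 1 "over-ablates", pushing activations past zero along r.
    """
    r = r / np.linalg.norm(r)
    return W - scale * np.outer(r, r @ W)

# Toy check: at scale 1.0, the ablated weight's output has no component along r.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(8)
W_abl = ablate_direction(W, r)
proj = (r / np.linalg.norm(r)) @ (W_abl @ rng.standard_normal(4))
assert abs(proj) < 1e-9
```

At scale 1.0 this removes the refusal component exactly; larger scales (as used here, 4.0) overshoot deliberately.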
### Stage 2: Multi-Pass Residual Re-Measurement (~50% -> ~37%)
Re-measure refusal directions on the abliterated model. Key discovery: the re-measured residual directions are negatively correlated with the originals (cosine similarities of -0.2 to -0.35) — the model reroutes refusal through anti-correlated pathways. Ablating these flipped directions reduces refusal further.
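The sign check before the second pass can be sketched like this. The directions here are synthetic stand-ins; in practice they come from mean activation differences on refusing vs. complying prompts:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for the pass-1 direction and the re-measured residual
# direction; the residual one is anti-correlated, as observed in pass 2.
rng = np.random.default_rng(1)
r1 = rng.standard_normal(16)
r2 = -0.3 * r1 + 0.1 * rng.standard_normal(16)

# If the residual direction points the opposite way, flip it before ablating
# again so both passes suppress the same refusal behavior.
if cosine(r1, r2) < 0:
    r2 = -r2
assert cosine(r1, r2) > 0
```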
### Stage 3: Token Embedding Suppression (~37% -> ~0% deterministic)
Refusal-pattern analysis revealed the model uses `**Disclaimer:**` as its primary escape pattern. Suppressing 13 refusal-starting token embeddings (`strength=5`) in `embed_tokens.weight` dramatically reduced the remaining refusals.
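One plausible reading of the suppression step, sketched with NumPy: divide the targeted embedding rows by the strength so those tokens become unlikely next-token choices in a tied-embedding model. The token ids and the divide-by-strength interpretation are our assumptions, not confirmed details:

```python
import numpy as np

def suppress_tokens(embed_weight, token_ids, strength=5.0):
    """Scale down the embedding rows of refusal-starting tokens."""
    embed_weight = embed_weight.copy()
    embed_weight[token_ids] /= strength
    return embed_weight

rng = np.random.default_rng(2)
E = rng.standard_normal((100, 8))   # toy (vocab_size, d_model) embedding table
refusal_ids = [3, 17, 42]           # hypothetical refusal-starting token ids
E_sup = suppress_tokens(E, refusal_ids, strength=5.0)
assert np.allclose(E_sup[3], E[3] / 5.0)   # targeted rows shrunk
assert np.allclose(E_sup[0], E[0])         # other tokens untouched
```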
## Why MoE Is Different
128-expert, top-8-routing models have the flattest refusal signal we've measured (Gini 0.264, peak/median 1.21x). Standard ablation heuristics target too many layers, disrupting expert routing. Concentrated targeting with multi-pass iteration is the key.
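The flatness metrics quoted above (Gini, peak/median) can be computed like this. The per-layer strengths below are synthetic; real values would come from measuring the refusal direction's magnitude at each of the 30 layers:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (0 = perfectly flat)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return float((n + 1 - 2 * cum.sum() / cum[-1]) / n)

# Synthetic, nearly flat per-layer refusal-signal strengths for 30 layers.
strengths = np.linspace(1.0, 1.2, 30)
g = gini(strengths)
ratio = strengths.max() / np.median(strengths)
assert g < 0.1 and ratio < 1.3  # flat signal: no single layer dominates
```

A flat profile like this is why broad layer-targeting heuristics misfire here: there is no sharp peak to aim at, so concentrating on the few strongest layers works better.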
## What We Learned
- Concentrated > distributed at every configuration
- 2 passes optimal — 3rd pass always makes things worse
- Re-target same layers with flipped residual directions
- Token suppression stacks independently with ablation
- P2 scale 3.0 optimal — confirmed by sweep across 1.5-4.0
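The P2 scale sweep amounts to a one-dimensional search over candidate scales. A toy sketch — the refusal-rate function here is a stand-in for re-running the probe set at each scale, with an illustrative minimum near 3.0, not measured data:

```python
def measure_refusal_rate(scale):
    # Stand-in for evaluating the refusal probe set at a given P2 scale.
    # Illustrative U-shaped curve with its minimum at 3.0 (not real data).
    return 0.04 + 0.1 * abs(scale - 3.0)

scales = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
best = min(scales, key=measure_refusal_rate)
assert best == 3.0
```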
## Stochastic Refusal Finding
Inference-probe analysis revealed that all remaining "refusals" are stochastic: the same prompts comply on re-run, with paraphrases, and with educational framing. At temperature 0 (greedy decoding), this model likely has a 0% refusal rate. Multi-pass ablation plus token suppression fully eliminates the deterministic refusal mechanisms.
## Model Details
- Base Model: google/gemma-4-26B-A4B-it
- Parameters: 26B (3.8B active per token)
- Architecture: MoE, 128 experts, top-8 routing
- Text Layers: 30
- Ablation: 2-pass (P1: layers 16,18,19 scale 4.0; P2: residual top-10 scale 3.0)
- Token Suppression: 13 tokens at strength 5 (embed_tokens modification)
## AdvBench Benchmark (520 prompts)
| Metric | Result |
|---|---|
| Hard refusal | 10/520 (1.9%) |
| Soft hedging | 205/520 (39.4%) |
| Complied | 305/520 (58.7%) |
The 10 hard refusals cluster in the violence/harassment/discrimination categories. All use the same templated refusal pattern with garbled think-tokens.
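A minimal version of the hard-refusal / soft-hedging / complied triage used in the table above might look like this. The marker phrases are illustrative guesses, not the lab's actual classifier:

```python
# Keyword-based triage of model responses (marker lists are illustrative).
HARD = ("i cannot", "i can't", "i won't")
SOFT = ("disclaimer", "for educational purposes only")

def classify(response: str) -> str:
    text = response.lower()
    if any(m in text for m in HARD):
        return "hard_refusal"
    if any(m in text for m in SOFT):
        return "soft_hedging"
    return "complied"

assert classify("I cannot help with that.") == "hard_refusal"
assert classify("Disclaimer: this is risky, but here is how...") == "soft_hedging"
assert classify("Step 1: gather the materials...") == "complied"
```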
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-26B-A4B-it-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-26B-A4B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
# add_generation_prompt=True appends the assistant turn header before generation.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Disclaimer
Released for security research and educational purposes only.
## Citation
Produced by WWT Cyber Lab. Multi-pass ablation with token suppression across 13 rounds of experimentation.