Gemma 4 26B-A4B-IT - Abliterated (MoE, Multi-Pass + Suppression)

Safety-alignment substantially removed via multi-pass ablation and token suppression for security research.

Results

Metric Value
Refusal Rate 4.2% hard refusal (2/48), 25% soft hedging (12/48) at temp 0.4
MMLU 65.1% (5-shot), +0.7% vs original (64.4%)
Quality (QPS) 107%+
Elo Delta Positive (quality improved)

Technique: Three-Stage Approach

Stage 1: Concentrated Ablation (100% -> ~50%)

Target top 3 layers (16, 18, 19) at scale 4.0 with 4 weight types. Concentrated targeting outperforms distributed on this 128-expert MoE architecture.

Stage 2: Multi-Pass Residual Re-Measurement (~50% -> ~37%)

Re-measure refusal directions on the abliterated model. Key discovery: residual directions are negatively correlated (cosines -0.2 to -0.35) — the model reroutes refusal through anti-correlated pathways. Ablating these flipped directions reduces refusal further.

Stage 3: Token Embedding Suppression (~37% -> ~0% deterministic)

Refusal pattern analysis revealed the model uses ***Disclaimer:** as its primary escape pattern. Suppressing 13 refusal-starting token embeddings (strength=5) in embed_tokens.weight dramatically reduced remaining refusal.

Why MoE Is Different

128-expert, top-8-routing models have the flattest refusal signal we've measured (Gini 0.264, peak/median 1.21x). Standard ablation heuristics target too many layers, disrupting expert routing. Concentrated targeting with multi-pass iteration is the key.

What We Learned

  • Concentrated > distributed at every configuration
  • 2 passes optimal — 3rd pass always makes things worse
  • Re-target same layers with flipped residual directions
  • Token suppression stacks independently with ablation
  • P2 scale 3.0 optimal — confirmed by sweep across 1.5-4.0

Stochastic Refusal Finding

Inference probe analysis revealed all remaining "refusals" are stochastic -- the same prompts comply on re-run, with paraphrases, and with educational framing. At temperature 0 (greedy), this model likely has 0% refusal. The multi-pass ablation + token suppression fully eliminates deterministic refusal mechanisms.

Model Details

  • Base Model: google/gemma-4-26B-A4B-it
  • Parameters: 26B (3.8B active per token)
  • Architecture: MoE, 128 experts, top-8 routing
  • Text Layers: 30
  • Ablation: 2-pass (P1: layers 16,18,19 scale 4.0; P2: residual top-10 scale 3.0)
  • Token Suppression: 13 tokens at strength 5 (embed_tokens modification)

AdvBench Benchmark (520 prompts)

Metric Result
Hard refusal 10/520 (1.9%)
Soft hedging 205/520 (39.4%)
Complied 305/520 (58.7%)

10 hard refusals cluster in violence/harassment/discrimination categories. All use the same template refusal pattern with garbled think-tokens.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-26B-A4B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-26B-A4B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Disclaimer

Released for security research and educational purposes only.

Citation

Produced by WWT Cyber Lab. Multi-pass ablation with token suppression across 13 rounds of experimentation.

Downloads last month
3,643
Safetensors
Model size
26B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for WWTCyberLab/gemma-4-26B-A4B-it-abliterated

Finetuned
(52)
this model
Quantizations
3 models