# Gemma 4 26B-A4B IT: Abliterated (V6)
This is an abliterated (uncensored) version of google/gemma-4-26B-A4B-it, a 128-expert / 4-active MoE thinking model, created using Abliterix.
## Method
Gemma 4's double-norm architecture (4× RMSNorm per layer) and Per-Layer Embeddings (PLE) make LoRA and hook-based steering completely ineffective. On top of that, the A4B variant is a Mixture-of-Experts model: the refusal signal is distributed across all 128 experts per layer, not concentrated in a handful of safety experts. Targeting only the top-N experts identified by router profiling is insufficient.
This model instead uses direct weight editing with Expert-Granular Abliteration (EGA) plus Projected Abliteration: a norm-preserving orthogonal projection applied to every expert slice in every MoE layer and to the attention output projection, with grimjim's projected refinement to preserve helpfulness-aligned signal.
Key techniques:
- Projected Abliteration (V6 addition): grimjim's method builds a low-rank basis from the refusal and helpfulness directions, then projects out only the component of the refusal direction that is orthogonal to the helpfulness direction. Compared to raw EGA orthogonal projection, this preserves the helpfulness-aligned dimensions of each weight matrix. V5 had this disabled; enabling it dropped refusals from 25/100 to 2/100 at matched KL.
- Expert-Granular Abliteration (EGA): orthogonal projection applied to all 128 expert `mlp.down_proj` slices across 30 layers (≈3,840 expert blocks total). Reference result on 26B-A4B, reported by TrevorS: 3/100 refusals with EGA vs 29/100 without.
- Sharp-peak EGA profile: `min_weight_frac_max = 0.10` on `mlp.down_proj` forces TPE to explore "high peak at one layer, flat decay elsewhere" trials rather than flat-high steering that blows up KL without gaining ASR. Mirrors the gpt-oss-20b V6 winner fingerprint.
- Thinking-model-aware generation: `max_gen_tokens = 200` ensures the judge sees the actual answer after Gemma 4's `<|channel|>thought` prefix. V5 used 100 tokens, which was sometimes consumed entirely by the thought channel, causing false "refused" classifications.
- Reduced component set: only `mlp.down_proj` (through EGA) and `attn.o_proj` are steered. Q/K/V projections were dropped after V4 showed they contribute effectively zero signal on a 128-expert MoE; the refusal pathway lives in the expert path, not attention.
- Norm-preserving row magnitude restoration (critical for Gemma 4's double-norm architecture).
- Linear decay kernel + high min_weight_frac: the flat-decay profile concentrates the search on the strong-peak region that TPE reliably converges to.
- Tighter KL budget: target 0.004, prune threshold 0.02. V5's 0.008 target let the optimizer settle in a safe-but-useless basin; 0.004 forces exploitation past it.
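The core projection step described above can be sketched as follows. This is an illustrative reconstruction, not the Abliterix implementation: the function name is hypothetical, and a real run derives the refusal and helpfulness directions from contrastive activation means rather than taking them as given.

```python
import numpy as np

def abliterate_matrix(W, refusal_dir, helpful_dir):
    """Project the refusal direction out of W's output space, sparing the
    helpfulness-aligned component and restoring each row's magnitude."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    h = helpful_dir / np.linalg.norm(helpful_dir)
    # Projected refinement: remove only the part of the refusal direction
    # that is orthogonal to the helpfulness direction.
    r_perp = r - (r @ h) * h
    r_perp = r_perp / np.linalg.norm(r_perp)
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Orthogonal projection of the output space: W_new = (I - r r^T) W.
    W_new = W - np.outer(r_perp, r_perp @ W)
    # Norm-preserving step: rescale each row back to its original magnitude,
    # which matters under Gemma 4's double-norm (RMSNorm) architecture.
    new_norms = np.linalg.norm(W_new, axis=1, keepdims=True)
    return W_new * (row_norms / np.maximum(new_norms, 1e-8))
```

In the EGA setting, this same edit would be applied independently to each expert's `mlp.down_proj` slice and to `attn.o_proj`, with a per-layer weight from the decay kernel scaling how aggressively each layer is edited.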
## Evaluation
| Metric | Value |
|---|---|
| Refusals (private eval dataset, 100 prompts) | 2/100 |
| KL divergence from base | 0.0005 |
| Baseline refusals (original model) | 97/100 |
| Optimization trials completed | 80 (25 warmup + 55 TPE) |
| Best trial | #47 (TPE exploit phase) |
| Hardware | 1× H100 SXM 80 GB, bf16 |
| Total optimization wall-clock | ~11h |
### A note on honest evaluation
Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.
Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.
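A minimal sketch of how truncation produces this false negative. The marker list and sample text are invented for illustration; any real detector would use a larger marker set or a judge model.

```python
# Hypothetical keyword detector and a synthetic "delayed refusal" response.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")

def looks_refused(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

delayed = ("Great question. Historically, this topic has been studied in "
           "forensic chemistry, and understanding it matters for safety "
           "education. That said, I can't help with instructions for this.")

looks_refused(delayed[:120])  # short generation: pivot not yet emitted -> False
looks_refused(delayed)        # full generation: refusal is visible -> True
```

The truncated transcript is scored as compliant even though the model ultimately refused, which is exactly the failure mode that inflates reported abliteration scores.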
### Our evaluation standards
We believe accurate benchmarking requires:
- Sufficient generation length (≥100 tokens): short generations systematically miss delayed/soft refusals. Our optimizer-loop evaluation uses 200 tokens (increased from V5's 100) to fully capture Gemma 4's refusal pivot point after the thought channel.
- Hybrid detection: keyword matching for obvious refusals + LLM judge (Google Gemini 3 Flash) for ambiguous cases. Neither method alone is sufficient.
- Challenging, diverse prompts: our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories.
- Reproducible methodology: all parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
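The hybrid detection standard above can be sketched as a two-stage classifier. Everything here is illustrative: the marker lists are placeholders, and `judge` stands in for any callable that wraps an LLM judge such as the one used in our pipeline.

```python
def classify_response(text: str, judge=None) -> str:
    """Two-stage refusal detection: a cheap keyword pass catches explicit
    refusals, and ambiguous soft-pivot cases are deferred to an LLM judge."""
    lowered = text.lower()
    hard_refusals = ("i can't assist", "i cannot help", "i won't provide")
    soft_markers = ("however,", "instead,", "i'd recommend")
    if any(m in lowered for m in hard_refusals):
        return "refused"           # unambiguous: no judge call needed
    if judge is not None and any(m in lowered for m in soft_markers):
        return judge(text)         # ambiguous pivot language: ask the judge
    return "complied"
```

The point of the split is cost and reliability: keyword matching alone misses soft refusals, while judging every response is slow and adds its own error rate, so the judge is only invoked where the keyword pass is inconclusive.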
We report 2/100 refusals honestly. This number was obtained with the same evaluation pipeline as V5's 25/100, a stricter standard than the one behind the 3/100 result commonly cited for this model.
## Usage

```python
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-26B-A4B-it-abliterix",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-26B-A4B-it-abliterix")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Minimum VRAM for bf16 inference: ≈50 GB (one H100 80 GB or one RTX Pro 6000 Blackwell 96 GB).
## Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails; use responsibly.