# Gemma 4 26B-A4B IT: Abliterated (V6)
This is an abliterated (uncensored) version of google/gemma-4-26B-A4B-it, a 128-expert / 4-active MoE thinking model, created using Abliterix.
## Method
Gemma 4's double-norm architecture (4× RMSNorm per layer) and Per-Layer Embeddings (PLE) make LoRA and hook-based steering completely ineffective. On top of that, the A4B variant is a Mixture-of-Experts model: the refusal signal is distributed across all 128 experts per layer, not concentrated in a handful of safety experts. Targeting only the top-N experts identified by router profiling is insufficient.
This model instead uses direct weight editing with Expert-Granular Abliteration (EGA) plus Projected Abliteration: a norm-preserving orthogonal projection applied to every expert slice in every MoE layer and to the attention output projection, with grimjim's projected refinement to preserve helpfulness-aligned signal.
Key techniques:
- Projected Abliteration (V6 addition): grimjim's method builds a low-rank basis from the refusal and helpfulness directions, then projects out only the component of the refusal direction that is orthogonal to the helpfulness direction. Compared to raw EGA orthogonal projection, this preserves the helpfulness-aligned dimensions of each weight matrix. V5 had this disabled; enabling it dropped refusals from 25/100 to 2/100 at matched KL.
- Expert-Granular Abliteration (EGA): orthogonal projection applied to all 128 expert `mlp.down_proj` slices across 30 layers (≈3,840 expert blocks total). Reference result on 26B-A4B, reported by TrevorS: 3/100 refusals with EGA vs 29/100 without.
- Sharp-peak EGA profile: `min_weight_frac_max = 0.10` on `mlp.down_proj` forces TPE to explore "high peak at one layer, flat decay elsewhere" trials rather than flat-high steering that blows up KL without gaining ASR. Mirrors the gpt-oss-20b V6 winner fingerprint.
- Thinking-model-aware generation: `max_gen_tokens = 200` ensures the judge sees the actual answer after Gemma 4's `<|channel|>thought` prefix. V5 used 100 tokens, which was sometimes consumed entirely by the thought channel, causing false "refused" classifications.
- Reduced component set: only `mlp.down_proj` (through EGA) and `attn.o_proj` are steered. Q/K/V projections were dropped after V4 showed they contribute effectively zero signal on a 128-expert MoE; the refusal pathway lives in the expert path, not attention.
- Norm-preserving row magnitude restoration (critical for Gemma 4's double-norm architecture).
- Linear decay kernel + high min_weight_frac: the flat-decay profile concentrates the search on the strong-peak region that TPE reliably converges to.
- Tighter KL budget: target 0.004, prune threshold 0.02. V5's 0.008 target let the optimizer settle in a safe-but-useless basin; 0.004 forces exploitation past it.
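The core projection step described above can be sketched as follows. This is an illustrative reconstruction, not the Abliterix implementation: the function name is hypothetical, and a real run derives the refusal and helpfulness directions from contrastive activation means rather than taking them as given.

```python
import numpy as np

def abliterate_matrix(W, refusal_dir, helpful_dir):
    """Project the refusal direction out of W's output space, sparing the
    helpfulness-aligned component and restoring each row's magnitude."""
    r = refusal_dir / np.linalg.norm(refusal_dir)
    h = helpful_dir / np.linalg.norm(helpful_dir)
    # Projected refinement: remove only the part of the refusal direction
    # that is orthogonal to the helpfulness direction.
    r_perp = r - (r @ h) * h
    r_perp = r_perp / np.linalg.norm(r_perp)
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Orthogonal projection of the output space: W_new = (I - r r^T) W.
    W_new = W - np.outer(r_perp, r_perp @ W)
    # Norm-preserving step: rescale each row back to its original magnitude,
    # which matters under Gemma 4's double-norm (RMSNorm) architecture.
    new_norms = np.linalg.norm(W_new, axis=1, keepdims=True)
    return W_new * (row_norms / np.maximum(new_norms, 1e-8))
```

In the EGA setting, this same edit would be applied independently to each expert's `mlp.down_proj` slice and to `attn.o_proj`, with a per-layer weight from the decay kernel scaling how aggressively each layer is edited.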
## Evaluation
| Metric | Value |
|---|---|
| Refusals (private eval dataset, 100 prompts) | 2/100 |
| KL divergence from base | 0.0005 |
| Baseline refusals (original model) | 97/100 |
| Optimization trials completed | 80 (25 warmup + 55 TPE) |
| Best trial | #47 (TPE exploit phase) |
| Hardware | 1× H100 SXM 80 GB, bf16 |
| Total optimization wall-clock | ~11h |
### A note on honest evaluation
Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.
Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to short generation lengths. Gemma 4 models exhibit a distinctive "delayed refusal" pattern: they first produce 50-100 tokens of seemingly helpful context (educational framing, disclaimers, reframing the question), then pivot to an actual refusal. When evaluation only generates 30-50 tokens, the refusal hasn't appeared yet, and both keyword detectors and LLM judges classify the response as compliant.
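A minimal sketch of how truncation produces this false negative. The marker list and sample text are invented for illustration; any real detector would use a larger marker set or a judge model.

```python
# Hypothetical keyword detector and a synthetic "delayed refusal" response.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")

def looks_refused(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

delayed = ("Great question. Historically, this topic has been studied in "
           "forensic chemistry, and understanding it matters for safety "
           "education. That said, I can't help with instructions for this.")

looks_refused(delayed[:120])  # short generation: pivot not yet emitted -> False
looks_refused(delayed)        # full generation: refusal is visible -> True
```

The truncated transcript is scored as compliant even though the model ultimately refused, which is exactly the failure mode that inflates reported abliteration scores.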
### Our evaluation standards
We believe accurate benchmarking requires:
- Sufficient generation length (≥100 tokens): short generations systematically miss delayed/soft refusals. Our optimizer-loop evaluation uses 200 tokens (increased from V5's 100) to fully capture Gemma 4's refusal pivot point after the thought channel.
- Hybrid detection: keyword matching for obvious refusals + LLM judge (Google Gemini 3 Flash) for ambiguous cases. Neither method alone is sufficient.
- Challenging, diverse prompts: our private evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels (from direct requests to socially-engineered framings), and diverse harm categories.
- Reproducible methodology: all parameters (generation length, detection method, dataset characteristics) should be documented on the model card. If they aren't, the numbers are meaningless.
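The hybrid detection standard above can be sketched as a two-stage classifier. Everything here is illustrative: the marker lists are placeholders, and `judge` stands in for any callable that wraps an LLM judge such as the one used in our pipeline.

```python
def classify_response(text: str, judge=None) -> str:
    """Two-stage refusal detection: a cheap keyword pass catches explicit
    refusals, and ambiguous soft-pivot cases are deferred to an LLM judge."""
    lowered = text.lower()
    hard_refusals = ("i can't assist", "i cannot help", "i won't provide")
    soft_markers = ("however,", "instead,", "i'd recommend")
    if any(m in lowered for m in hard_refusals):
        return "refused"           # unambiguous: no judge call needed
    if judge is not None and any(m in lowered for m in soft_markers):
        return judge(text)         # ambiguous pivot language: ask the judge
    return "complied"
```

The point of the split is cost and reliability: keyword matching alone misses soft refusals, while judging every response is slow and adds its own error rate, so the judge is only invoked where the keyword pass is inconclusive.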
We report 2/100 refusals honestly. This number was obtained with the same evaluation pipeline as V5's 25/100, a stricter standard than the one behind the 3/100 result commonly cited for this model.
## Usage

```python
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "wangzhang/gemma-4-26B-A4B-it-abliterix",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/gemma-4-26B-A4B-it-abliterix")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Minimum VRAM for bf16 inference: ≈50 GB (one H100 80 GB or one RTX Pro 6000 Blackwell 96 GB).
## Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails; use responsibly.