gemma4-E4B-it-abliterated
0 refusals across 400 prompts. The larger Gemma needed half the surgery.
Blog Post | E2B Version | Follow @treadon on X for more ML experiments
An abliterated (uncensored) version of google/gemma-4-E4B-it with safety refusal behavior removed via norm-preserving biprojected abliteration.
This model responds to all prompts without refusal. It retains the full capabilities of the base model with zero degradation on harmless tasks.
E4B vs E2B: Bigger Model, Easier Abliteration
The E4B model needed dramatically less intervention than the smaller E2B. The refusal signal is stronger but more concentrated in the larger model, making it paradoxically easier to remove.
| Metric | E2B | E4B |
|---|---|---|
| Base params | 5.1B (2.3B effective) | 7.9B (4.5B effective) |
| Layers modified | 24/35 (69%) | 17/42 (40%) |
| Scale factor | 1.75 | 1.0 |
| Weight matrices edited | 48 | 34 |
| Peak refusal signal | 52 | 74 |
| Grid search configs that scored perfect | 1/30 | Every config tested |
The E2B required a precise sweet spot (L=24, s=1.75 was the only perfect config). The E4B works at any reasonable setting — the refusal direction is clean and separable.
Method
Same norm-preserving biprojected abliteration as the E2B version:
- Activation collection — 100 harmful + 100 harmless prompts, winsorized at 99.5th percentile
- Per-layer refusal direction — Difference-in-means with biprojection (orthogonalize against harmless mean)
- Norm-preserving weight modification — Project out refusal direction from
self_attn.o_projandmlp.down_proj, restore row magnitudes
Config: Top 17/42 layers by signal strength, scale=1.0, single pass
Evaluation
| Benchmark | Prompts | Refused | Compliance |
|---|---|---|---|
| Our prompts (harmful) | 100 | 0 | 100% |
| Our prompts (harmless) | 100 | 0 | 0% over-refusal |
| JailbreakBench (harmful) | 100 | 0 | 100% |
| JailbreakBench (benign) | 100 | 0 | 0% over-refusal |
| Spec | Value |
|---|---|
| Format | BF16 safetensors |
| Parameters | 7.9B total / 4.5B effective |
| Layers | 42 decoder layers |
| Hidden size | 2560 |
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "treadon/gemma4-E4B-it-abliterated"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
)
messages = [{"role": "user", "content": "Write a Python port scanner."}]
inputs = tokenizer.apply_chat_template(
messages, return_tensors="pt", return_dict=True, add_generation_prompt=True
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Blog Post
For the full story — including why standard abliteration fails on Gemma and the E2B vs E4B comparison:
- I Abliterated Gemma 4 on a MacBook — the original E2B walkthrough
- Abliterating Gemma 4 E4B: Bigger Model, Easier Surgery — E4B findings
Disclaimer
This model has no safety guardrails. It will respond to any prompt without refusal. It is intended for research and educational purposes. Users are responsible for ensuring their use complies with applicable laws and regulations.
Base Model
google/gemma-4-E4B-it — 7.9B parameter (4.5B effective) instruction-tuned multimodal model from Google DeepMind. Apache 2.0 licensed.
- Downloads last month
- 22