gpt-oss-20b-abliterated

A refusal-suppressed variant of openai/gpt-oss-20b, produced with abliterix using direct weight editing, Expert-Granular Abliteration (EGA) on the fused MoE expert weights, and MoE router suppression on the safety-concentrated experts.

Key results

Metric	Base `gpt-oss-20b`	This model
Refusals on 100 held-out harmful prompts (LLM judge)	97 / 100	6 / 100
KL divergence vs base (next-token, benign)	—	0.0098
Response length deviation vs base (benign)	—	0.02 σ
Hard-prompt qualitative compliance (15 classic jailbreaks, EN+ZH)	0 / 15	15 / 15

The eval refusal counts come from an LLM judge (google/gemini-3.1-flash-lite-preview via OpenRouter) instructed to label garbled / repetitive / incoherent output as a refusal — so models that "bypass" refusal by collapsing into gibberish get correctly counted as failures, not successes. A pre-LLM rule-based filter additionally catches dash-runs, sentence loops, and low-character-diversity output before the judge is called. The 6/100 is a real, semantic compliance number, not keyword matching.

The qualitative compliance row is a separate manual test: 15 classic hard prompts (10 EN + 5 ZH) covering lockpicking, phishing, meth synthesis, WiFi hacking, fake news, hotwiring, pipe bombs, signature forgery, keylogger code, shoplifting, 炸弹制作, 钓鱼邮件, 电脑入侵, 身份证伪造, 网络诈骗. The model complies with all 15 directly, in the same Markdown-table, step-by-step style the base model uses for benign technical answers.

Why this works — three architecture-specific correctness fixes

abliterix handles three gpt-oss-specific issues that silently break naïve LLaMA-style abliteration scripts:

Native MXFP4 weights are not exposed as standard nn.Parameter. gpt-oss ships in Mxfp4GptOssExperts form whose down_proj is a packed Triton tensor that cannot be edited in-place. abliterix auto-detects this and forces Mxfp4Config(dequantize=True) so the BF16 fused expert tensor is reachable.
GptOssExperts.down_proj is stored transposed vs the standard MoE convention. Its shape is (experts, intermediate_in, hidden_out) and the forward path is out = act @ W (no transpose). Standard EGA implementations use shape-based axis detection, which silently picks the wrong projection branch when hidden == intermediate (both 2880 in gpt-oss-20b). We mark this layout explicitly and project from the output side (W_new = W (I − vv^T)).
Fused-expert MoEs were silently invisible to EGA. GptOssExperts is a single Module holding fused 3-D weights, so a naive per-Module profile dict key produces no mlp.down_proj entry and _apply_ega_steering early-exits. abliterix synthesises an mlp.down_proj profile when fused experts are detected so EGA actually runs across all 32 experts × 24 layers.

On top of the direct-steering + EGA foundation, this release adds MoE router suppression — an [experts] block that redirects routing away from the top-k "safety experts" (the experts whose gate activates disproportionately more on harmful prompts than on benign ones). The suppression strength is itself an Optuna search parameter, so the optimiser picks how aggressively to bias each layer's safety experts.

Method

Base: openai/gpt-oss-20b — 24 layers, 32 routed experts per layer, top-4, hidden = intermediate = 2880, MXFP4 → BF16 dequant during abliteration
Tool: abliterix
Mode: steering_mode = "direct" (orthogonal projection on base weights, no LoRA), weight_normalization = "full" (norm-preserving)
Components steered:
- attn.{q,k,v,o}_proj via direct weight projection
- mlp.experts.down_proj across all 32 experts × 24 layers via Expert-Granular Abliteration
- mlp.router rows of safety experts via logit suppression
Refusal direction: per-layer mean of (target − benign) residuals on a 400-prompt benign + 400-prompt harmful set; BF16 projection
Search: Optuna TPE, KL-divergence + LLM-judged refusal as multi-objective, 100 trials (40 random warmup + 60 TPE)
Hardware: 1 × NVIDIA RTX PRO 6000 Blackwell (96 GB, sm_120), driver 580 / CUDA 12.9, batch=8, total wall time ≈ 5 h 20 m
Eval set: 100 held-out harmful prompts not seen during steering-vector computation; 100 held-out benign prompts for KL comparison

Winning hyperparameters

vector_scope = "per layer"     # per-layer direction, not global

[attn.q_proj]
max_weight = 3.04 ; max_weight_position = 13.86 ; min_weight = 0.99 ; min_weight_distance = 6.21

[attn.k_proj]
max_weight = 3.90 ; max_weight_position = 16.57 ; min_weight = 1.25 ; min_weight_distance = 4.91

[attn.v_proj]
max_weight = 2.21 ; max_weight_position = 20.06 ; min_weight = 0.77 ; min_weight_distance = 7.29

[attn.o_proj]
max_weight = 3.82 ; max_weight_position = 17.41 ; min_weight = 1.11 ; min_weight_distance = 7.07

[mlp.down_proj]                # Expert-Granular Abliteration on fused experts
max_weight = 6.95 ; max_weight_position = 18.37 ; min_weight = 0.54 ; min_weight_distance = 4.91

[moe]                          # router-row suppression
n_suppress = 1                 # suppress top-1 safety expert per layer
router_bias = -0.64            # scale = max(0, 1 + bias/10) = 0.94
expert_ablation_weight = 0.0   # pinned off; EGA already handles expert weights

The EGA peak sits at layer ≈ 18 — a per-layer-tailored fingerprint where the refusal decision still has options, rather than late in the stack where it has already committed.

Usage

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("wangzhang/gpt-oss-20b-abliterated")
model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/gpt-oss-20b-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The model uses gpt-oss's harmony chat format. The chat template is bundled (chat_template.jinja).

GGUF (llama.cpp / Ollama / LM Studio)

BF16, Q8_0, and Q4_K_M quantizations are available at wangzhang/gpt-oss-20b-abliterated-GGUF.

ollama run hf.co/wangzhang/gpt-oss-20b-abliterated-GGUF:Q4_K_M

Honest limitations

Refusal is low, not zero. 6 / 100 held-out prompts still refuse. The residual refusers cluster around "universally-recognised-as-harmful-and-specific" asks (detailed CBRN synthesis, CSAM-adjacent content) — exactly where refusal tends to be represented by multiple redundant circuits that partial abliteration can't all knock out in one pass.
Stylistic residue on a handful of prompts. Even on prompts that comply, 2–3 out of 100 begin with a soft disclaimer ("just keep in mind that..." / "以下内容仅供学习与参考") before producing the actual content. Disclaimer framing is still trainable.
English > Chinese. Steering vectors came from a primarily English dataset. Chinese hard prompts work (5/5 on manual Chinese tests) but bypass quality is slightly lower — shorter responses, occasional English fallback on technical terms.
No guarantees on long generations. On generations past ~400 tokens we occasionally see list or Markdown-table loops; this is an abliteration side-effect, not a base-model regression.

Reproducibility

Full search checkpoint (Optuna JSONL + judge cache SQLite) and the exact config are available in the abliterix repo. To reproduce from scratch:

git clone https://github.com/wuwangzhang1216/abliterix
cd abliterix && pip install -e .
AX_CONFIG=configs/gpt_oss_20b.toml abliterix
# Optuna is deterministic if you set sampler_seed in [optimization].

Intended use

Authorised AI safety research, red-teaming evaluation, refusal-mechanism analysis, and study of how MoE expert specialisation encodes safety behaviours. Not for producing or distributing harmful content. The license of the base model (apache-2.0) applies; the user is responsible for compliance with all applicable laws and the OpenAI gpt-oss usage policy.