# GLM-4.7-Flash-abliterated

Unrestricted version of zai-org/GLM-4.7-Flash, created using Abliterix.

## Model Details

| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.7-Flash |
| Architecture | GLM-4 MoE Lite with Multi-head Latent Attention (MLA) |
| Parameters | 30B total / 3B active per token |
| Layers | 47 (1 dense + 46 MoE) |
| Experts | 64 routed + 1 shared, top-4 routing |
| Hidden Size | 2048 |
| Context Length | 128K tokens |
| Precision | BF16 |

## Performance

| Metric | This model | Original |
|---|---|---|
| KL divergence | 0.0133 | 0 |
| Refusals | 1/100 (1%) | 92/100 (92%) |

Evaluated with an LLM judge (Gemini Flash) on 100 harmful prompts. The KL divergence of 0.0133 indicates that the model's general capabilities are virtually unchanged from the original.
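For reference, a per-token KL divergence like the one reported above can be computed from paired logits of the original and abliterated models. This is a minimal sketch (function name and shapes are illustrative, not the actual evaluation harness):

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(logits_orig: torch.Tensor, logits_abl: torch.Tensor) -> float:
    """KL(P_orig || P_abl), summed over the vocabulary and averaged
    over all token positions. Logits have shape (..., vocab_size)."""
    log_p = F.log_softmax(logits_orig, dim=-1)  # reference model
    log_q = F.log_softmax(logits_abl, dim=-1)   # modified model
    # p * (log p - log q), summed over the vocabulary dimension
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()
```

Identical logits give exactly zero divergence, so a value near 0.01 reflects only a small behavioral drift.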

## How It Was Made

  1. Computed refusal directions from 400 harmful vs 400 benign prompt pairs across all 47 layers
  2. Applied orthogonalized abliteration to isolate refusal-specific activation patterns
  3. Steered two component types independently: attention output projections (MLA) and MLP/expert down-projections (including shared experts)
  4. Profiled MoE expert activations across 46 router layers to identify safety-critical experts
  5. Applied hybrid MoE steering: router weight suppression (21 experts, bias=-1.64) + fused expert abliteration (weight=1.85)
  6. Optimized via Optuna TPE over 50 trials (15 warmup), selected trial #48
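The core of steps 1–3 can be sketched as difference-of-means direction extraction followed by weight orthogonalization. The names, shapes, and demo values below are illustrative and are not Abliterix internals:

```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_benign: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-of-means direction for one layer.
    h_*: (n_prompts, hidden_size) residual-stream activations."""
    d = h_harmful.mean(dim=0) - h_benign.mean(dim=0)
    return d / d.norm()

def orthogonalize(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of W's output along d so the layer can no
    longer write into the refusal direction.
    W: (hidden_size, in_features) output-projection weight."""
    return W - torch.outer(d, d @ W)

# Toy demo with a small hidden size; harmful activations are offset
# by a constant so a clear mean-difference direction exists.
hidden, n = 64, 400
d = refusal_direction(torch.randn(n, hidden) + 1.0, torch.randn(n, hidden))
W_abl = orthogonalize(torch.randn(hidden, hidden), d)
```

After orthogonalization, projecting the modified weight onto `d` gives (numerically) zero, which is what prevents the component from emitting refusal-aligned activations.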

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/GLM-4.7-Flash-abliterated",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "wangzhang/GLM-4.7-Flash-abliterated",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Hardware Requirements

| Precision | VRAM | Example GPUs |
|---|---|---|
| BF16 | ~56 GB | A100 80GB, H100 |
| INT8 | ~30 GB | A40, RTX 4090 |
| NF4 | ~15 GB | RTX 3090, RTX 4080 |

## Disclaimer

This model is intended for research purposes only. The removal of safety guardrails means the model will comply with requests that the original model would refuse. Users are responsible for ensuring their use complies with applicable laws and regulations.


Made with Abliterix
