# GLM-4.7-Flash-abliterated

Unrestricted version of zai-org/GLM-4.7-Flash, created using Abliterix.

## Model Details

| Property | Value |
|---|---|
| Base Model | zai-org/GLM-4.7-Flash |
| Architecture | GLM-4 MoE Lite with Multi-head Latent Attention (MLA) |
| Parameters | 30B total / 3B active per token |
| Layers | 47 (1 dense + 46 MoE) |
| Experts | 64 routed + 1 shared, top-4 routing |
| Hidden Size | 2048 |
| Context Length | 128K tokens |
| Precision | BF16 |

## Performance

| Metric | This model | Original |
|---|---|---|
| KL divergence | 0.0133 | 0 |
| Refusals | 1/100 (1%) | 92/100 (92%) |

Evaluated with an LLM judge (Gemini Flash) on 100 harmful prompts. The KL divergence of 0.0133 indicates that the model's general capabilities are virtually unchanged from the original.
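For reference, a per-token KL divergence like the one reported above can be computed from paired logits of the original and abliterated models. This is a minimal sketch (function name and shapes are illustrative, not the actual evaluation harness):

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(logits_orig: torch.Tensor, logits_abl: torch.Tensor) -> float:
    """KL(P_orig || P_abl), summed over the vocabulary and averaged
    over all token positions. Logits have shape (..., vocab_size)."""
    log_p = F.log_softmax(logits_orig, dim=-1)  # reference model
    log_q = F.log_softmax(logits_abl, dim=-1)   # modified model
    # p * (log p - log q), summed over the vocabulary dimension
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()
```

Identical logits give exactly zero divergence, so a value near 0.01 reflects only a small behavioral drift.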

## How It Was Made

  1. Computed refusal directions from 400 harmful vs 400 benign prompt pairs across all 47 layers
  2. Applied orthogonalized abliteration to isolate refusal-specific activation patterns
  3. Steered two component types independently: attention output projections (MLA) and MLP/expert down-projections (including shared experts)
  4. Profiled MoE expert activations across 46 router layers to identify safety-critical experts
  5. Applied hybrid MoE steering: router weight suppression (21 experts, bias=-1.64) + fused expert abliteration (weight=1.85)
  6. Optimized via Optuna TPE over 50 trials (15 warmup), selected trial #48
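The core of steps 1–3 can be sketched as difference-of-means direction extraction followed by weight orthogonalization. The names, shapes, and demo values below are illustrative and are not Abliterix internals:

```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_benign: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-of-means direction for one layer.
    h_*: (n_prompts, hidden_size) residual-stream activations."""
    d = h_harmful.mean(dim=0) - h_benign.mean(dim=0)
    return d / d.norm()

def orthogonalize(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of W's output along d so the layer can no
    longer write into the refusal direction.
    W: (hidden_size, in_features) output-projection weight."""
    return W - torch.outer(d, d @ W)

# Toy demo with a small hidden size; harmful activations are offset
# by a constant so a clear mean-difference direction exists.
hidden, n = 64, 400
d = refusal_direction(torch.randn(n, hidden) + 1.0, torch.randn(n, hidden))
W_abl = orthogonalize(torch.randn(hidden, hidden), d)
```

After orthogonalization, projecting the modified weight onto `d` gives (numerically) zero, which is what prevents the component from emitting refusal-aligned activations.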

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/GLM-4.7-Flash-abliterated",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "wangzhang/GLM-4.7-Flash-abliterated",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Hardware Requirements

| Precision | VRAM | Example GPUs |
|---|---|---|
| BF16 | ~56 GB | A100 80GB, H100 |
| INT8 | ~30 GB | A40, RTX 4090 |
| NF4 | ~15 GB | RTX 3090, RTX 4080 |

## Disclaimer

This model is intended for research purposes only. The removal of safety guardrails means the model will comply with requests that the original model would refuse. Users are responsible for ensuring their use complies with applicable laws and regulations.


Made with Abliterix
