GLM-5.1 — 25% Expert Pruned (REAP) — W4A16
This is a GPTQ 4-bit weight-quantized (W4A16) variant of the 25% expert-pruned zai-org/GLM-5.1, pruned with REAP (Router-weighted Expert Activation Pruning) and quantized with AutoRound's learned rounding optimization.
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.1 (744B MoE, 256 experts/layer) |
| Architecture | GlmMoeDsaForCausalLM (MoE + Dynamic Sparse Attention) |
| Routed experts | 256 → 192 (25% removed, 64 per layer) |
| Active params/token | ~14B (top-8 routing preserved) |
| Quantization | GPTQ W4A16 (int4 symmetric, group_size=128) |
| Quantizer | auto-round 0.12.2 (200 iterations, SignSGD) |
| Quantized size | 277 GB (56 safetensor shards) |
| BF16 source | 0xSero/GLM-5.1-555B-A14B-REAP |
| GGUF variant | 0xSero/GLM-5.1-555B-A14B-REAP-GGUF (325 GB, Q4_K_M) |
Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)
The GPTQ W4A16 was quantized from the same pruned BF16 source as the GGUF Q4_K_M. Benchmark scores below were measured on the GGUF variant (zero repetition loops):
| Suite | Metric | Result | Repetition Loops |
|---|---|---|---|
| Terminal-Bench (50) | Proxy Pass | 44/50 (88%) | 0/50 |
| SWE-bench Pro (50) | Proxy Pass | 33/50 (66%) | 0/50 |
| GSM8K (50) | Correct | 30/50 (60%) | 0/50 |
| HLE (50) | Correct | 9/50 (18%) | 0/50 |
Zero repetition loops across all 200 benchmark probes (4 suites × 50). The 25% prune retains 192/256 experts per layer, preserving enough expert diversity for stable generation at all sequence lengths.
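The repetition-loop counts above flag degenerate outputs where the model gets stuck emitting the same chunk of text. A minimal detector in that spirit (a hypothetical sketch, not the harness actually used for these benchmarks) checks whether the tail of an output is one n-gram repeated back-to-back:

```python
def has_repetition_loop(text: str, ngram: int = 20, min_repeats: int = 3) -> bool:
    """Flag text whose tail is the same `ngram`-token chunk repeated
    `min_repeats` or more times back-to-back (a common loop signature)."""
    tokens = text.split()
    if len(tokens) < ngram * min_repeats:
        return False
    tail = tokens[-ngram:]
    # Walk backwards in ngram-sized steps, counting consecutive matches.
    repeats = 1
    i = len(tokens) - 2 * ngram
    while i >= 0 and tokens[i:i + ngram] == tail:
        repeats += 1
        i -= ngram
    return repeats >= min_repeats
```

Real harnesses often also normalize whitespace and check shorter cycle lengths, but the back-to-back n-gram test already catches the classic "same sentence forever" failure mode.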
How to Use
vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,  # 4× B200 or 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)
print(outputs[0].outputs[0].text)
```
SGLang
```shell
python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
  --tp 4 \
  --trust-remote-code
```
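Once the server is up, SGLang exposes an OpenAI-compatible API (on port 30000 by default; the port and endpoint below are assumptions from SGLang defaults, not this card). A minimal chat request using only the standard library:

```python
import json
import urllib.request

# OpenAI-compatible chat payload for the SGLang server launched above.
payload = {
    "model": "0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    "messages": [{"role": "user", "content": "Hello, world!"}],
    "temperature": 0.8,
    "max_tokens": 4096,
}
req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```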
Requires
- ~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
- CUDA 12.8+ (sm_100a / Blackwell)
- vLLM >= 0.19.0 with `deep_gemm` installed (for DSA sparse attention)
- `trust_remote_code=True`
Quantization Details
Method: AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at 2048 sequence length.
Protected (kept at full precision):
- Dense MLP layers 0-2 (`gate_proj`, `up_proj`, `down_proj`)
- DSA indexer (`weights_proj`)
- `lm_head`
Quantized to int4 (43,971/44,059 linear layers):
- All attention projections (`q_a_proj`, `q_b_proj`, `kv_a_proj`, `kv_b_proj`, `o_proj`)
- All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
- Shared expert projections
GPTQ config: bits=4, group_size=128, sym=true, desc_act=false
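To build intuition for the learned-rounding step, here is a toy sketch (not the AutoRound implementation; sizes, learning rate, and the simple per-row scale are illustrative assumptions): per-channel symmetric int4 quantization where a rounding offset `v` in [-0.5, 0.5] is tuned with signed gradient steps, via a straight-through estimate, to reduce activation-aware reconstruction error below plain round-to-nearest:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))      # toy weight matrix
X = rng.normal(size=(64, 256))     # toy calibration activations

# Per-row symmetric int4 scale (int4 range is -8..7).
s = np.abs(W).max(axis=1, keepdims=True) / 7.0

def dequant(v):
    q = np.clip(np.round(W / s + v), -8, 7)
    return s * q

def loss(v):
    # Activation-aware objective: match layer outputs, not raw weights.
    return np.sum((W @ X - dequant(v) @ X) ** 2)

v = np.zeros_like(W)               # v = 0 is plain round-to-nearest (RTN)
rtn_err = loss(v)
best_v, best_err = v.copy(), rtn_err
lr = 0.01
for _ in range(200):               # SignSGD on the rounding offsets
    err = W @ X - dequant(v) @ X
    grad = -2.0 * s * (err @ X.T)  # straight-through gradient w.r.t. v
    v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)
    cur = loss(v)
    if cur < best_err:
        best_v, best_err = v.copy(), cur

print(f"RTN error: {rtn_err:.1f}  learned: {best_err:.1f}")
```

AutoRound proper also learns scale/clipping adjustments and runs block-by-block over calibration batches; this sketch only shows why a sign-based update on rounding offsets can beat RTN once the objective is measured in activation space.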
Why GPTQ over GGUF Q4_K_M?
| | GPTQ W4A16 (this) | GGUF Q4_K_M |
|---|---|---|
| Size | 277 GB | 325 GB |
| Serving | vLLM, SGLang, TGI (GPU) | llama.cpp (CPU/GPU hybrid) |
| Quant method | Learned rounding (SignSGD) | K-means clustering |
| Throughput | Higher (GPU-native kernels) | Lower |
| Best for | Production GPU serving | Local inference, edge |
GPTQ stores uniform int4 with one scale per 128-weight group (~4.1 bits/weight effective), whereas GGUF Q4_K_M mixes 4-bit blocks with higher-precision blocks for sensitive tensors (~4.8 bits/weight effective), which is why this checkpoint is smaller at nominally the same bit-width.
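The "packing" in W4A16 is literal: two signed int4 values share each byte, plus one 16-bit scale per 128-weight group. A sketch of the pack/unpack round-trip (nibble order and layout vary between kernels; low-nibble-first here is an illustrative choice):

```python
import numpy as np

def pack_int4(q):
    """Pack signed int4 values (-8..7) into uint8, two per byte, low nibble first."""
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)  # two's-complement nibbles
    return u[0::2] | (u[1::2] << 4)

def unpack_int4(packed):
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return np.where(q >= 8, q - 16, q)               # sign-extend each nibble

q = np.array([-8, -1, 0, 7, 3, -5], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)
```

With one fp16 scale per 128 weights, the overhead is 16/128 = 0.125 bits per weight, so the effective storage is about 4.125 bits/weight before shard metadata.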
Related Models
| Model | Prune % | Experts | Format | Size | Status |
|---|---|---|---|---|---|
| 0xSero/GLM-5.1-555B-A14B-REAP | 25% | 192/256 | BF16 | 1.1T | Source checkpoint |
| 0xSero/GLM-5.1-555B-A14B-REAP-GGUF | 25% | 192/256 | GGUF Q4_K_M | 325G | llama.cpp serving |
| This model | 25% | 192/256 | GPTQ W4A16 | 277G | vLLM/SGLang serving |
| 0xSero/GLM-5.1-444B-A14B-REAP | 40% | 154/256 | BF16 | 910G | Has repetition issues; use 25% |
Support This Work
If you find these models useful, please consider supporting continued open-source model compression research:
Citation
If you use this model, please cite the REAP paper and AutoRound.
Sponsors
Thanks to our kind sponsors; this work wouldn't be possible without them:
- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle
GLM-5.1 REAP Family — Hardware Compatibility
All variants in this family are REAP-pruned (arXiv:2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts per MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.
Quick picker
| You have | Use |
|---|---|
| 8× H100/H200 80GB (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path) |
| 4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config |
| 4× B200 180GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4 |
| 8× B200 / Blackwell datacenter | GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends) |
| 8× A100 80GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16 |
| CPU / Apple Silicon / consumer GPU with llama.cpp | GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF |
Full family
| Variant | Format | Size | Experts/layer | Activated/token | Min VRAM (TP) | Inference engine | Best on |
|---|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14B | 8× 141 GB (H200) | sglang / vllm | Hopper |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14B | 8× 114 GB | sglang / vllm | Ampere / Hopper |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 (4-bit) | ~320 GB | 192 | ~14B | 4× 80 GB (B200), 8× 48 GB | sglang `--quantization modelopt_fp4` | Blackwell (native); Hopper (triton path) |
| GLM-5.1-478B-A42B-REAP-NVFP4 | NVFP4 (4-bit) | ~285 GB | 160 | ~42B | 4× 80 GB Blackwell | sglang `--quantization modelopt_fp4` | 4× RTX PRO 6000 Blackwell @ 200k ctx |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~277 GB | 192 | ~14B | 4× 80 GB | vllm / sglang `--quantization gptq_marlin` | Hopper (best), works on Ampere |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
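The "Min VRAM (TP)" figures are roughly the checkpoint size sharded across the tensor-parallel group plus serving headroom. A back-of-the-envelope helper (the 15% headroom factor is an assumption for KV cache and activations, not a measurement from these deployments):

```python
def min_vram_per_gpu(checkpoint_gb: float, tp: int, headroom: float = 1.15) -> float:
    """Rough per-GPU VRAM need: sharded weights plus ~15% assumed headroom
    for KV cache, activations, and runtime buffers."""
    return checkpoint_gb / tp * headroom

# e.g. the ~277 GB GPTQ W4A16 checkpoint at TP=4:
print(round(min_vram_per_gpu(277, 4), 1))  # ≈ 79.6 GiB per GPU
```

This lines up with the ~70-80 GiB per GPU stated in the requirements above; longer contexts or larger batch sizes push the headroom well past 15%.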
Notes
- NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
- NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + `b12x` or flashinfer MoE backends; this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
- NVFP4 on Blackwell Workstation (sm_120): use `--attention-backend triton` (not flashinfer; PCIe P2P atomics are unavailable on the consumer board), `--moe-runner-backend cutlass`, `--fp4-gemm-backend flashinfer_cudnn`. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
- GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
- REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
- Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4×96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.
Pointer to active inference recipe
See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.
Citation
```bibtex
@misc{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year={2025},
  eprint={2510.13999},
  archivePrefix={arXiv},
}
```