GLM-5.1 — 25% Expert Pruned (REAP) — W4A16

This is a GPTQ 4-bit weight-quantized variant of zai-org/GLM-5.1 with 25% of routed experts pruned via REAP (Router-weighted Expert Activation Pruning), produced with AutoRound for learned rounding optimization.

| Property | Value |
|---|---|
| Base model | zai-org/GLM-5.1 (744B MoE, 256 experts/layer) |
| Architecture | GlmMoeDsaForCausalLM (MoE + Dynamic Sparse Attention) |
| Routed experts | 256 → 192 (25% pruned; 64 removed per layer) |
| Active params/token | ~14B (top-8 routing preserved) |
| Quantization | GPTQ W4A16 (int4 symmetric, group_size=128) |
| Quantizer | auto-round 0.12.2 (200 iterations, SignSGD) |
| Quantized size | 277 GB (56 safetensors shards) |
| BF16 source | 0xSero/GLM-5.1-555B-A14B-REAP |
| GGUF variant | 0xSero/GLM-5.1-555B-A14B-REAP-GGUF (325 GB, Q4_K_M) |

Benchmark Results (GGUF Q4_K_M, inference mode, temp=0.8)

The GPTQ W4A16 checkpoint uses the same learned-rounding method (AutoRound) as the GGUF Q4_K_M. The scores below are from the GGUF variant, which exhibited zero repetition loops:

| Suite | Metric | Result | Repetition loops |
|---|---|---|---|
| Terminal-Bench (50) | Proxy pass | 44/50 (88%) | 0/50 |
| SWE-bench Pro (50) | Proxy pass | 33/50 (66%) | 0/50 |
| GSM8K (50) | Correct | 30/50 (60%) | 0/50 |
| HLE (50) | Correct | 9/50 (18%) | 0/50 |

Zero repetition loops occurred across 220 benchmark probes. The 25% prune retains 192 of the 256 experts per layer, leaving enough expert diversity for stable generation at all sequence lengths.

How to Use

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    tensor_parallel_size=4,    # 4× B200 or 8× A100
    max_model_len=8192,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.8, max_tokens=4096)
outputs = llm.generate(["Hello, world!"], params)

SGLang

python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 \
  --tp 4 \
  --trust-remote-code
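Once the server is up, SGLang exposes an OpenAI-compatible endpoint (default port 30000 unless `--port` is set). A minimal client sketch — the message and sampling values are placeholders, and the request is only constructed here, not sent:

```python
import json
import urllib.request

# Assumed endpoint: sglang's default port is 30000.
ENDPOINT = "http://localhost:30000/v1/chat/completions"

# OpenAI-style chat payload; the model field mirrors the served model path.
payload = {
    "model": "0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16",
    "messages": [{"role": "user", "content": "Hello, world!"}],
    "temperature": 0.8,
    "max_tokens": 4096,
}

def chat(endpoint: str = ENDPOINT) -> dict:
    """POST the payload and return the decoded JSON response (requires a live server)."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# chat()  # uncomment once the server is running
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at the same base URL) works equally well.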

Requirements

  • ~70-80 GiB VRAM per GPU across 4 GPUs (B200), or ~280 GiB total
  • CUDA 12.8+ (sm_100a / Blackwell)
  • vLLM >= 0.19.0 with deep_gemm installed (for DSA sparse attention)
  • trust_remote_code=True
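As a back-of-the-envelope check on the VRAM figures above (the ~10% overhead factor for KV cache, activations, and CUDA context is an assumption, not a measured value):

```python
# Rough per-GPU memory estimate for the 277 GB checkpoint under tensor parallelism.
checkpoint_gb = 277          # quantized weights on disk (from the table above)
num_gpus = 4                 # tensor_parallel_size=4

weights_per_gpu = checkpoint_gb / num_gpus       # ≈ 69 GB of weights per GPU
overhead = 1.10                                  # assumed ~10% for KV cache, activations, context
budget_per_gpu = weights_per_gpu * overhead      # ≈ 76 GB

print(f"~{weights_per_gpu:.0f} GB weights + overhead ≈ {budget_per_gpu:.0f} GB per GPU")
```

This lands inside the ~70-80 GiB per-GPU range quoted above; longer `max_model_len` values push the KV-cache share higher.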

Quantization Details

Method: AutoRound W4A16 — learned rounding via SignSGD (200 iterations per layer), calibrated on 128 samples from NeelNanda/pile-10k at 2048 sequence length.
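The recipe above might be reproduced roughly as follows. This is a hypothetical sketch, not the exact script used: the `AutoRound` keyword names follow the auto-round library's public API, the config values mirror this card, and actually running `quantize()` requires the full BF16 checkpoint (~1.1 TB) plus auto-round and transformers installed.

```python
# Quantization settings taken from this card (GPTQ W4A16 via AutoRound).
QUANT_CONFIG = dict(
    bits=4, group_size=128, sym=True,   # int4 symmetric, group_size=128
    iters=200,                          # SignSGD iterations per layer
    nsamples=128, seqlen=2048,          # calibration set size and sequence length
    dataset="NeelNanda/pile-10k",       # calibration corpus
)

def quantize(model_path: str, out_dir: str) -> None:
    """Illustrative AutoRound run; imports deferred since they are not stdlib."""
    from auto_round import AutoRound
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    ar = AutoRound(model, tokenizer, **QUANT_CONFIG)
    ar.quantize()
    ar.save_quantized(out_dir, format="auto_gptq")  # GPTQ-format checkpoint
```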

Protected (kept at full precision):

  • Dense MLP layers 0-2 (gate_proj, up_proj, down_proj)
  • DSA indexer (weights_proj)
  • lm_head

Quantized to int4 (43,971/44,059 linear layers):

  • All attention projections (q_a_proj, q_b_proj, kv_a_proj, kv_b_proj, o_proj)
  • All routed MoE expert projections (192 experts × gate/up/down × 75 MoE layers)
  • Shared expert projections

GPTQ config: bits=4, group_size=128, sym=true, desc_act=false
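The scheme in that config — symmetric group-wise int4 — can be illustrated with a toy sketch. Note this shows only the baseline round-to-nearest grid; AutoRound's contribution is learning the per-weight rounding direction on top of it. Pure Python for clarity; real kernels operate on packed tensors with group_size=128.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of weights to signed ints sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1                              # int4 symmetric: clamp to [-8, 7]
    scale = max(abs(w) for w in weights) / qmax or 1.0      # one scale per group
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

group = [0.5, -1.2, 0.03, 0.9]           # toy group (real group_size is 128)
q, scale = quantize_group(group)
recon = dequantize_group(q, scale)
err = max(abs(a - b) for a, b in zip(group, recon))   # bounded by scale / 2
```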

Why GPTQ over GGUF Q4_K_M?

| | GPTQ W4A16 (this) | GGUF Q4_K_M |
|---|---|---|
| Size | 277 GB | 325 GB |
| Serving | vLLM, SGLang, TGI (GPU) | llama.cpp (CPU/GPU hybrid) |
| Quant method | Learned rounding (SignSGD) | K-means clustering |
| Throughput | Higher (GPU-native kernels) | Lower |
| Best for | Production GPU serving | Local inference, edge |

GPTQ packs 4-bit weights more efficiently with group_size=128 symmetric quantization, resulting in a smaller checkpoint than GGUF Q4_K_M at the same bit-width.

Related Models

| Model | Prune % | Experts | Format | Size | Status |
|---|---|---|---|---|---|
| 0xSero/GLM-5.1-555B-A14B-REAP | 25% | 192/256 | BF16 | 1.1T | Source checkpoint |
| 0xSero/GLM-5.1-555B-A14B-REAP-GGUF | 25% | 192/256 | GGUF Q4_K_M | 325G | llama.cpp serving |
| This model | 25% | 192/256 | GPTQ W4A16 | 277G | vLLM/SGLang serving |
| 0xSero/GLM-5.1-444B-A14B-REAP | 40% | 154/256 | BF16 | 910G | Has repetition issues; use 25% |

Support This Work

If you find these models useful, please consider supporting continued open-source model compression research:

donate.sybilsolutions.ai

Citation

If you use this model, please cite the REAP paper and AutoRound.

Sponsors

Thanks to our kind sponsors; this work would not be possible without them:

  • Nvidia
  • TNG Technology
  • Lambda
  • Prime Intellect
  • HotAisle

GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (arXiv:2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.

Quick picker

| You have | Use |
|---|---|
| 8× H100/H200 80GB (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path) |
| 4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx); this is the Blackwell Workstation reference config |
| 4× B200 180GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4 |
| 8× B200 / Blackwell datacenter | GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends) |
| 8× A100 80GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16 |
| CPU / Apple Silicon / consumer GPU with llama.cpp | GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF |

Full family

| Variant | Format | Size | Experts/layer | Activated/token | Min VRAM (TP) | Inference engine | Best on |
|---|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14B | 8× 141 GB (H200) | sglang / vllm | Hopper |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14B | 8× 114 GB | sglang / vllm | Ampere / Hopper |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 (4-bit) | ~320 GB | 192 | ~14B | 4× 80 GB (B200), 8× 48 GB | sglang --quantization modelopt_fp4 | Blackwell (native); Hopper (triton path) |
| GLM-5.1-478B-A42B-REAP-NVFP4 | NVFP4 (4-bit) | ~285 GB | 160 | ~42B | 4× 80 GB Blackwell | sglang --quantization modelopt_fp4 | 4× RTX PRO 6000 Blackwell @ 200k ctx |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14B | 4× 80 GB | vllm / sglang --quantization gptq_marlin | Hopper (best), works on Ampere |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |

Notes

  • NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
  • NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + b12x or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
  • NVFP4 on Blackwell Workstation (sm_120): use --attention-backend triton (not flashinfer — PCIe P2P atomics unavailable on the consumer board), --moe-runner-backend cutlass, --fp4-gemm-backend flashinfer_cudnn. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
  • GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and fp8 scales per 16-element block; GPTQ is group-quantized int4 with broader engine support.
  • REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
  • Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4×96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.
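The sm_120 flags from the notes above combine into a single launch command. This is a sketch, not a pinned recipe: model path, tp, and port are placeholders, and the exact software pins live in the 478B-A42B NVFP4 card.

```shell
# Hypothetical consolidated sm_120 (Blackwell Workstation) launch,
# combining the flags listed in the notes above.
python -m sglang.launch_server \
  --model-path 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4 \
  --tp 4 \
  --quantization modelopt_fp4 \
  --attention-backend triton \
  --moe-runner-backend cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --trust-remote-code
```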

Pointer to active inference recipe

See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.

Citation

@misc{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year={2025},
  eprint={2510.13999},
  archivePrefix={arXiv},
}