BROKEN - Do Not Use
This model has repetition degeneration. Use the 25% pruned version instead.
Use this instead: 0xSero/GLM-5.1-555B-A14B-REAP-GGUF
What is wrong with this model?
This is a Q4_K_M GGUF of the 40% expert-pruned GLM-5.1 (154/256 experts retained). It suffers from repetition degeneration - the model enters infinite loops when generating code, structured output, or any long-form content requiring syntactic templates.
Measured degeneration rates:
- 29% overall (13/45 probes degenerate in fuzz testing)
- 40% of code generation tasks loop (red-black trees, chess engines, regex, B-trees)
- 75% of structured output tasks loop (comparison tables, API specs, enum lists)
- 18% of Terminal-Bench probes loop (9/50)
- 30% of SWE-bench Pro probes loop (12/40)
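A loop of this kind can be flagged automatically. The following is a minimal sketch of one possible heuristic, not the harness that produced the numbers above: the function names, the 8-word window, and the repeat threshold are all illustrative assumptions.

```python
from collections import Counter


def looks_degenerate(text: str, n: int = 8, max_repeats: int = 4) -> bool:
    """Return True if any n-word window occurs more than max_repeats times.

    Hypothetical heuristic: n and max_repeats are illustrative defaults,
    not values taken from the fuzz-testing setup described in this card.
    """
    words = text.split()
    if len(words) < n:
        return False
    # Count every sliding window of n consecutive words.
    windows = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(windows.values()) > max_repeats


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged by looks_degenerate."""
    if not outputs:
        return 0.0
    return sum(looks_degenerate(o) for o in outputs) / len(outputs)
```

Running `degeneration_rate` over a batch of completions gives a figure comparable in spirit to the percentages listed above, though the exact value depends on the chosen window and threshold.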
Root cause:
Removing 40% of experts (102 per layer) exceeds the model's tolerance for expert pruning. The remaining 154 experts cannot cover the full routing distribution needed for coherent long-form generation. The degeneration compounds over sequence length - short outputs (<512 tokens) work fine, but anything over ~600-1000 words risks entering a repetition loop.
The fix:
The 25% pruned variant (192/256 experts, 555B) showed no repetition loops in benchmark testing while maintaining competitive quality:
- 0/220 benchmark probes had repetition loops
- Terminal-Bench: 88% proxy pass rate
- SWE-Pro: 66% proxy pass rate
Use 0xSero/GLM-5.1-555B-A14B-REAP-GGUF instead.
Sponsors
Thank you to our kind sponsors; this wouldn't be possible without them:
- Nvidia
- TNG Technology
- Lambda
- Prime Intellect
- HotAisle
GLM-5.1 REAP Family: Hardware Compatibility
All variants in this family are REAP-pruned (arXiv:2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.
Quick picker
| You have | Use |
|---|---|
| 8× H100/H200 80GB (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path) |
| 4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx); this is the Blackwell Workstation reference config |
| 4× B200 180GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4 |
| 8× B200 / Blackwell datacenter | GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends) |
| 8× A100 80GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16 |
| CPU / Apple Silicon / consumer GPU with llama.cpp | GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF |
Full family
| Variant | Format | Size | Experts/layer | Activated/token | Min VRAM (TP) | Inference engine | Best on |
|---|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14B | 8× 141 GB (H200) | sglang / vllm | Hopper |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14B | 8× 114 GB | sglang / vllm | Ampere / Hopper |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 (4-bit) | ~320 GB | 192 | ~14B | 4× 80 GB (B200), 8× 48 GB | sglang --quantization modelopt_fp4 | Blackwell (native); Hopper (triton path) |
| GLM-5.1-478B-A42B-REAP-NVFP4 | NVFP4 (4-bit) | ~285 GB | 160 | ~42B | 4× 80 GB Blackwell | sglang --quantization modelopt_fp4 | 4× RTX PRO 6000 Blackwell @ 200k ctx |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14B | 4× 80 GB | vllm / sglang --quantization gptq_marlin | Hopper (best), works on Ampere |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2-Q8) | ~348 GB | 192 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2-Q8) | ~283 GB | 154 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
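For the BF16 rows, the Min VRAM column appears consistent with dividing the checkpoint size evenly across tensor-parallel ranks. A minimal sketch of that arithmetic follows, assuming an even split of weights and ignoring KV cache, activations, and runtime overhead, so real requirements will be higher. The function names and the `headroom_gb` parameter are hypothetical.

```python
import math

# Approximate checkpoint sizes in GB, copied from the Size column above.
SIZES_GB = {
    "GLM-5.1-555B-A14B-REAP": 1125,
    "GLM-5.1-444B-A14B-REAP": 910,
    "GLM-5.1-555B-A14B-REAP-NVFP4": 320,
    "GLM-5.1-478B-A42B-REAP-NVFP4": 285,
    "GLM-5.1-555B-A14B-REAP-GPTQ-W4A16": 297,
}


def weights_per_gpu_gb(size_gb: float, tp: int) -> int:
    """Weights-only footprint per GPU for an even tensor-parallel split."""
    return math.ceil(size_gb / tp)


def fits(size_gb: float, tp: int, vram_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the per-GPU weight shard plus headroom fits in vram_gb.

    headroom_gb is a hypothetical allowance for KV cache and overhead.
    """
    return size_gb / tp + headroom_gb <= vram_gb


if __name__ == "__main__":
    for name, size in SIZES_GB.items():
        print(f"{name}: {weights_per_gpu_gb(size, 8)} GB per GPU at TP=8")
```

Setting `headroom_gb` to a nonzero value gives a more conservative check for long-context serving, where KV cache can take a large share of memory.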
Notes
- NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
- NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + b12x or flashinfer MoE backends; this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
- NVFP4 on Blackwell Workstation (sm_120): use --attention-backend triton (not flashinfer; PCIe P2P atomics unavailable on the consumer board), --moe-runner-backend cutlass, --fp4-gemm-backend flashinfer_cudnn. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
- GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
- REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
- Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4× 96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.
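The Blackwell Workstation flags from the notes above can be assembled into a launch command as follows. This is a sketch only: the model path is a hypothetical local directory, `--tp 4` is assumed from the 4-GPU target, and the pinned versions, patches, and remaining flags live in the GLM-5.1-478B-A42B-REAP-NVFP4 card, which should be treated as authoritative.

```python
import shlex


def build_sglang_command(model_path: str, tp: int) -> list[str]:
    """Assemble an sglang launch command for NVFP4 on Blackwell Workstation.

    Backend flags are the ones listed in the notes; model_path and tp are
    caller-supplied assumptions, not values taken from this card.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--quantization", "modelopt_fp4",
        "--attention-backend", "triton",
        "--moe-runner-backend", "cutlass",
        "--fp4-gemm-backend", "flashinfer_cudnn",
    ]


# Hypothetical local path; replace with wherever the weights are stored.
cmd = build_sglang_command("/models/GLM-5.1-478B-A42B-REAP-NVFP4", tp=4)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, which makes it easy to swap the backend flags for other hardware before passing the list to `subprocess.run`.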
Pointer to active inference recipe
See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.
Citation
@misc{lasby2025reap,
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year={2025},
eprint={2510.13999},
archivePrefix={arXiv},
}
Model tree for 0xSero/GLM-5.1-444B-A14B-REAP-GGUF
Base model
zai-org/GLM-5.1