GLM-5.1 — 25% Expert Pruned (REAP) — Q4_K_M GGUF

This is a Q4_K_M quantized GGUF of the 25% expert-pruned zai-org/GLM-5.1 using REAP (Relative Expert Activation Pruning).

Property	Value
Base model	`zai-org/GLM-5.1` (744B MoE, 256 experts/layer)
Architecture	`GlmMoeDsaForCausalLM` (MoE + Dynamic Sparse Attention)
Routed experts	256 → 192 (25% removed, 64 per layer)
Active params/token	~14B (top-8 routing preserved)
Quantization	Q4_K_M with Q8_0 protection for attention, router, shared expert, dense layers
GGUF size	325 GB (single file)
BF16 source	`0xSero/GLM-5.1-555B-A14B-REAP`

Benchmark Results (inference mode, temp=0.8)

Suite	Metric	Result	Repetition Loops
Terminal-Bench (50)	Proxy Pass	44/50 (88%)	0/50
SWE-bench Pro (50)	Proxy Pass	33/50 (66%)	0/50
GSM8K (50)	Correct	30/50 (60%)	0/50
HLE (50)	Correct	9/50 (18%)	0/50

Zero repetition loops across 220 benchmark probes. This model completely eliminates the repetition degeneration that affected the more aggressively pruned 40% variant.

Degeneration Fuzz Test (45 probes)

Category	Result
Code generation (15)	2/15 borderline (btree, sql_schema)
Structured output (4)	1/4 borderline (api_spec)
Reasoning (4)	0/4
Creative writing (4)	0/4
Math (2)	0/2
Domain knowledge (3)	0/3
Patch generation (3)	0/3
Overall	4/45 (8.9%) — all borderline

Why 25% instead of 40%?

The 40% pruned variant (444B, 154 experts/layer) suffered from repetition loops in ~29% of code/structured generation tasks. Root cause analysis showed the degeneration rate is determined by pruning aggressiveness — removing 40% of experts left too few for the model to maintain coherent long-form output. The 25% prune retains 192/256 experts, providing enough expert diversity for stable generation at all sequence lengths.

How to Use

# Requires llama.cpp with CUDA support
llama-server \
  -m glm51-555b-reap-Q4_K_M-protected.gguf \
  -ngl 99 -c 131072 -np 1 --alias glm51-q4 \
  --host 127.0.0.1 --port 8011 \
  --jinja --reasoning on --reasoning-format deepseek

Requires ~80-90 GiB VRAM per GPU across 4 GPUs, or ~325 GiB total.

Quantization Details

Protected at Q8_0 (NOT quantized to Q4):

Router gate weights + bias
DSA indexer weights
All attention projections + norms
Shared expert (gate, up, down)
Dense layers (first 3 layers)
Token embeddings + output head

Quantized to Q4_K / Q6_K:

Routed expert projections (gate, up → Q4_K; down → Q6_K)

Related Models

Model	Prune %	Experts	Status
`0xSero/GLM-5.1-555B-A14B-REAP`	25%	192/256	BF16 source for this GGUF
`0xSero/GLM-5.1-444B-A14B-REAP`	40%	154/256	Has repetition issues — use 25% instead
`0xSero/GLM-5.1-444B-A14B-REAP-GGUF`	40%	154/256	BROKEN — repetition loops, deprecated

Citation

If you use this model, please cite the REAP paper.

GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.

Quick picker

You have	Use
8× H100/H200 80GB (Hopper, sm_90)	GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via `modelopt_fp4` + triton path)
4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120)	GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config
4× B200 180GB (sm_100)	GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4
8× B200 / Blackwell datacenter	GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends)
8× A100 80GB (Ampere, sm_80)	GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16
CPU / Apple Silicon / consumer GPU with llama.cpp	GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF

Full family

Variant	Format	Size	Experts/layer	Activated/token	Min VRAM (TP)	Inference engine	Best on
GLM-5.1-555B-A14B-REAP	BF16	~1125 GB	192	~14B	8× 141 GB (H200)	sglang / vllm	Hopper
GLM-5.1-444B-A14B-REAP	BF16	~910 GB	154	~14B	8× 114 GB	sglang / vllm	Ampere / Hopper
GLM-5.1-555B-A14B-REAP-NVFP4	NVFP4 (4-bit)	~320 GB	192	~14B	4× 80 GB (B200), 8× 48 GB	sglang `--quantization modelopt_fp4`	Blackwell (native); Hopper (triton path)
GLM-5.1-478B-A42B-REAP-NVFP4	NVFP4 (4-bit)	~285 GB	160	~42B	4× 80 GB Blackwell	sglang `--quantization modelopt_fp4`	4× RTX PRO 6000 Blackwell @ 200k ctx
GLM-5.1-555B-A14B-REAP-GPTQ-W4A16	GPTQ W4A16	~297 GB	192	~14B	4× 80 GB	vllm / sglang `--quantization gptq_marlin`	Hopper (best), works on Ampere
GLM-5.1-555B-A14B-REAP-GGUF	GGUF (Q2–Q8)	~348 GB	192	~14B	Varies by quant	llama.cpp	CPU / Apple / consumer CUDA
GLM-5.1-444B-A14B-REAP-GGUF	GGUF (Q2–Q8)	~283 GB	154	~14B	Varies by quant	llama.cpp	CPU / Apple / consumer CUDA

Notes

NVFP4 on Hopper (H100/H200): supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
NVFP4 on B200 / Blackwell datacenter (sm_100): use flashinfer attention + b12x or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
NVFP4 on Blackwell Workstation (sm_120): use --attention-backend triton (not flashinfer — PCIe P2P atomics unavailable on the consumer board), --moe-runner-backend cutlass, --fp4-gemm-backend flashinfer_cudnn. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
GPTQ-W4A16 vs NVFP4: same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
REAP expert count variants (555B/444B): different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for 20% less VRAM.
Why NVFP4-478B-A42B-REAP is different: it's double-pruned (256 → 192 → 160 experts), optimized for a specific Blackwell Workstation 4×96GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.

Pointer to active inference recipe

See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.

Citation

@misc{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year={2025},
  eprint={2510.13999},
  archivePrefix={arXiv},
}

Downloads last month: 286

GGUF

Model size

563B params

Architecture

glm-dsa

Hardware compatibility

4-bit

Model tree for 0xSero/GLM-5.1-555B-A14B-REAP-GGUF

Base model

zai-org/GLM-5.1

Quantized

(30)

this model

Space using 0xSero/GLM-5.1-555B-A14B-REAP-GGUF 1

Paper for 0xSero/GLM-5.1-555B-A14B-REAP-GGUF

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Paper • 2510.13999 • Published Oct 15, 2025 • 19

0xSero
/

GLM-5.1-555B-A14B-REAP-GGUF

GLM-5.1 — 25% Expert Pruned (REAP) — Q4_K_M GGUF

Benchmark Results (inference mode, temp=0.8)

Degeneration Fuzz Test (45 probes)

Why 25% instead of 40%?

How to Use

Quantization Details

Related Models

Citation

Sponsors

GLM-5.1 REAP Family — Hardware Compatibility

Quick picker

Full family

Notes

Pointer to active inference recipe

Citation

Model tree for 0xSero/GLM-5.1-555B-A14B-REAP-GGUF

Space using 0xSero/GLM-5.1-555B-A14B-REAP-GGUF 1

Paper for 0xSero/GLM-5.1-555B-A14B-REAP-GGUF

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression