Part of the **Proven REAPs** collection: benchmarked REAP checkpoints with >=500 all-time downloads (GLM/Qwen/MiniMax/DeepSeek/Kimi/Gemma).
Support this work: donate.sybilsolutions.ai
REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection
Expert-pruned GLM-5 (744B -> ~372B params, 256 -> 128 routed experts) in bf16 GGUF format for llama.cpp inference. This is the full-precision intermediate used to produce quantized GGUFs.
| Property | Value |
|---|---|
| Base model | zai-org/GLM-5 (744B, 256 routed experts) |
| Pruning | REAP saliency pruning, 50% expert removal (256 -> 128 experts) |
| Format | bf16 GGUF (full precision, no quantization loss) |
| Size | ~711 GB |
| Architecture | GlmMoeDsaForCausalLM (MLA + Mixture of Experts + DSA indexer) |
| Context | 202,752 tokens |
| Active params | ~20B per token (8 of 128 experts selected) |
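The pruning row above describes removing 50% of the routed experts by saliency (256 -> 128). As a toy illustration only (the actual REAP saliency metric is defined in the linked paper, not here), keeping the top-scoring half of 256 experts can be sketched as:

```python
import numpy as np

def prune_experts(saliency, keep_ratio=0.5):
    """Toy sketch: keep the highest-saliency experts. Not the real REAP metric."""
    n_keep = int(len(saliency) * keep_ratio)
    # indices of the retained experts, sorted to preserve the original layout
    return np.sort(np.argsort(saliency)[-n_keep:])

rng = np.random.default_rng(0)
scores = rng.random(256)      # one hypothetical saliency score per routed expert
kept = prune_experts(scores)
print(len(kept))              # 128 experts survive, matching 256 -> 128
```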
This bf16 GGUF serves as the source for all quantized variants. Use it to create quantized GGUFs with `llama-quantize` (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc.):

| Variant | BPW | Size | Repo |
|---|---|---|---|
| Q3_K_M | 3.82 | ~170 GB | 0xSero/GLM-5-REAP-50pct-Q3_K_M-GGUF |
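As a sanity check on the table, quantized GGUF size can be roughly estimated as parameters times bits-per-weight divided by 8 (ignoring metadata and the higher-precision tensors some quant schemes keep):

```python
# Rough size estimate from the figures in the tables above.
params = 372e9                      # ~372B params after 50% expert pruning
bpw = 3.82                          # Q3_K_M bits per weight
size_gib = params * bpw / 8 / 2**30
print(round(size_gib))              # ~165 GiB, consistent with the ~170 GB listed
```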
```bash
# Download
huggingface-cli download 0xSero/GLM-5-REAP-50pct-BF16-GGUF --local-dir GLM-5-REAP-50pct-BF16-GGUF

# Quantize to any format
llama-quantize GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-Q4_K_M.gguf Q4_K_M

# With an importance matrix for better quality
llama-quantize --imatrix imatrix.dat GLM-5-REAP-50pct-BF16-GGUF/GLM-5-REAP-50pct-BF16.gguf output-IQ3_M.gguf IQ3_M
```
Converted from the original safetensors with `convert_hf_to_gguf.py` (auto-dequants FP8, splits the fused gate_up_proj, handles 3D expert tensors).

Note: the original FP8 safetensors model has a known KV-cache NaN bug in HuggingFace Transformers' `GlmMoeDsaAttention` implementation. llama.cpp bypasses this entirely with its own inference engine, producing correct output with a working KV cache.
Thank you to the kind sponsors; this work wouldn't be possible without them.