Model Description
GLM-5.1-NVFP4 is an NVFP4-quantized version of zai-org/GLM-5.1, a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).
It was quantized to NVFP4 (4-bit weights with blockwise FP8 scales, one scale per 16 elements) directly from the full BF16 checkpoint (zai-org/GLM-5.1), not from the FP8 release, using NVIDIA Model Optimizer.
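As a concrete illustration of the NVFP4 layout described above (4-bit values with one blockwise scale per 16 elements), here is a minimal sketch. The E2M1 value grid and the map-max-to-6 scale rule are standard FP4 conventions, not taken from Model Optimizer's actual implementation, and the FP8 storage of the scale itself is elided for clarity:

```python
# Illustrative sketch of NVFP4-style blockwise quantization (assumption:
# standard FP4/E2M1 semantics; NOT the Model Optimizer implementation).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in 4-bit E2M1

def quantize_block(block):
    """Quantize one 16-element block to (scale, FP4 codes)."""
    assert len(block) == 16
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax else 1.0  # map the largest magnitude onto E2M1's max (6)
    codes = []
    for x in block:
        mag = min(E2M1, key=lambda m: abs(abs(x) / scale - m))  # nearest FP4 magnitude
        codes.append((mag, -1.0 if x < 0 else 1.0))
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct approximate BF16-domain values from (scale, codes)."""
    return [sign * mag * scale for mag, sign in codes]
```

Each 16-element block thus costs 16 × 4 bits of codes plus one 8-bit scale, i.e. 4.5 effective bits per weight.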
What's quantized
Only the non-shared MoE expert MLP projections are quantized to NVFP4; the attention weights, the dense MLPs (layers 0-3), and the shared experts remain in BF16. Since the routed expert weights constitute the vast majority of parameters in an MoE architecture, this still yields significant memory savings.
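A rough back-of-envelope estimate of those savings, using the stated 744B total parameters and 4.5 effective bits per NVFP4 weight (4-bit code plus one FP8 scale per 16 weights); the fraction of parameters living in the routed experts is a free variable here, not a number from the model card:

```python
# Back-of-envelope checkpoint-size estimate (illustrative only; the
# expert_fraction argument is an assumption, not a published figure).
TOTAL_PARAMS = 744e9
NVFP4_BITS = 4 + 8 / 16   # 4.5 effective bits per quantized weight
BF16_BITS = 16

def model_bytes(expert_fraction):
    """Approximate size if expert_fraction of params are NVFP4, rest BF16."""
    quantized = TOTAL_PARAMS * expert_fraction * NVFP4_BITS / 8
    kept_bf16 = TOTAL_PARAMS * (1 - expert_fraction) * BF16_BITS / 8
    return quantized + kept_bf16
```

For expert fractions in the 90-95% range this lands in the mid-400s of GB, versus roughly 1.5 TB for a pure BF16 checkpoint, which is the right order of magnitude for the 434 GB repository size reported below.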
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate for experts that are activated only rarely, calibration was run on a much larger number of samples than is typical, ensuring broad expert coverage through natural routing alone.
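To see why more samples translate into broader expert coverage, here is a toy sketch of top-k gating with the model's 256-expert / top-8 configuration. The random router logits are a stand-in for the real gating network (real routing is far more skewed, which is exactly why the larger calibration set is needed):

```python
import random

# Toy model of "natural" top-k MoE routing during calibration
# (assumption: standard top-k gating; random logits stand in for
# the real, much more skewed, GLM-5.1 router).
NUM_EXPERTS, TOP_K = 256, 8

def route(logits, k=TOP_K):
    """Pick the k experts with the highest router logits for one token."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]

def coverage(num_tokens, seed=0):
    """Fraction of experts that see at least one calibration token."""
    rng = random.Random(seed)
    hits = set()
    for _ in range(num_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
        hits.update(route(logits))
    return len(hits) / NUM_EXPERTS
```

With uniform random logits, a few hundred tokens already touch every expert; with a real router's skewed distribution, the tail experts need many more samples before their activation statistics are trustworthy.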
Calibration dataset
Three calibration passes were run:
- Coding pass: Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
- Broad pass: Large-scale diverse samples drawn from WildChat-NonToxic and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
- Deep pass: Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.
Requirements
Hardware: 8x RTX PRO 6000 Blackwell 96GB (b12x MoE runner recommended)
Note: You must run sglang with --disable-shared-experts-fusion; otherwise it will incorrectly attempt to fuse the BF16 shared expert.
Community Testing
Docker Image: voipmonitor/sglang:cu130 (festr, 6 days old, has b12x built-in)
Model: lukealonso/GLM-5.1-NVFP4 (434 GB, glm_moe_dsa, 78 layers, 256 experts)
Launch command:
```shell
export OMP_NUM_THREADS=16
export SGLANG_ENABLE_SPEC_V2=True
export NVIDIA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8 # 8x Blackwell

python -m sglang.launch_server \
  --model-path /path/to/lukealonso/GLM-5.1-NVFP4 \
  --served-model-name GLM-5.1 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --tp 8 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --kv-cache-dtype bf16 \
  --fp4-gemm-backend b12x \
  --attention-backend flashinfer \
  --moe-runner-backend b12x \
  --disable-shared-experts-fusion \
  --mem-fraction-static 0.85 \
  --max-running-requests 64 \
  --cuda-graph-max-bs 32 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --host 0.0.0.0 --port 5000
```
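Once the server is up, it can be queried like any OpenAI-compatible endpoint (assumption: sglang's standard /v1/chat/completions route; the model id matches the --served-model-name flag above). A minimal client sketch:

```python
import json

# Minimal client sketch for the server launched above (assumption:
# sglang exposes an OpenAI-compatible /v1/chat/completions endpoint
# on the host/port configured in the launch command).
def chat_payload(prompt, model="GLM-5.1", max_tokens=256):
    """Build a chat-completions request body for a single user turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:5000/v1/chat/completions",
#     data=json.dumps(chat_payload("Hello")).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```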
Results (8x RTX PRO 6000 Blackwell 96GB, driver 595):
| Metric          | tok/s |
|-----------------|-------|
| Short TG        | 95-99 |
| 32K TG          | 74    |
| 128K TG         | 73    |
| Concurrent c=32 | 751   |
| Concurrent c=64 | 1058  |
Model tree for lukealonso/GLM-5.1-NVFP4
- Base model: zai-org/GLM-5.1