# CortexLM/GLM-5.1-NVFP4-MTP
NVFP4 quantized version of zai-org/GLM-5.1, a 754B parameter Mixture-of-Experts language model with 256 routed experts per layer.
Quantized with NVIDIA Model Optimizer (modelopt) using full activation calibration on all 58,459 linear modules, including every individual routed expert.
## Model Details

| Field | Value |
|---|---|
| Base model | zai-org/GLM-5.1 |
| Architecture | GlmMoeDsaForCausalLM (754B MoE) |
| Layers | 78 transformer layers + 1 MTP layer |
| Experts | 256 routed + 1 shared per MoE layer (layers 3-77) |
| Hidden size | 6144 |
| Context length | 202,752 tokens |
| Quantization | NVFP4 (4-bit float weights, FP8 block scales, group size 16) |
| KV cache | FP8 quantized |
| MTP layer | BF16 (stored separately in `mtp.safetensors`) |
| Total size | ~441 GB (vs 1.4 TB BF16 original) |
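As a rough sanity check, the reported total is close to what the table implies: 754B parameters at 4 bits each, plus one FP8 block scale per 16 elements, plus the BF16 MTP layer. (Treating GB as 10^9 bytes and ignoring the BF16 `lm_head`, embeddings, and per-tensor scales are assumptions here, not statements from the card.)

```python
# Back-of-envelope size check using figures from the table above.
# Assumptions: GB = 1e9 bytes; BF16 lm_head, embeddings, and per-tensor
# scales are small enough to ignore.
params = 754e9
weights = params * 4 / 8          # 4-bit weights            -> ~377 GB
block_scales = params / 16        # one FP8 byte per 16 vals -> ~47 GB
mtp = 19.9e9                      # BF16 MTP layer, stored separately
total_gb = (weights + block_scales + mtp) / 1e9
print(round(total_gb))            # ~444, close to the reported ~441 GB
```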
## Quantization Details

This model was quantized using NVIDIA's official Model Optimizer (modelopt) NVFP4 pipeline with proper per-expert calibration:

- Quantization format: NVFP4 -- 4-bit floating point weights with FP8 per-block scaling factors (`float8_e4m3fn`), a global FP32 `weight_scale_2`, and a block size of 16
- Calibration: 256 samples from cnn_dailymail and nvidia/Nemotron-Post-Training-Dataset-v2 (chat, code, math, stem splits), sequence length 2048
- Quantized modules: 58,459 `nn.Linear` modules, including all 256 routed experts per layer, each individually quantized with a calibrated `input_scale` (activation statistics)
- KV cache: FP8 cast quantization on all attention layers
- Excluded: `lm_head` (kept in BF16)
- MTP: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate `mtp.safetensors` file (19.9 GB)
- Hardware: 8x NVIDIA B300 SXM6 275GB GPUs
- Calibration time: ~21 minutes
- modelopt version: 0.42.0.dev (from source, April 2026)
- transformers version: 5.5.0
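For orientation, a modelopt NVFP4 run of this shape follows the library's standard `quantize` flow. The sketch below is illustrative only: the actual calibration script, dataloader, and export step used for this checkpoint are not reproduced here, and `calib_dataloader` is a hypothetical stand-in.

```python
import modelopt.torch.quantization as mtq

# Illustrative sketch of the modelopt NVFP4 flow, not the exact script used.
def forward_loop(model):
    # Run calibration samples through the model so modelopt can record
    # activation statistics (input_scale) for every quantized nn.Linear.
    for batch in calib_dataloader:  # hypothetical calibration dataloader
        model(**batch)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```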
## Weight format

Each quantized linear layer is stored as:

- `weight`: `uint8` (two FP4 values packed per byte)
- `weight_scale`: `float8_e4m3fn` (per-block FP8 scale, one per 16 elements)
- `weight_scale_2`: `float32` scalar (global per-tensor scale)
- `input_scale`: `float32` scalar (calibrated activation scale, where applicable)
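Concretely, each element is recovered as `fp4(weight) * weight_scale * weight_scale_2`. A minimal NumPy sketch of this unpacking, assuming the low nibble of each byte holds the even-indexed element (the nibble order is an assumption, not documented above):

```python
import numpy as np

# E2M1 (FP4) magnitudes, indexed by the low 3 bits; the 4th bit is the sign.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2, block_size=16):
    """Dequantize a flat NVFP4 tensor.

    packed         : uint8 array, two FP4 values per byte
    weight_scale   : one scale per `block_size` elements, already cast
                     from float8_e4m3fn to float32
    weight_scale_2 : global float32 scalar
    """
    lo = packed & 0x0F                 # assumed: low nibble = even index
    hi = packed >> 4
    nibbles = np.stack([lo, hi], axis=-1).reshape(-1)
    sign = np.where(nibbles & 0x8, -1.0, 1.0).astype(np.float32)
    vals = sign * E2M1[nibbles & 0x7]
    # Apply the per-block FP8 scale, then the global per-tensor scale.
    vals = vals.reshape(-1, block_size) * weight_scale.reshape(-1, 1)
    return (vals * np.float32(weight_scale_2)).reshape(-1)

# Toy example: one block of 16 elements (8 packed bytes of 0x21,
# i.e. nibbles 0x1 -> 0.5 and 0x2 -> 1.0).
packed = np.full(8, 0x21, dtype=np.uint8)
scale = np.array([2.0], dtype=np.float32)      # one block scale
out = dequantize_nvfp4(packed, scale, 0.5)
print(out)  # alternating 0.5 and 1.0 (combined scale 2.0 * 0.5 = 1.0)
```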
## Usage

This checkpoint is designed for use with inference engines that support the NVFP4 format, such as TensorRT-LLM and vLLM with the modelopt backend.
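A minimal vLLM sketch, assuming an NVFP4-capable build; the parallelism degree and KV-cache flag below are illustrative values, not requirements:

```python
from vllm import LLM, SamplingParams

# Illustrative only: the quantization format is normally auto-detected from
# the checkpoint's hf_quant_config.json; adjust tensor_parallel_size to
# match your GPU count.
llm = LLM(
    model="CortexLM/GLM-5.1-NVFP4-MTP",
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is NVFP4 quantization?"], params)
print(outputs[0].outputs[0].text)
```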
## Files

- `model-00001-of-00085.safetensors` to `model-00085-of-00085.safetensors` -- 85 shards of NVFP4 quantized layers 0-77
- `mtp.safetensors` -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
- `model.safetensors.index.json` -- shard index mapping
- `config.json` -- model configuration with `quantization_config`
- `hf_quant_config.json` -- NVFP4 quantization metadata
- `tokenizer.json`, `tokenizer_config.json` -- tokenizer files
- `generation_config.json` -- generation defaults
## Acknowledgements
- Base model by ZhipuAI
- Quantization tooling by NVIDIA Model Optimizer