CortexLM/GLM-5.1-NVFP4-MTP

NVFP4 quantized version of zai-org/GLM-5.1, a 754B parameter Mixture-of-Experts language model with 256 routed experts per layer.

Quantized using NVIDIA Model Optimizer (modelopt) with full activation calibration on all 58,459 linear modules, including every individual routed expert.

Model Details

  • Base model: zai-org/GLM-5.1
  • Architecture: GlmMoeDsaForCausalLM (754B MoE)
  • Layers: 78 transformer layers + 1 MTP layer
  • Experts: 256 routed + 1 shared per MoE layer (layers 3-77)
  • Hidden size: 6144
  • Context length: 202,752 tokens
  • Quantization: NVFP4 (4-bit float weights, FP8 block scales, group size 16)
  • KV cache: FP8 quantized
  • MTP layer: BF16 (stored separately in mtp.safetensors)
  • Total size: ~441 GB (vs ~1.4 TB for the BF16 original)

Quantization Details

This model was quantized using NVIDIA's official Model Optimizer (modelopt) NVFP4 pipeline with per-expert activation calibration:

  • Quantization format: NVFP4 -- 4-bit floating point with FP8 per-block scaling factors (float8_e4m3fn) and a global FP32 weight_scale_2, block size of 16
  • Calibration: 256 samples from cnn_dailymail and nvidia/Nemotron-Post-Training-Dataset-v2 (chat, code, math, stem splits), sequence length 2048
  • Quantized modules: 58,459 nn.Linear modules, including all 256 routed experts per layer individually quantized with calibrated input_scale (activation statistics)
  • KV cache: FP8 cast quantization on all attention layers
  • Excluded: lm_head (kept in BF16)
  • MTP: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate mtp.safetensors file (19.9 GB)
  • Hardware: 8x NVIDIA B300 SXM6 275GB GPUs
  • Calibration time: ~21 minutes
  • modelopt version: 0.42.0.dev (from source, April 2026)
  • transformers version: 5.5.0
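To make the two-level scaling scheme above concrete, here is a minimal NumPy sketch of NVFP4 block quantization. The helper name and rounding details are illustrative, not modelopt's actual implementation:

```python
import numpy as np

FP4_MAX = 6.0    # largest E2M1 magnitude
FP8_MAX = 448.0  # largest float8_e4m3fn magnitude
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w, block=16):
    """Illustrative two-level NVFP4 scaling for a 1-D weight tensor."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block)
    # Global FP32 scale (weight_scale_2), chosen so per-block scales fit in FP8.
    weight_scale_2 = np.abs(w).max() / (FP4_MAX * FP8_MAX)
    # Per-block scale (weight_scale), stored as float8_e4m3fn in the checkpoint.
    weight_scale = np.abs(w).max(axis=1) / FP4_MAX / weight_scale_2
    # Bring each block into [-6, 6], then snap to the nearest E2M1 value.
    scaled = w / (weight_scale * weight_scale_2)[:, None]
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, weight_scale, weight_scale_2
```

Dequantization simply multiplies the FP4 values by both scale levels again, which is why the checkpoint stores `weight_scale` and `weight_scale_2` alongside the packed weights.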

Weight format

Each quantized linear layer is stored as:

  • weight: uint8 (two FP4 values packed per byte)
  • weight_scale: float8_e4m3fn (per-block FP8 scale, one per 16 elements)
  • weight_scale_2: float32 scalar (global per-tensor scale)
  • input_scale: float32 scalar (calibrated activation scale, where applicable)
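Under these conventions, dequantizing one packed row can be sketched as follows. This is a minimal NumPy sketch: the low-nibble-first packing order is an assumption, and real inference engines fuse this on the GPU:

```python
import numpy as np

# FP4 E2M1 code table: codes 0-7 are the positive magnitudes, 8-15 their negatives.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_row(weight, weight_scale, weight_scale_2, block=16):
    """Unpack uint8-packed FP4 codes and apply both scale levels."""
    codes = np.empty(weight.size * 2, dtype=np.uint8)
    codes[0::2] = weight & 0x0F   # low nibble first (assumed packing order)
    codes[1::2] = weight >> 4     # high nibble second
    values = E2M1_LUT[codes]
    # One FP8 block scale per 16 elements, then the global FP32 scale.
    return values * np.repeat(weight_scale.astype(np.float32), block) * weight_scale_2
```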

Usage

This checkpoint is designed for use with inference engines that support the NVFP4 format, such as TensorRT-LLM and vLLM with the modelopt quantization backend.
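For example, with a recent vLLM build the checkpoint could be served along these lines. This is a sketch, not a verified launch command; whether this model architecture and flag combination are supported depends on your vLLM version:

```shell
# Hypothetical launch -- check flag names against your installed vLLM version.
vllm serve CortexLM/GLM-5.1-NVFP4-MTP \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8
```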

Files

  • 85 model shards (model-00001-of-00085.safetensors to model-00085-of-00085.safetensors) -- NVFP4 quantized layers 0-77
  • mtp.safetensors -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
  • model.safetensors.index.json -- shard index mapping
  • config.json -- model configuration with quantization_config
  • hf_quant_config.json -- NVFP4 quantization metadata
  • tokenizer.json, tokenizer_config.json -- tokenizer files
  • generation_config.json -- generation defaults
