CortexLM/GLM-5.1-NVFP4-MTP

NVFP4 quantized version of zai-org/GLM-5.1, a 754B parameter Mixture-of-Experts language model with 256 routed experts per layer.

Quantized using NVIDIA Model Optimizer (modelopt) with full activation calibration on all 58,459 linear modules, including every individual routed expert.

Model Details

  • Base model: zai-org/GLM-5.1
  • Architecture: GlmMoeDsaForCausalLM (754B MoE)
  • Layers: 78 transformer layers + 1 MTP layer
  • Experts: 256 routed + 1 shared per MoE layer (layers 3-77)
  • Hidden size: 6144
  • Context length: 202,752 tokens
  • Quantization: NVFP4 (4-bit float weights, FP8 block scales, group size 16)
  • KV cache: FP8 quantized
  • MTP layer: BF16 (stored separately in mtp.safetensors)
  • Total size: ~441 GB (vs ~1.4 TB for the BF16 original)

Quantization Details

This model was quantized using NVIDIA's official Model Optimizer (modelopt) NVFP4 pipeline with per-expert activation calibration:

  • Quantization format: NVFP4 -- 4-bit floating point with FP8 per-block scaling factors (float8_e4m3fn) and a global FP32 weight_scale_2, block size of 16
  • Calibration: 256 samples from cnn_dailymail and nvidia/Nemotron-Post-Training-Dataset-v2 (chat, code, math, stem splits), sequence length 2048
  • Quantized modules: 58,459 nn.Linear modules, including all 256 routed experts per layer individually quantized with calibrated input_scale (activation statistics)
  • KV cache: FP8 cast quantization on all attention layers
  • Excluded: lm_head (kept in BF16)
  • MTP: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate mtp.safetensors file (19.9 GB)
  • Hardware: 8x NVIDIA B300 SXM6 275GB GPUs
  • Calibration time: ~21 minutes
  • modelopt version: 0.42.0.dev (from source, April 2026)
  • transformers version: 5.5.0
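To make the two-level scaling scheme above concrete, here is a minimal NumPy sketch of NVFP4 block quantization. The helper name and rounding details are illustrative, not modelopt's actual implementation:

```python
import numpy as np

FP4_MAX = 6.0    # largest E2M1 magnitude
FP8_MAX = 448.0  # largest float8_e4m3fn magnitude
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w, block=16):
    """Illustrative two-level NVFP4 scaling for a 1-D weight tensor."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block)
    # Global FP32 scale (weight_scale_2), chosen so per-block scales fit in FP8.
    weight_scale_2 = np.abs(w).max() / (FP4_MAX * FP8_MAX)
    # Per-block scale (weight_scale), stored as float8_e4m3fn in the checkpoint.
    weight_scale = np.abs(w).max(axis=1) / FP4_MAX / weight_scale_2
    # Bring each block into [-6, 6], then snap to the nearest E2M1 value.
    scaled = w / (weight_scale * weight_scale_2)[:, None]
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, weight_scale, weight_scale_2
```

Dequantization simply multiplies the FP4 values by both scale levels again, which is why the checkpoint stores `weight_scale` and `weight_scale_2` alongside the packed weights.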

Weight format

Each quantized linear layer is stored as:

  • weight: uint8 (two FP4 values packed per byte)
  • weight_scale: float8_e4m3fn (per-block FP8 scale, one per 16 elements)
  • weight_scale_2: float32 scalar (global per-tensor scale)
  • input_scale: float32 scalar (calibrated activation scale, where applicable)
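Under these conventions, dequantizing one packed row can be sketched as follows. This is a minimal NumPy sketch: the low-nibble-first packing order is an assumption, and real inference engines fuse this on the GPU:

```python
import numpy as np

# FP4 E2M1 code table: codes 0-7 are the positive magnitudes, 8-15 their negatives.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_row(weight, weight_scale, weight_scale_2, block=16):
    """Unpack uint8-packed FP4 codes and apply both scale levels."""
    codes = np.empty(weight.size * 2, dtype=np.uint8)
    codes[0::2] = weight & 0x0F   # low nibble first (assumed packing order)
    codes[1::2] = weight >> 4     # high nibble second
    values = E2M1_LUT[codes]
    # One FP8 block scale per 16 elements, then the global FP32 scale.
    return values * np.repeat(weight_scale.astype(np.float32), block) * weight_scale_2
```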

Usage

This checkpoint is designed for use with inference engines that support the NVFP4 format, such as TensorRT-LLM and vLLM with the modelopt quantization backend.
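For example, with a recent vLLM build the checkpoint could be served along these lines. This is a sketch, not a verified launch command; whether this model architecture and flag combination are supported depends on your vLLM version:

```shell
# Hypothetical launch -- check flag names against your installed vLLM version.
vllm serve CortexLM/GLM-5.1-NVFP4-MTP \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8
```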

Files

  • 85 model shards (model-00001-of-00085.safetensors to model-00085-of-00085.safetensors) -- NVFP4 quantized layers 0-77
  • mtp.safetensors -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
  • model.safetensors.index.json -- shard index mapping
  • config.json -- model configuration with quantization_config
  • hf_quant_config.json -- NVFP4 quantization metadata
  • tokenizer.json, tokenizer_config.json -- tokenizer files
  • generation_config.json -- generation defaults
