---
license: mit
base_model: zai-org/GLM-5.1
tags:
- nvidia
- nvfp4
- quantized
- moe
- modelopt
- glm
library_name: transformers
pipeline_tag: text-generation
---

# CortexLM/GLM-5.1-NVFP4-MTP

NVFP4 quantized version of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1), a 754B-parameter Mixture-of-Experts language model with 256 routed experts per layer. Quantized with [NVIDIA Model Optimizer (modelopt)](https://github.com/NVIDIA/Model-Optimizer) using full activation calibration on all 58,459 linear modules, including every individual routed expert.

## Model Details

| | |
|---|---|
| **Base model** | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
| **Architecture** | GlmMoeDsaForCausalLM (754B MoE) |
| **Layers** | 78 transformer layers + 1 MTP layer |
| **Experts** | 256 routed + 1 shared per MoE layer (layers 3-77) |
| **Hidden size** | 6144 |
| **Context length** | 202,752 tokens |
| **Quantization** | NVFP4 (4-bit float weights, FP8 block scales, group size 16) |
| **KV cache** | FP8 quantized |
| **MTP layer** | BF16 (stored separately in `mtp.safetensors`) |
| **Total size** | ~441 GB (vs. 1.4 TB for the BF16 original) |

## Quantization Details

This model was quantized using NVIDIA's official [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) (`modelopt`) NVFP4 pipeline with per-expert calibration:

- **Quantization format**: NVFP4 -- 4-bit floating point with FP8 per-block scaling factors (`float8_e4m3fn`), a global FP32 `weight_scale_2`, and a block size of 16
- **Calibration**: 256 samples from [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) and [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (chat, code, math, and stem splits), sequence length 2048
- **Quantized modules**: 58,459 `nn.Linear` modules, with all 256 routed experts per layer quantized individually with calibrated `input_scale` (activation statistics)
- **KV cache**: FP8 cast quantization on all attention layers
- **Excluded**: `lm_head` (kept in BF16)
- **MTP**: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate `mtp.safetensors` file (19.9 GB)
- **Hardware**: 8x NVIDIA B300 SXM6 275GB GPUs
- **Calibration time**: ~21 minutes
- **modelopt version**: 0.42.0.dev (built from source, April 2026)
- **transformers version**: 5.5.0

### Weight format

Each quantized linear layer is stored as:

- `weight`: `uint8` (two FP4 values packed per byte)
- `weight_scale`: `float8_e4m3fn` (per-block FP8 scale, one per 16 elements)
- `weight_scale_2`: `float32` scalar (global per-tensor scale)
- `input_scale`: `float32` scalar (calibrated activation scale, where applicable)

## Usage

This checkpoint is intended for inference engines that support the NVFP4 format, such as [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [vLLM](https://github.com/vllm-project/vllm) with the modelopt backend.

## Files

- 85 model shards (`model-00001-of-00085.safetensors` to `model-00085-of-00085.safetensors`) -- NVFP4-quantized layers 0-77
- `mtp.safetensors` -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
- `model.safetensors.index.json` -- shard index mapping
- `config.json` -- model configuration with `quantization_config`
- `hf_quant_config.json` -- NVFP4 quantization metadata
- `tokenizer.json`, `tokenizer_config.json` -- tokenizer files
- `generation_config.json` -- generation defaults

## Acknowledgements

- Base model by [ZhipuAI](https://huggingface.co/zai-org)
- Quantization tooling by [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)
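## Appendix: serving sketch

For the vLLM path mentioned under Usage, a launch might look like the fragment below. This is a sketch only: the flags are taken from vLLM's generic CLI, and whether this particular checkpoint (and its separate MTP layer) loads out of the box depends on your vLLM build's NVFP4 and GLM support.

```shell
# Hypothetical launch; verify flag support in your vLLM version first.
vllm serve CortexLM/GLM-5.1-NVFP4-MTP \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8
```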
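## Appendix: unpacking NVFP4 weights

The weight format described above can be unpacked by hand, e.g. when inspecting individual shards with `safetensors`. The sketch below is illustrative, not the inference-engine kernel: the function name is made up, the low-nibble-first packing order is an assumption (check your loader), and the per-block `float8_e4m3fn` scales are shown already upcast to `float32`.

```python
import numpy as np

# FP4 (E2M1) code -> value lookup table:
# codes 0-7 are positive, codes 8-15 are the same magnitudes negated.
FP4_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

BLOCK = 16  # group size from the quantization config


def dequantize_nvfp4(packed, weight_scale, weight_scale_2):
    """Dequantize a flat NVFP4 tensor.

    packed         : uint8 array, two FP4 codes per byte
                     (low nibble first -- an assumption).
    weight_scale   : one scale per 16-element block; stored as
                     float8_e4m3fn in the checkpoint, upcast here.
    weight_scale_2 : global float32 per-tensor scale.
    """
    lo = packed & 0x0F          # first code in each byte
    hi = packed >> 4            # second code in each byte
    codes = np.stack([lo, hi], axis=-1).reshape(-1)  # interleave nibbles
    vals = FP4_LUT[codes]
    # broadcast one FP8 block scale over each 16-element group
    scales = np.repeat(weight_scale.astype(np.float32), BLOCK)
    return vals * scales * np.float32(weight_scale_2)
```

For example, a byte `0x52` holds codes 2 and 5 (values 1.0 and 3.0); with a block scale of 2.0 and a global scale of 0.5 these dequantize to 1.0 and 3.0.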