---
license: mit
base_model: zai-org/GLM-5.1
tags:
- nvidia
- nvfp4
- quantized
- moe
- modelopt
- glm
library_name: transformers
pipeline_tag: text-generation
---

# CortexLM/GLM-5.1-NVFP4-MTP

NVFP4 quantized version of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1), a 754B-parameter Mixture-of-Experts language model with 256 routed experts per layer.

Quantized using [NVIDIA Model Optimizer (modelopt)](https://github.com/NVIDIA/Model-Optimizer) with full activation calibration on all 58,459 linear modules, including every individual routed expert.

## Model Details

| | |
|---|---|
| **Base model** | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
| **Architecture** | GlmMoeDsaForCausalLM (754B MoE) |
| **Layers** | 78 transformer layers + 1 MTP layer |
| **Experts** | 256 routed + 1 shared per MoE layer (layers 3-77) |
| **Hidden size** | 6144 |
| **Context length** | 202,752 tokens |
| **Quantization** | NVFP4 (4-bit float weights, FP8 block scales, group size 16) |
| **KV cache** | FP8 quantized |
| **MTP layer** | BF16 (stored separately in `mtp.safetensors`) |
| **Total size** | ~441 GB (vs 1.4 TB BF16 original) |
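
The ~441 GB total is consistent with back-of-the-envelope arithmetic: each NVFP4 weight costs 4 bits plus one FP8 scale per 16-element block. The sketch below is illustrative only; it assumes the MTP parameters are counted inside the 754B total, and ignores the BF16 `lm_head`, per-tensor scalars, and file metadata, so it lands a few GB under the reported figure:

```python
# Rough size check for the NVFP4 checkpoint (approximate by design).
TOTAL_PARAMS = 754e9   # total parameters (from the model card)
MTP_BYTES = 19.9e9     # BF16 MTP layer kept unquantized
BLOCK_SIZE = 16        # NVFP4 group size

mtp_params = MTP_BYTES / 2                 # BF16 = 2 bytes per parameter
quant_params = TOTAL_PARAMS - mtp_params
# Each quantized weight: 4-bit value + one FP8 scale byte per 16-element block
bytes_per_weight = 0.5 + 1.0 / BLOCK_SIZE  # = 0.5625 bytes
total_gb = (quant_params * bytes_per_weight + MTP_BYTES) / 1e9
print(f"~{total_gb:.0f} GB")               # a few GB under the reported ~441 GB
```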

## Quantization Details

This model was quantized with NVIDIA's official [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) (`modelopt`) NVFP4 pipeline, with per-expert calibration:

- **Quantization format**: NVFP4 -- 4-bit floating point weights with FP8 per-block scaling factors (`float8_e4m3fn`), a global FP32 `weight_scale_2`, and a block size of 16
- **Calibration**: 256 samples from [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) and [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (chat, code, math, and stem splits), sequence length 2048
- **Quantized modules**: 58,459 `nn.Linear` modules, including all 256 routed experts per layer, each quantized individually with a calibrated `input_scale` (activation statistics)
- **KV cache**: FP8 cast quantization on all attention layers
- **Excluded**: `lm_head` (kept in BF16)
- **MTP**: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate `mtp.safetensors` file (19.9 GB)
- **Hardware**: 8x NVIDIA B300 SXM6 275GB GPUs
- **Calibration time**: ~21 minutes
- **modelopt version**: 0.42.0.dev (from source, April 2026)
- **transformers version**: 5.5.0
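
At the weight level, the per-block scheme amounts to: pick one scale per 16-element block so the block's largest magnitude maps to 6.0 (the top of the E2M1 grid), then round each scaled value to the nearest FP4 grid point. A minimal NumPy sketch of this fake-quantization step (illustrative only; the real modelopt pipeline additionally stores the block scales in FP8 under a global `weight_scale_2`, and calibrates `input_scale` from activation statistics):

```python
import numpy as np

# Positive E2M1 (FP4) grid: 2 exponent bits, 1 mantissa bit
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_nvfp4(w, block_size=16):
    """Simulate NVFP4 weight quantization (values only, no bit packing)."""
    w = np.asarray(w, dtype=np.float32).reshape(-1, block_size)
    # Per-block scale: largest magnitude maps to 6.0, the max E2M1 value
    scale = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    scaled = w / scale
    # Round each magnitude to the nearest FP4 grid point, keep the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).ravel()
```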

### Weight format

Each quantized linear layer is stored as:

- `weight`: `uint8` (two FP4 values packed per byte)
- `weight_scale`: `float8_e4m3fn` (per-block FP8 scale, one per 16 elements)
- `weight_scale_2`: `float32` scalar (global per-tensor scale)
- `input_scale`: `float32` scalar (calibrated activation scale, where applicable)
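
Given that layout, a tensor can be dequantized back to float32 roughly as follows. This is a sketch, not the engine's implementation: it assumes the low nibble of each byte holds the even-indexed element, and that the FP8 block scales have already been cast to float; verify the nibble order against your inference engine before relying on it.

```python
import numpy as np

# E2M1 (FP4) code -> value lookup; codes 8-15 are the negated values
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2, block_size=16):
    """Dequantize a packed NVFP4 tensor (flattened).

    packed         : uint8 array, two FP4 codes per byte
    weight_scale   : one scale per block (cast from float8_e4m3fn)
    weight_scale_2 : global float32 scalar
    """
    lo = packed & 0x0F  # assumed: low nibble = even-indexed element
    hi = packed >> 4    # assumed: high nibble = odd-indexed element
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = lo
    codes[1::2] = hi
    vals = FP4_VALUES[codes]
    # Apply the per-block FP8 scale, then the global per-tensor scale
    scales = np.repeat(np.asarray(weight_scale, dtype=np.float32), block_size)
    return vals * scales * weight_scale_2
```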

## Usage

This checkpoint is intended for inference engines that support the NVFP4 format, such as [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [vLLM](https://github.com/vllm-project/vllm) with the modelopt backend.

## Files

- 85 model shards (`model-00001-of-00085.safetensors` through `model-00085-of-00085.safetensors`) -- NVFP4 quantized layers 0-77
- `mtp.safetensors` -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
- `model.safetensors.index.json` -- shard index mapping
- `config.json` -- model configuration with `quantization_config`
- `hf_quant_config.json` -- NVFP4 quantization metadata
- `tokenizer.json`, `tokenizer_config.json` -- tokenizer files
- `generation_config.json` -- generation defaults

## Acknowledgements

- Base model by [ZhipuAI](https://huggingface.co/zai-org)
- Quantization tooling by [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)