---
license: mit
base_model: zai-org/GLM-5.1
tags:
- nvidia
- nvfp4
- quantized
- moe
- modelopt
- glm
library_name: transformers
pipeline_tag: text-generation
---

# CortexLM/GLM-5.1-NVFP4-MTP

NVFP4 quantized version of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1), a 754B parameter Mixture-of-Experts language model with 256 routed experts per layer.

Quantized using [NVIDIA Model Optimizer (modelopt)](https://github.com/NVIDIA/Model-Optimizer) with full activation calibration on all 58,459 linear modules, including every individual routed expert.

## Model Details

| | |
|---|---|
| **Base model** | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
| **Architecture** | GlmMoeDsaForCausalLM (754B MoE) |
| **Layers** | 78 transformer layers + 1 MTP layer |
| **Experts** | 256 routed + 1 shared per MoE layer (layers 3-77) |
| **Hidden size** | 6144 |
| **Context length** | 202,752 tokens |
| **Quantization** | NVFP4 (4-bit float weights, FP8 block scales, group size 16) |
| **KV cache** | FP8 quantized |
| **MTP layer** | BF16 (stored separately in `mtp.safetensors`) |
| **Total size** | ~441 GB (vs 1.4 TB BF16 original) |
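The reported size follows from the NVFP4 storage layout: half a byte per weight plus one FP8 scale byte per 16-element block. A rough check, treating all ~754B parameters as quantized (a slight overcount, since `lm_head` and the MTP layer stay in BF16):

```python
# Back-of-the-envelope NVFP4 size estimate:
# 4-bit codes (two per byte) plus one 1-byte FP8 scale per 16 elements;
# the per-tensor float32 scales are negligible.
params = 754e9
weight_bytes = params * 0.5   # packed FP4 codes
scale_bytes = params / 16     # float8_e4m3fn block scales
total_gb = (weight_bytes + scale_bytes) / 1e9
print(round(total_gb, 1))     # ~424 GB
```

The BF16 `lm_head`, embeddings, the separate 19.9 GB MTP file, and metadata account for the rest of the ballpark ~441 GB total.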

## Quantization Details

This model was quantized using NVIDIA's official [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) (`modelopt`) NVFP4 pipeline with proper per-expert calibration:

- **Quantization format**: NVFP4 -- 4-bit floating point with FP8 per-block scaling factors (`float8_e4m3fn`) and a global FP32 `weight_scale_2`, block size of 16
- **Calibration**: 256 samples from [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) and [nvidia/Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) (chat, code, math, stem splits), sequence length 2048
- **Quantized modules**: 58,459 `nn.Linear` modules, including all 256 routed experts per layer individually quantized with calibrated `input_scale` (activation statistics)
- **KV cache**: FP8 cast quantization on all attention layers
- **Excluded**: `lm_head` (kept in BF16)
- **MTP**: Multi-Token Prediction layer (layer 78) kept in BF16 as a separate `mtp.safetensors` file (19.9 GB)
- **Hardware**: 8x NVIDIA B300 SXM6 275GB GPUs
- **Calibration time**: ~21 minutes
- **modelopt version**: 0.42.0.dev (from source, April 2026)
- **transformers version**: 5.5.0

### Weight format

Each quantized linear layer is stored as:
- `weight`: `uint8` (two FP4 values packed per byte)
- `weight_scale`: `float8_e4m3fn` (per-block FP8 scale, one per 16 elements)
- `weight_scale_2`: `float32` scalar (global per-tensor scale)
- `input_scale`: `float32` scalar (calibrated activation scale, where applicable)
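As a concrete illustration of this layout, the sketch below dequantizes one packed NVFP4 tensor with NumPy. The E2M1 code table is the standard FP4 value set; the nibble order (low nibble first) and the exact scale application are assumptions for illustration, not verified against this checkpoint.

```python
import numpy as np

# Standard FP4 E2M1 code-to-value table: codes 0-7 positive, 8-15 negative.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2, block_size=16):
    """Dequantize a flat NVFP4 weight tensor.

    packed:         uint8 array, two FP4 codes per byte (low nibble first -- assumed)
    weight_scale:   per-block scales (FP8 values upcast to float32), one per block_size elements
    weight_scale_2: global float32 scalar scale
    """
    low = packed & 0x0F
    high = packed >> 4
    codes = np.stack([low, high], axis=-1).reshape(-1)   # interleave the two nibbles
    vals = E2M1_LUT[codes]
    # One FP8 block scale per group of `block_size` elements, then the global scale.
    blocks = vals.reshape(-1, block_size) * weight_scale.reshape(-1, 1)
    return (blocks.reshape(-1) * weight_scale_2).astype(np.float32)
```

For example, eight bytes of `0x32` decode to 16 values alternating between the FP4 codes 2 (value 1.0) and 3 (value 1.5) before scaling.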

## Usage

This checkpoint is designed for use with inference engines that support the NVFP4 format, such as [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [vLLM](https://github.com/vllm-project/vllm) with the modelopt quantization backend.
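A vLLM deployment might look like the following. This is an untested sketch: flag support for this architecture and for NVFP4 depends on your vLLM build, and NVFP4 execution requires Blackwell-class GPUs.

```shell
# Hypothetical launch command; adjust parallelism to your hardware.
vllm serve CortexLM/GLM-5.1-NVFP4-MTP \
  --quantization modelopt \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8
```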

## Files

- 85 model shards (`model-00001-of-00085.safetensors` to `model-00085-of-00085.safetensors`) -- NVFP4 quantized layers 0-77
- `mtp.safetensors` -- BF16 Multi-Token Prediction layer (layer 78, 791 keys, 19.9 GB)
- `model.safetensors.index.json` -- shard index mapping
- `config.json` -- model configuration with `quantization_config`
- `hf_quant_config.json` -- NVFP4 quantization metadata
- `tokenizer.json`, `tokenizer_config.json` -- tokenizer files
- `generation_config.json` -- generation defaults
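Individual tensors can be located across the 85 shards through the index file, which follows the standard Hugging Face layout; a minimal helper (the example tensor name is hypothetical):

```python
import json

def shard_for(index, tensor_name):
    """Return the shard filename that stores `tensor_name`.

    `index` is the parsed model.safetensors.index.json; its
    "weight_map" maps tensor names to shard filenames.
    """
    return index["weight_map"][tensor_name]

# Typical use against the real index file:
# with open("model.safetensors.index.json") as f:
#     index = json.load(f)
# shard_for(index, "model.embed_tokens.weight")  # tensor name is illustrative
```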

## Acknowledgements

- Base model by [ZhipuAI](https://huggingface.co/zai-org)
- Quantization tooling by [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer)