# GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed

NVFP4-quantized GLM-4.7-Flash-REAP-23B-A3B: a 23B-parameter MoE reasoning model (3B active) that fits in 12 GB VRAM and runs 44K context on a single RTX 5080.
## What's special

- Full NVFP4: all Linear layers, `lm_head`, and `embed_tokens` quantized to NVFP4
- FP8 KV cache: MLA attention with 512-dim compressed KV, stored in FP8
- 44,000-token context on 16 GB VRAM (vs ~11K without lm_head/embed quantization)
- 12.1 GiB single-file `model.safetensors`
- Thinking/reasoning support (`<think>` tokens) with `--reasoning-parser deepseek_r1`
## Hardware

- GPU: NVIDIA RTX 5080 (16 GB VRAM, SM 12.0 / Blackwell)
- NVFP4 GEMM: Marlin backend (FlashInfer FP4 JIT fails on SM 12.0)
- MLA backend: Triton MLA (`FLASHINFER_MLA` requires `qk_nope_head_dim=128`; this model has 192)
Should work on any Blackwell GPU (RTX 50-series, B-series datacenter). Context length scales with available VRAM.
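As a rough sketch of that scaling, the per-block KV cost used in the serve command below (47 layers × a 576-dim compressed MLA KV head × 1 byte FP8 × 16 tokens per block) lets you solve for context length given a KV-cache byte budget; the budget value passed in here is illustrative:

```python
# Estimate context length from a KV-cache byte budget (FP8 MLA KV).
LAYERS = 47
KV_HEAD_DIM = 576        # per-token compressed MLA KV width, per layer
FP8_BYTES = 1
TOKENS_PER_BLOCK = 16

BYTES_PER_BLOCK = LAYERS * KV_HEAD_DIM * FP8_BYTES * TOKENS_PER_BLOCK

def max_context(kv_budget_bytes: int) -> int:
    """Largest context (in tokens) whose FP8 KV cache fits in the budget."""
    return (kv_budget_bytes // BYTES_PER_BLOCK) * TOKENS_PER_BLOCK

print(BYTES_PER_BLOCK)                       # 433152, matching the serve script
print(max_context(2750 * BYTES_PER_BLOCK))   # 44000 tokens at 2,750 blocks
```

More free VRAM means a larger budget, more blocks, and proportionally more context.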
## Quick start

### 1. Clone and build vLLM

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 628302114   # v0.16.1rc1.dev34

# Apply patches (required for NVFP4 lm_head, embed_tokens, MLA FP8 KV cache)
git apply ../GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed/vllm_patches.diff

# Build
pip install -e . --no-build-isolation
```
### 2. Install dependencies

```bash
pip install flashinfer==0.6.4   # or latest compatible
```
### 3. Serve

```bash
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True
export VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=$((64 * 1024 * 1024))

BLOCKS=2750
BYTES_PER_BLOCK=433152            # 47 layers x 576 MLA head x 1 byte FP8 x 16 tokens
MAX_MODEL_LEN=$((BLOCKS * 16))    # 44,000

vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len $MAX_MODEL_LEN \
  --no-enable-prefix-caching \
  --max-num-seqs 1 \
  --kv-cache-memory-bytes $((BLOCKS * BYTES_PER_BLOCK)) \
  --num-gpu-blocks-override $BLOCKS \
  --tensor-parallel-size 1 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 256 \
  --enforce-eager \
  --reasoning-parser deepseek_r1 \
  --override-generation-config '{"temperature": 0.0, "max_tokens": 44000}'
```
### 4. Test

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed",
       "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
  | python3 -m json.tool
```
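The same request from Python, using only the standard library (endpoint and model name as in the curl example above; the server from step 3 must be running before `chat()` is called):

```python
import json
from urllib import request

URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

def chat() -> dict:
    """POST the request to the running vLLM server and decode the reply."""
    req = request.Request(
        URL,
        data=json.dumps(PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server running:
# print(chat()["choices"][0]["message"]["content"])
```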
## Serve command flags explained

| Flag | Why |
|---|---|
| `--trust-remote-code` | Required for the GLM4 MoE architecture |
| `--quantization modelopt` | NVFP4 weight format from NVIDIA Model Optimizer |
| `--kv-cache-dtype fp8_e4m3` | FP8 KV cache: halves KV memory vs BF16 |
| `--no-enable-prefix-caching` | Prefix caching is incompatible with MLA + FP8 KV |
| `--max-num-seqs 1` | Single sequence, to stay within the 16 GB VRAM budget |
| `--kv-cache-memory-bytes` | Precise KV allocation (bypasses the `gpu_memory_utilization` startup check) |
| `--num-gpu-blocks-override` | Binary-searched max stable block count for the RTX 5080 |
| `--enforce-eager` | Disables CUDA graphs, saving VRAM |
| `--reasoning-parser deepseek_r1` | Separates `<think>` reasoning from response content |
| `--override-generation-config` | temperature=0 prevents reasoning spirals |
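What `--reasoning-parser deepseek_r1` does to the raw model output can be illustrated with a minimal sketch. This shows the output contract (reasoning split from the answer at the `</think>` tag), not vLLM's actual parser:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split a <think>...</think> prefix from the final answer.

    Illustrative only: returns (reasoning_content, content) in the
    spirit of --reasoning-parser deepseek_r1.
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in text:
        return "", text
    head, _, tail = text.partition(close_tag)
    return head.removeprefix(open_tag).strip(), tail.strip()

reasoning, answer = split_reasoning(
    "<think>The capital of France is well known.</think>Paris."
)
print(reasoning)  # The capital of France is well known.
print(answer)     # Paris.
```

In the chat API response, these two pieces arrive as separate `reasoning_content` and `content` fields.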
## Performance
| Metric | Value |
|---|---|
| Model size | 12.1 GiB |
| Max context | 44,000 tokens |
| KV cache blocks | 2,750 |
| KV cache size | ~1.14 GiB |
| KV per block | 433,152 bytes |
| Architecture | MoE: 47 layers, 48 experts, top-4, 1 shared |
| Active params | ~3B per token |
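The expert routing summarized above (48 routed experts, top-4 per token, plus one always-active shared expert) can be sketched as follows; the gate logits and the renormalized-softmax weighting are illustrative conventions, not GLM's exact gating:

```python
import math

NUM_EXPERTS, TOP_K = 48, 4

def route(gate_logits: list[float]) -> list[tuple[int, float]]:
    """Pick the top-4 routed experts for one token, with softmax weights
    renormalized over the selected experts (a common MoE convention)."""
    top = sorted(range(NUM_EXPERTS), key=lambda i: gate_logits[i],
                 reverse=True)[:TOP_K]
    exp = [math.exp(gate_logits[i]) for i in top]
    z = sum(exp)
    return [(i, w / z) for i, w in zip(top, exp)]

# One token's (made-up) gate logits: four experts clearly preferred.
logits = [0.0] * NUM_EXPERTS
logits[7], logits[12], logits[30], logits[41] = 3.0, 2.0, 1.5, 1.0
chosen = route(logits)
print([i for i, _ in chosen])  # [7, 12, 30, 41]
# The shared expert runs for every token in addition to these four,
# which is why only ~3B of the 23B parameters are active per token.
```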
## Quantization details

- Method: NVIDIA Model Optimizer (`nvidia-modelopt==0.41.0`)
- Format: NVFP4 (4-bit floating point, `group_size=16`)
- Calibration: 512 samples across 4 task types (code, instruction, agentic, and structured data from databricks-dolly-15k)
- `lm_head` + `embed_tokens`: also NVFP4 (saves ~928 MB combined)
- Base model: THUDM/GLM-4.7-Flash-REAP-23B-A3B
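A simplified round-trip sketch of the format: each group of 16 values shares one scale, and each value is rounded to the nearest 4-bit E2M1 float (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). Real NVFP4 stores the group scales in FP8 plus a global scale, which this sketch omits:

```python
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # 4-bit float magnitudes
GROUP = 16

def quantize_group(vals: list[float]) -> tuple[float, list[float]]:
    """Quantize one group of 16 values: scale so the group's abs-max maps
    to 6.0 (the largest E2M1 magnitude), then round to the E2M1 grid."""
    amax = max(abs(v) for v in vals) or 1.0
    scale = amax / 6.0
    q = [math.copysign(min(E2M1, key=lambda g: abs(abs(v) / scale - g)), v)
         for v in vals]
    return scale, q

def dequantize(scale: float, q: list[float]) -> list[float]:
    return [scale * x for x in q]

vals = [0.1 * i for i in range(GROUP)]   # 0.0 .. 1.5
scale, q = quantize_group(vals)
print(scale, q)
print(dequantize(scale, q))
```

With only 8 magnitudes per group, accuracy hinges on the per-group scale tracking local dynamic range, which is why calibration data matters.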
## vLLM patches

The included `vllm_patches.diff` modifies 6 files based on vLLM commit 628302114:

| File | Change |
|---|---|
| `modelopt.py` | NVFP4 lm_head + embed_tokens support; `PerTensorScaleParameter` init fix (uninitialized slots caused `inf` in merged linears) |
| `vocab_parallel_embedding.py` | FP8 weight dtype preservation + BF16 cast in `embedding()` |
| `linear.py` | FP8 weight cast in the forward pass |
| `mla_attention.py` | `kv_b_proj` dtype guard for quantized layers (prevents a crash on NVFP4/INT4) |
| `glm4_moe_lite.py` | `name=None` guard in weight loading |
| `triton_mla.py` | FP8 KV cache support for the Triton MLA backend |
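The pattern behind the `kv_b_proj` dtype guard can be sketched as follows (hypothetical names; the real patch inspects torch weight dtypes inside vLLM): MLA's weight-absorption path reads the projection weights directly, which is only valid when they are an unquantized float type.

```python
# Sketch of the dtype-guard pattern (hypothetical names; the real patch
# checks torch dtypes on the kv_b_proj layer inside vLLM).
FLOAT_DTYPES = {"float16", "bfloat16", "float32"}

def can_absorb_weights(weight_dtype: str) -> bool:
    """Direct weight access is only valid for unquantized float weights;
    NVFP4/INT4 weights must go through the layer's own forward()."""
    return weight_dtype in FLOAT_DTYPES

for dtype in ("bfloat16", "nvfp4", "int4"):
    path = "absorb weights" if can_absorb_weights(dtype) else "layer.forward()"
    print(dtype, "->", path)
```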
## Upstream PRs

Some of these patches have been submitted upstream and may become unnecessary in future vLLM releases.