GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed

NVFP4-quantized GLM-4.7-Flash-REAP-23B-A3B: a 23B-parameter MoE reasoning model (3B active) whose 12.1 GiB weights fit on a 16 GB card, running a 44K-token context on a single RTX 5080.

What's special

  • Full NVFP4: all Linear layers, lm_head, and embed_tokens quantized to NVFP4
  • FP8 KV cache: MLA attention with 512-dim compressed KV, stored in FP8
  • 44,000-token context on 16 GB VRAM (vs ~11K without lm_head/embed quantization)
  • 12.1 GiB single-file model.safetensors
  • Thinking/reasoning support (<think> tokens) with --reasoning-parser deepseek_r1

Hardware

  • GPU: NVIDIA RTX 5080 (16 GB VRAM, SM 12.0 / Blackwell)
  • NVFP4 GEMM: Marlin backend (FlashInfer FP4 JIT fails on SM 12.0)
  • MLA backend: Triton MLA (FLASHINFER_MLA requires qk_nope_head_dim=128, this model has 192)

Should work on any Blackwell GPU (RTX 50-series, B-series datacenter). Context length scales with available VRAM.
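As a rough sketch of that scaling, assuming the MLA FP8 KV layout used by this model (47 layers, a 576-dim compressed latent per token, 1 byte per element, 16-token blocks; the helper function is illustrative, not part of this repo):

```python
# Sketch: estimate max context length from the VRAM left for KV cache,
# given this model's MLA FP8 KV layout. Constants match the serve script below.
LAYERS, KV_DIM, BYTES_PER_ELEM, BLOCK_TOKENS = 47, 576, 1, 16
BYTES_PER_BLOCK = LAYERS * KV_DIM * BYTES_PER_ELEM * BLOCK_TOKENS  # 433,152

def max_context_tokens(kv_budget_bytes: int) -> int:
    """Tokens of KV cache that fit in the budget, rounded down to whole blocks."""
    return (kv_budget_bytes // BYTES_PER_BLOCK) * BLOCK_TOKENS

# ~1.19 GB of KV budget (2,750 blocks) on a 16 GB RTX 5080 after weights:
print(max_context_tokens(2750 * BYTES_PER_BLOCK))  # 44000
```

A card with more free VRAM simply affords more blocks, so the same formula gives the larger `--max-model-len` to pass at serve time.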

Quick start

1. Clone and build vLLM

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 628302114  # v0.16.1rc1.dev34

# Apply patches (required for NVFP4 lm_head, embed_tokens, MLA FP8 KV cache)
git apply ../GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed/vllm_patches.diff

# Build
pip install -e . --no-build-isolation

2. Install dependencies

pip install flashinfer==0.6.4  # or latest compatible

3. Serve

export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True
export VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=$((64 * 1024 * 1024))

BLOCKS=2750
BYTES_PER_BLOCK=433152  # 47 layers x 576-dim MLA latent x 1 byte (FP8) x 16 tokens/block
MAX_MODEL_LEN=$((BLOCKS * 16))  # 44,000

vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed \
    --trust-remote-code \
    --dtype bfloat16 \
    --quantization modelopt \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len $MAX_MODEL_LEN \
    --no-enable-prefix-caching \
    --max-num-seqs 1 \
    --kv-cache-memory-bytes $((BLOCKS * BYTES_PER_BLOCK)) \
    --num-gpu-blocks-override $BLOCKS \
    --tensor-parallel-size 1 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 256 \
    --enforce-eager \
    --reasoning-parser deepseek_r1 \
    --override-generation-config '{"temperature": 0.0, "max_tokens": 44000}'

4. Test

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed",
       "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
  | python3 -m json.tool
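The same request from Python, using only the standard library. With --reasoning-parser deepseek_r1, the server returns the <think> text in a separate reasoning_content field alongside content; this hedged sketch (the chat helper is illustrative, not part of this repo) reads both:

```python
# Sketch: POST a chat completion and split the deepseek_r1-parsed reasoning
# from the final answer. Assumes the server from step 3 is running locally.
import json
import urllib.request

def chat(prompt, base_url="http://localhost:8000/v1"):
    """Return (reasoning_content, content) for a single-turn chat request."""
    body = json.dumps({
        "model": "GLM-4.7-Flash-REAP-23B-A3B-NVFP4-NVembed",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.load(resp)["choices"][0]["message"]
    return msg.get("reasoning_content"), msg["content"]

if __name__ == "__main__":
    reasoning, answer = chat("What is the capital of France?")
    print(answer)
```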

Serve command flags explained

| Flag | Why |
|---|---|
| --trust-remote-code | Required for the GLM4 MoE architecture |
| --quantization modelopt | NVFP4 weight format from NVIDIA Model Optimizer |
| --kv-cache-dtype fp8_e4m3 | FP8 KV cache; halves KV memory vs BF16 |
| --no-enable-prefix-caching | Prefix caching is incompatible with MLA + FP8 KV |
| --max-num-seqs 1 | Single sequence; 16 GB VRAM budget |
| --kv-cache-memory-bytes | Precise KV allocation (bypasses the gpu_memory_utilization startup check) |
| --num-gpu-blocks-override | Binary-searched maximum stable block count for the RTX 5080 |
| --enforce-eager | Disables CUDA graphs; saves VRAM |
| --reasoning-parser deepseek_r1 | Separates <think> reasoning from response content |
| --override-generation-config | temperature=0 prevents reasoning spirals |

Performance

| Metric | Value |
|---|---|
| Model size | 12.1 GiB |
| Max context | 44,000 tokens |
| KV cache blocks | 2,750 |
| KV cache size | ~1.11 GiB (2,750 × 433,152 bytes) |
| KV per block | 433,152 bytes |
| Architecture | MoE: 47 layers, 48 experts, top-4, 1 shared |
| Active params | ~3B per token |

Quantization details

  • Method: NVIDIA Model Optimizer (nvidia-modelopt==0.41.0)
  • Format: NVFP4 (4-bit floating point, group_size=16)
  • Calibration: 512 samples across 4 task types (code, instruction, agentic, structured data from databricks-dolly-15k)
  • lm_head + embed_tokens: also NVFP4 (saves ~928 MB combined)
  • Base model: THUDM/GLM-4.7-Flash-REAP-23B-A3B
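For intuition on the format's footprint: NVFP4 packs two 4-bit values per byte and adds one FP8 scale per 16-element group plus a per-tensor FP32 scale, so a weight tensor shrinks to roughly 28% of its BF16 size. A minimal sketch (the helper functions are illustrative; exact savings for lm_head/embed_tokens depend on the actual tensor shapes):

```python
# Sketch: effective storage cost of NVFP4 vs BF16 for an n-element tensor.
GROUP_SIZE = 16

def nvfp4_bytes(n_elements):
    packed = n_elements * 0.5                 # two 4-bit values per byte
    scales = n_elements / GROUP_SIZE * 1.0    # one FP8 (1-byte) scale per group
    return packed + scales + 4                # + one FP32 per-tensor scale

def bf16_bytes(n_elements):
    return n_elements * 2.0                   # 2 bytes per element

n = 1024 * 1024
print(nvfp4_bytes(n) / bf16_bytes(n))  # ~0.28, i.e. roughly 3.6x smaller
```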

vLLM patches

The included vllm_patches.diff modifies 6 files based on vLLM commit 628302114:

| File | Change |
|---|---|
| modelopt.py | NVFP4 lm_head + embed_tokens support; PerTensorScaleParameter init fix (uninitialized slots caused inf in merged linears) |
| vocab_parallel_embedding.py | FP8 weight dtype preservation + BF16 cast in embedding() |
| linear.py | FP8 weight cast in the forward pass |
| mla_attention.py | kv_b_proj dtype guard for quantized layers (prevents a crash on NVFP4/INT4) |
| glm4_moe_lite.py | name=None guard in weight loading |
| triton_mla.py | FP8 KV cache support for the Triton MLA backend |

Upstream PRs

Some of these patches have been submitted upstream and may become unnecessary in future vLLM releases:

  • #35576 - MLA weight access crash fix for quantized layers
  • #35660 - NVFP4-quantized lm_head support