Model Description
MiniMax-M2.7-NVFP4 is an NVFP4-quantized version of MiniMaxAI/MiniMax-M2.7, a 230B-parameter Mixture-of-Experts language model with 10B active parameters.
Produced with NVIDIA TensorRT Model Optimizer (modelopt 0.43.0rc3) using the nvfp4_experts_only-fp8_kv recipe, the same recipe structure as nvidia/MiniMax-M2.5-NVFP4. Quantization runs directly from the official FP8 checkpoint via modelopt's on-the-fly FP8→BF16 dequant path (no separate BF16 master is kept), then requantizes the weights to NVFP4.
What's quantized
Only the MoE expert MLP layers (w1, w2, w3 — gate, up, down projections) are quantized to NVFP4 (W4A4, group size 16). Self-attention projections, the MoE router, embeddings, layernorms, and lm_head are left in BF16. The exclude_modules list in hf_quant_config.json matches nvidia/MiniMax-M2.5-NVFP4 exactly.
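To make the W4A4 layout concrete, here is an illustrative sketch of the NVFP4 numerics for a single group of 16 expert weights: E2M1 (FP4) elements sharing one group scale. This is a simplification for intuition only; the real modelopt kernels additionally store group scales in FP8 E4M3 with a second per-tensor scale, and use hardware rounding.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format used by NVFP4.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_group(w):
    """Quantize one group of 16 weights to FP4 with a shared group scale."""
    assert w.size == 16
    amax = np.abs(w).max()
    scale = amax / 6.0 if amax > 0 else 1.0   # map the group max onto FP4's max (6.0)
    mags = np.abs(w) / scale
    # round each magnitude to the nearest representable FP4 value
    idx = np.abs(mags[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(w) * FP4_VALUES[idx], scale

rng = np.random.default_rng(0)
w = rng.normal(size=16)
q, scale = quantize_nvfp4_group(w)
w_hat = q * scale                              # dequantized weights
max_err = np.abs(w - w_hat).max()
```

The small group size (16) is what keeps the per-group scale tight: the worst-case rounding error is bounded by one scale unit, and the group's largest weight is always reproduced exactly.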
The KV cache is quantized to FP8 E4M3 with statically calibrated scales baked into the safetensors as per-layer k_proj.k_scale / v_proj.v_scale tensors (62 of each). This avoids vLLM's silent scale=1.0 fallback when KV scales are missing from the checkpoint.
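A minimal sketch of what a statically calibrated FP8 E4M3 KV scale does, assuming the common amax-based convention (scale = amax / 448, where 448 is the largest finite E4M3 value). This models only the scaling and saturation, not E4M3 rounding:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def calibrate_kv_scale(calib):
    """Static scale: map the calibration amax onto the E4M3 range."""
    return float(np.abs(calib).max() / E4M3_MAX)

def fp8_quant(x, scale):
    # real kernels also round to the nearest E4M3 value; this models
    # only the scaling and saturation behaviour
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX)

calib = np.random.default_rng(1).normal(scale=3.0, size=4096)
k_scale = calibrate_kv_scale(calib)
q = fp8_quant(calib, k_scale)
recovered = q * k_scale
```

The fallback this checkpoint avoids corresponds to scale = 1.0: any K/V activation with magnitude above 448 would silently saturate instead of being rescaled into range.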
Calibration
- Recipe: nvfp4_experts_only-fp8_kv
- Dataset: cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (same as NVIDIA's M2.5 quant)
- Samples: 1024
- Full expert coverage: calibration is run with --moe_calib_experts_ratio 1.0, which sets the MoE top-k to 256 during calibration so every token activates every expert in every layer. This guarantees all 256 × 62 = 15,872 experts receive a non-zero amax regardless of dataset routing, and eliminates the cold-expert failure mode (spurious spaces at subword boundaries, e.g. "mpg_ high", "SPEC. md") seen when experts are calibrated via natural top-k routing alone.
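The cold-expert problem above can be seen in a toy simulation. Under skewed router logits (an assumption for illustration; the top-k of 8 is also illustrative), natural top-k routing leaves some experts unselected by every calibration token, while forcing top-k to the full expert count covers all of them:

```python
import numpy as np

NUM_EXPERTS, TOKENS = 256, 1024
NATURAL_TOP_K = 8                 # illustrative top-k; the real router's k differs

rng = np.random.default_rng(0)
# skewed router logits: a handful of "hot" experts dominate natural routing
logits = rng.normal(size=(TOKENS, NUM_EXPERTS)) + np.linspace(6.0, 0.0, NUM_EXPERTS)

def experts_hit(logits, k):
    """Boolean mask of experts selected by at least one token under top-k routing."""
    topk = np.argsort(logits, axis=1)[:, -k:]
    hit = np.zeros(logits.shape[1], dtype=bool)
    hit[np.unique(topk)] = True
    return hit

natural = experts_hit(logits, NATURAL_TOP_K)   # cold experts collect no amax
full = experts_hit(logits, NUM_EXPERTS)        # top-k forced to 256: every expert hit
```

An expert that is never hit ends calibration with a zero amax, which is exactly the condition --moe_calib_experts_ratio 1.0 rules out.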
Hardware
NVFP4 requires native FP4 tensor cores, available only on NVIDIA Blackwell (B100/B200, RTX Pro 6000, DGX Spark GB10). It is not supported on H100/H200 or older GPUs.
Quality
Smoke-tested with vLLM v0.19.0 on 2x RTX Pro 6000 (SM120): prose generation and single-turn tool calling both pass, with no spurious-space tokens observed. Always evaluate against your specific use case.
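When running your own smoke tests, a cheap heuristic for the spurious-space artifact described under Calibration is to scan generated text for a space injected after an underscore or before a short file extension. The regex below is an assumption of mine, not part of any official test harness, and will flag some ordinary prose:

```python
import re

# Heuristic for broken subword boundaries like "mpg_ high" or "SPEC. md":
# a space right after an underscore, or a period-space followed by a short
# lowercase token (a likely detached file extension).
SPURIOUS = re.compile(r"\w_ \w|\. [a-z]{1,4}\b(?=\s|$)")

def looks_corrupted(text: str) -> bool:
    return bool(SPURIOUS.search(text))
```

Treat a hit as a prompt for manual inspection, not proof of a bad quant.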
Usage
vLLM
Tested on vllm/vllm-openai:v0.19.0-cu130 with 2x RTX Pro 6000 Blackwell:
```shell
vllm serve demon-zombie/MiniMax-M2.7-NVFP4 \
  --trust-remote-code \
  --served-model-name MiniMax-M2.7 \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce
```
--disable-custom-all-reduce is required on RTX Pro 6000 (SM120) with TP > 1: vLLM's custom CUDA all-reduce hangs after weight load (workers spin at 100% SM with zero memory traffic). It is not needed on B200. Do not pass --enforce-eager; CUDA graphs capture cleanly on v0.19.0.
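Once the server is up, it speaks the OpenAI-compatible API. A sketch of a chat-completions payload exercising tool calling (the get_weather tool schema is made up for illustration; port 8000 is vLLM's default):

```python
import json

# Example request body for the server launched above; the model name
# matches --served-model-name from the serve command.
payload = {
    "model": "MiniMax-M2.7",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
body = json.dumps(payload)
# POST body to http://localhost:8000/v1/chat/completions
```

With --enable-auto-tool-choice and the minimax_m2 parsers set, the server returns structured tool_calls rather than raw tool-call text.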
SGLang
(pending)