Model Description

MiniMax-M2.7-NVFP4 is an NVFP4-quantized version of MiniMaxAI/MiniMax-M2.7, a 230B-parameter Mixture-of-Experts language model with 10B active parameters.

This checkpoint was produced with NVIDIA TensorRT Model Optimizer (modelopt 0.43.0rc3) using the nvfp4_experts_only-fp8_kv recipe, the same recipe structure as nvidia/MiniMax-M2.5-NVFP4. The base model is quantized directly from its official FP8 checkpoint via modelopt's on-the-fly FP8→BF16 dequant path (no separate BF16 master checkpoint is needed), then requantized to NVFP4.

What's quantized

Only the MoE expert MLP layers (w1, w2, w3 — gate, up, down projections) are quantized to NVFP4 (W4A4, group size 16). Self-attention projections, the MoE router, embeddings, layernorms, and lm_head are left in BF16. The exclude_modules list in hf_quant_config.json matches nvidia/MiniMax-M2.5-NVFP4 exactly.
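To make the W4A4 scheme concrete, here is a minimal numpy sketch of per-group (group size 16) FP4 E2M1 fake-quantization. This is illustrative only: the real modelopt kernels pack 4-bit values and use FP8 per-group scale factors plus a second-level per-tensor scale, both omitted here.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format used by NVFP4.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quantize(w, group=16):
    """Fake-quantize a flat weight vector in groups of 16 with per-group scales."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / E2M1[-1]  # per-group amax / 6
    scale = np.where(scale == 0.0, 1.0, scale)               # guard all-zero groups
    # Snap each scaled magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(w / scale)[..., None] - E2M1).argmin(axis=-1)
    return np.sign(w) * E2M1[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64)
deq = nvfp4_fake_quantize(w)  # shape (4, 16): four groups of 16
```

Each group's largest-magnitude weight round-trips exactly (it maps to ±6 by construction), which is why the per-group scale is derived from amax.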

The KV cache is quantized to FP8 E4M3 with statically calibrated scales baked into the safetensors as per-layer k_proj.k_scale / v_proj.v_scale tensors (62 of each). This avoids vLLM's silent scale=1.0 fallback when KV scales are missing from the checkpoint.
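The static-scale arithmetic is simple; a sketch, assuming the usual amax / E4M3_MAX convention (the hypothetical amax of 12.5 is illustrative, and E4M3 mantissa rounding is omitted):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def static_kv_scale(calib_amax: float) -> float:
    """Per-layer scale chosen so the calibrated amax maps onto E4M3_MAX."""
    return calib_amax / E4M3_MAX

# Illustrative: a layer whose key activations peaked at |k| = 12.5 in calibration.
scale = static_kv_scale(12.5)
k = np.array([12.5, -3.0, 0.25, 20.0])       # 20.0 exceeds the calibrated range
q = np.clip(k / scale, -E4M3_MAX, E4M3_MAX)  # values past the range saturate
deq = q * scale
```

Note the saturation behavior: an activation above the calibrated amax clips to the range edge rather than overflowing, so a badly missing scale (vLLM's silent scale=1.0 fallback) degrades every token, while a calibrated scale only clips outliers.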

Calibration

  • Recipe: nvfp4_experts_only-fp8_kv
  • Dataset: cnn_dailymail + nvidia/Nemotron-Post-Training-Dataset-v2 (same as nvidia's M2.5 quant)
  • Samples: 1024
  • Full expert coverage: calibration is run with --moe_calib_experts_ratio 1.0, which sets the MoE top-k to 256 during calibration so every token activates every expert in every layer. This guarantees all 256 × 62 = 15,872 experts receive a non-zero amax regardless of dataset routing, and eliminates the cold-expert failure mode (spurious spaces at subword boundaries, e.g. "mpg_ high", "SPEC. md") seen when experts are calibrated via natural top-k routing alone.
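The coverage argument above can be illustrated with a toy router (hypothetical sizes, numpy argsort standing in for the real top-k routing): forcing top-k equal to the expert count guarantees every expert sees tokens, while natural routing only covers whatever the dataset happens to activate.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, num_tokens = 8, 6  # toy sizes; the real model has 256 experts/layer
logits = rng.normal(size=(num_tokens, num_experts))

def covered_experts(top_k: int) -> int:
    """Count experts that receive at least one token under top-k routing,
    i.e. experts whose calibration amax would be non-zero."""
    chosen = np.argsort(-logits, axis=1)[:, :top_k]
    return len(np.unique(chosen))

natural = covered_experts(top_k=2)         # routing-dependent, may leave cold experts
full = covered_experts(top_k=num_experts)  # analogue of --moe_calib_experts_ratio 1.0
```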

Hardware

NVFP4 requires native FP4 tensor cores — NVIDIA Blackwell (B100 / B200, RTX Pro 6000, DGX Spark GB10). Not supported on H100/H200 or older.

Quality

Smoke-tested with vLLM v0.19.0 on 2x RTX Pro 6000 (SM120): prose generation and single-turn tool calling both pass, with no spurious-space tokens observed. You should always evaluate against your specific use case.

Usage

vLLM

Tested on vllm/vllm-openai:v0.19.0-cu130 with 2x RTX Pro 6000 Blackwell:

vllm serve demon-zombie/MiniMax-M2.7-NVFP4 \
  --trust-remote-code \
  --served-model-name MiniMax-M2.7 \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce

--disable-custom-all-reduce is required on RTX Pro 6000 (SM120) with TP > 1 — vLLM's custom CUDA all-reduce hangs post-weight-load (workers spin at 100% SM with 0 memory traffic). Not needed on B200. Do not pass --enforce-eager; CUDA graphs capture cleanly on v0.19.0.
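Once serving, the endpoint speaks the OpenAI-compatible chat-completions API. A minimal stdlib-only client sketch (the base URL, port, and prompt are illustrative assumptions; tool-call and reasoning parsing happen server-side via the minimax_m2 parsers):

```python
import json
import urllib.request

# "MiniMax-M2.7" must match the --served-model-name passed to vllm serve.
payload = {
    "model": "MiniMax-M2.7",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 128,
}

def chat(base_url="http://localhost:8000"):
    """POST a chat completion to the vLLM server (assumes the default port)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```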

SGLang

(pending)
