Update 4/12/26 - Calibration data updated; KL divergence (KLD) reduced by ~20%. More tuning to follow.
Model Description
MiniMax-M2.7-NVFP4 is an NVFP4-quantized version of MiniMaxAI/MiniMax-M2.7, a 230B-parameter Mixture-of-Experts language model with 10B active parameters.
The original model weights were converted from the official FP8 checkpoint to BF16, then quantized to NVFP4 (4-bit floating point with one FP8 scale per 16-element block) using NVIDIA Model Optimizer.
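As a rough illustration of the scheme, a per-block 4-bit quantizer can be sketched in pure Python. This is a simplification for intuition only: real NVFP4 stores each block scale in FP8 (E4M3), and Model Optimizer's actual implementation differs in detail.

```python
# Toy blockwise 4-bit quantizer in the spirit of NVFP4: E2M1 value grid,
# one scale per 16-element block. The scale is kept as a Python float here,
# whereas real NVFP4 stores it in FP8 (E4M3).

FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
BLOCK = 16

def quantize_block(block):
    """Scale the block so max |x| maps to 6.0, then round to the E2M1 grid."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

weights = [0.01 * i - 0.08 for i in range(BLOCK)]
scale, codes = quantize_block(weights)
recon = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(weights, recon))
```

Because the scale is chosen per 16-element block, one outlier weight only degrades the resolution of its own block rather than the whole tensor.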
What's quantized
Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. All other layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings.
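A back-of-envelope estimate shows why quantizing only the expert MLPs is enough for large savings. The ~90% expert parameter share below is an assumed figure for illustration, not a measured number for MiniMax-M2.7.

```python
# Rough memory estimate: NVFP4 experts + BF16 everything else.
# expert_frac is an illustrative assumption, not the model's actual split.
total_params = 230e9
expert_frac = 0.90            # assumed share of params in MoE expert MLPs

bf16_bytes = total_params * 2
# NVFP4: 4 bits per weight plus one FP8 scale (1 byte) per 16-element block
nvfp4_bytes_per_param = 0.5 + 1 / 16
quantized = (total_params * expert_frac * nvfp4_bytes_per_param
             + total_params * (1 - expert_frac) * 2)
print(f"BF16: {bf16_bytes/1e9:.0f} GB  ->  NVFP4 experts: {quantized/1e9:.0f} GB")
# BF16: 460 GB  ->  NVFP4 experts: 162 GB
```

Under these assumptions the checkpoint shrinks to roughly a third of its BF16 size, even with all attention and dense layers untouched.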
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. Because natural routing leaves rarely selected experts with few calibration tokens, calibration was run on a much larger number of samples than is typical, ensuring broad expert coverage through natural routing alone.
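The need for a large calibration run can be seen with a toy routing model: each token activates only k of E experts, so per-expert statistics accumulate at roughly k/E of the token rate, and real routers are skewed, making rare experts even slower to cover. A sketch with made-up numbers (the expert and top-k counts here are illustrative, not this model's configuration):

```python
# Toy natural top-k routing: count how many calibration tokens each
# expert actually sees. Uniform random logits are a best case; skewed
# real-world routing makes coverage harder, not easier.
import random

random.seed(0)
E, K = 64, 4  # illustrative expert count and top-k, not the real config

def route(logits, k=K):
    """Natural top-k routing: the k highest-scoring experts handle the token."""
    return sorted(range(E), key=lambda e: logits[e], reverse=True)[:k]

hits = [0] * E
n_tokens = 5000
for _ in range(n_tokens):
    logits = [random.gauss(0.0, 1.0) for _ in range(E)]
    for e in route(logits):
        hits[e] += 1

coverage = sum(1 for h in hits if h > 0) / E  # fraction of experts seen
min_hits = min(hits)                          # worst-covered expert
```

Each expert averages only n_tokens * K / E calibration tokens, which is why the sample count must scale well beyond what a dense model would need.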
Calibration dataset
Samples were drawn from a diverse mix of publicly available datasets spanning code generation, function/tool calling, multi-turn reasoning, math, and multilingual (English + Chinese) instruction following. System prompts were randomly varied across samples. The dataset was designed to broadly exercise the model's capabilities and activate diverse token distributions across expert modules.
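Assembling such a mix can be sketched as weighted sampling across domains with a randomized system prompt per sample. The domain names, proportions, and prompts below are placeholders, not the actual sources or weights used for this checkpoint.

```python
# Illustrative calibration-set builder: sample domains by weight and
# vary the system prompt per sample. All names and weights are placeholders.
import random

random.seed(0)
DOMAINS = {
    "code": 0.25, "tool_calling": 0.20, "reasoning": 0.20,
    "math": 0.15, "multilingual_instruct": 0.20,
}
SYSTEM_PROMPTS = [
    "You are a helpful assistant.",
    "You are an expert programmer.",
    "Answer concisely.",
]

def sample_calibration_set(n):
    names = list(DOMAINS)
    weights = [DOMAINS[d] for d in names]
    out = []
    for _ in range(n):
        domain = random.choices(names, weights=weights)[0]
        out.append({
            "system": random.choice(SYSTEM_PROMPTS),  # randomized per sample
            "domain": domain,
        })
    return out

samples = sample_calibration_set(1000)
```

Varying the system prompt matters because it shifts early-token activations, exercising expert routing paths that a single fixed prompt would never trigger.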
Quality
(pending)
You should always evaluate against your specific use case.
SGLang
Tested on 2x and 4x RTX Pro 6000 Blackwell.
Docker image: voipmonitor/sglang:cu130 (built by festr; includes the b12x backend)
Launch command:
export OMP_NUM_THREADS=16
export SGLANG_ENABLE_SPEC_V2=True
export NVIDIA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8 # GPUs visible to the server; adjust to your setup and --tp
python -m sglang.launch_server \
--model-path lukealonso/MiniMax-M2.7-NVFP4 \
--served-model-name MiniMax-M2.7 \
--reasoning-parser minimax \
--tool-call-parser minimax-m2 \
--tp 2 \
--enable-torch-compile \
--trust-remote-code \
--quantization modelopt_fp4 \
--kv-cache-dtype bf16 \
--moe-runner-backend b12x \
--fp4-gemm-backend b12x \
--attention-backend flashinfer \
--enable-pcie-oneshot-allreduce \
--mem-fraction-static 0.85 \
--host 0.0.0.0 --port 5000
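Once the server is up it exposes an OpenAI-compatible API on the configured host and port. A minimal client sketch using only the standard library; the model name matches --served-model-name in the launch command above.

```python
# Minimal chat-completions client for the SGLang server launched above.
# Assumes the server is reachable at localhost:5000.
import json
import urllib.request

SERVER = "http://localhost:5000"

def build_payload(prompt, max_tokens=256):
    return {
        "model": "MiniMax-M2.7",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_payload("Say hello in one word.")
```

Any OpenAI-compatible client (e.g. the official openai Python package pointed at the same base URL) works equally well.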
vLLM
(pending)