MiniMax-M2.7-MXFP4

MXFP4 quantization of MiniMax-M2.7 (228B params, 62 layers, 256 experts/layer, top-8 sigmoid routing).

All 15,872 MoE expert weights (62 layers × 256 experts) are quantized to MXFP4. Attention, layer norms, embeddings, and router weights are kept at original precision.

|                         | Base (FP8) | MXFP4 |
|-------------------------|------------|-------|
| Size                    | 215 GB     | 119 GB |
| Perplexity (WikiText-2) | 4.997      | 5.063 (+1.34%) |
| KL divergence           | --         | 0.174 nats/tok (mean), 0.031 (median) |
| Top-1 agreement         | --         | 85.8% |
| Compression             | 1x         | 1.81x |

Quality Analysis

Evaluation plots

KLD is heavily right-skewed: median KLD is 0.031 nats/tok (5.6x lower than the mean). 96.6% of tokens have KLD < 1 nat. Only 69 out of 2048 eval tokens show significant divergence -- these are low-confidence positions where the model is already distributing probability across many candidates.

Error is diffuse across experts: per-expert quantization error analysis of all 15,872 experts shows extremely uniform error (std=0.000271, range 0.110--0.116). The 256-expert top-8 architecture is inherently quantization-tolerant -- each expert contributes ~1/8th of the output, so MXFP4 errors average out across the mixture.
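This averaging effect can be illustrated with a toy simulation (not the model's actual error analysis): assuming each expert carries independent quantization noise at the ~0.11 RMS level reported above, and approximating the top-8 mixture as an equal-weight average, the mixture's noise shrinks by roughly 1/√8:

```python
import numpy as np

# Hypothetical setup: per-expert noise RMS matching the reported 0.110-0.116
# range; independence and equal routing weights are simplifying assumptions.
rng = np.random.default_rng(0)
n_tokens, d_model, top_k = 20_000, 64, 8
expert_rms = 0.113

single_expert_noise = rng.normal(0.0, expert_rms, (n_tokens, d_model))
mixture_noise = rng.normal(0.0, expert_rms, (n_tokens, top_k, d_model)).mean(axis=1)

print(f"single expert RMS: {single_expert_noise.std():.4f}")  # ~0.113
print(f"top-8 mixture RMS: {mixture_noise.std():.4f}")        # ~0.113 / sqrt(8)
```

With sigmoid-weighted (unequal) routing weights the reduction is smaller than 1/√8 but the qualitative point stands: independent per-expert errors partially cancel in the mixture.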

Format

MXFP4 block-32 quantization in compressed-tensors format:

- `weight_packed`: uint8 `[out, in//2]` -- two 4-bit values packed per byte (even index = low nibble, odd index = high nibble)
- `weight_scale`: uint8 e8m0 `[out, in//32]` -- one shared exponent per block of 32 elements
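Unpacking this layout takes a few lines of NumPy. A minimal sketch, assuming the FP4 E2M1 value grid and the e8m0 bias of 127 from the OCP MX convention (function name is illustrative, not part of this repo):

```python
import numpy as np

# FP4 E2M1 values: codes 0-7 are +{0, .5, 1, 1.5, 2, 3, 4, 6}, codes 8-15 their negatives
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def dequantize_mxfp4(weight_packed: np.ndarray, weight_scale: np.ndarray) -> np.ndarray:
    """weight_packed: uint8 [out, in//2]; weight_scale: uint8 e8m0 [out, in//32]."""
    out, half = weight_packed.shape
    codes = np.empty((out, half * 2), dtype=np.uint8)
    codes[:, 0::2] = weight_packed & 0x0F   # even elements live in the low nibble
    codes[:, 1::2] = weight_packed >> 4     # odd elements in the high nibble
    vals = FP4_GRID[codes]                  # [out, in]
    scales = np.exp2(weight_scale.astype(np.float32) - 127.0)  # e8m0: 2^(byte - 127)
    return (vals.reshape(out, -1, 32) * scales[:, :, None]).reshape(out, -1)
```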

Quantization is calibration-free: MXFP4 block-32 scaling is deterministic, with each block's shared exponent derived directly from the maximum magnitude of its 32 elements.
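That deterministic scale rule can be sketched as follows (assuming the common OCP MX convention of shared exponent = floor(log2(block max)) minus the FP4 element emax of 2 -- this is a sketch, not the exact quant4 implementation):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def quantize_block(block: np.ndarray):
    """Quantize 32 floats to FP4 codes plus one shared e8m0 exponent byte."""
    max_abs = float(np.abs(block).max())
    # Choose the exponent so the block max lands in the FP4 range (|x| <= 6 = 2^2 * 1.5)
    exp = int(np.floor(np.log2(max_abs))) - 2 if max_abs > 0 else -127
    scaled = block / np.exp2(exp)
    # Round-to-nearest against the 16-entry FP4 grid
    codes = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), np.uint8(exp + 127)  # e8m0 stores exp with bias 127
```

No calibration pass is needed because the scale depends only on the weights themselves, never on activations.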

Quantized with quant4.

Serving

vLLM

Requires vLLM with MXFP4 compressed-tensors support and the CUTLASS FP4xFP8 kernel for Blackwell GPUs:

vllm serve /path/to/MiniMax-M2.7-MXFP4 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-num-seqs 512 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 16384 \
    --kv-cache-dtype fp8

Memory Budget

At 119 GB, this fits on 2x DGX Spark (2x 120 GB = 240 GB total) with ~100 GB remaining for KV cache, enabling long-context or multi-session serving that would be impossible with the 215 GB FP8 original.

Evaluation Details

Evaluated on WikiText-2 test set (2048 tokens) using layer-by-layer streaming inference with MiniMaxLayerRunner. Both models run identical forward passes; logits compared token-by-token.
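The per-token comparison metrics follow the standard definitions and can be reproduced from the two models' logits with a few lines of NumPy (a sketch, not the exact evaluation harness):

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def compare_logits(ref_logits: np.ndarray, tgt_logits: np.ndarray):
    """Per-token KL(ref || target) in nats, plus top-1 agreement rate."""
    lp, lq = log_softmax(ref_logits), log_softmax(tgt_logits)
    kl = (np.exp(lp) * (lp - lq)).sum(axis=-1)  # [n_tokens]
    top1 = (ref_logits.argmax(-1) == tgt_logits.argmax(-1)).mean()
    return kl, top1
```

The mean/median/P95/P99 figures in the table below are then just percentiles of the per-token `kl` array.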

| Metric | Value |
|--------|-------|
| Perplexity (ref) | 4.997 |
| Perplexity (MXFP4) | 5.063 |
| PPL degradation | +1.34% |
| KL(ref\|\|target) mean | 0.174 nats/tok |
| KL(ref\|\|target) median | 0.031 nats/tok |
| KL(ref\|\|target) P95 | 0.824 nats/tok |
| KL(ref\|\|target) P99 | 1.827 nats/tok |
| Top-1 agreement | 85.8% |
| Tokens with KLD > 1 nat | 69 / 2048 (3.4%) |

Acknowledgments

Based on MiniMax-M2.7 by MiniMax. Original model license applies.
