MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10

NVFP4 quantization of cerebras/MiniMax-M2.5-REAP-139B-A10B for NVIDIA DGX Spark (GB10).

The base model is a Cerebras REAP (Router-weighted Expert Activation Pruning) variant of MiniMaxAI/MiniMax-M2.5. REAP uniformly prunes experts from 256 → 154 (40% pruning), reducing total parameters from 230B to 139B while maintaining near-identical performance. This is the more aggressively pruned sibling of the 172B (25%) variant.

Model Details

Base Model: cerebras/MiniMax-M2.5-REAP-139B-A10B
Original Model: MiniMaxAI/MiniMax-M2.5 (230B)
Architecture: MiniMaxM2ForCausalLM (MoE, 154 experts, 8 active per token)
Total Parameters: 139B
Active Parameters: 10B per token
Hidden Layers: 62
Quantization: NVFP4 (4-bit floating point), all layers including self_attn
Format: compressed-tensors (safetensors), 17 shards
Size on Disk: 75 GB
Context Length: 196,608 tokens (~192K)
License: Modified MIT (inherited from Cerebras REAP)

Why 139B over 172B?

                     172B REAP                139B REAP
Expert pruning       25% (256 → 192)          40% (256 → 154)
NVFP4 size           93 GB                    75 GB
Single Spark fit     Tight (max ~65K ctx)     Comfortable (~90K+ ctx headroom)
Cerebras eval loss   Baseline                 ~0.5% degradation

The 139B variant trades minimal quality for more memory headroom on a single DGX Spark. With 75 GB of model weights instead of 93 GB, an additional ~18 GB is available for KV cache, which translates to longer achievable context or more concurrent sessions depending on how it's deployed.
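As a rough illustration of what that headroom buys with an FP8 KV cache: extra context scales inversely with per-token KV size. The attention dimensions below (KV-head count, head dim) are hypothetical placeholders for illustration, not values read from this model's config.json:

```python
# Back-of-envelope: how many extra KV-cache tokens ~18 GB buys.
# kv_heads and head_dim are ILLUSTRATIVE ASSUMPTIONS -- check config.json.
layers = 62      # hidden layers (from the model card)
kv_heads = 8     # assumed GQA KV-head count
head_dim = 128   # assumed head dimension
kv_bytes = 1     # FP8 KV cache: 1 byte per element

per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
extra_tokens = 18 * 1024**3 // per_token
print(per_token, extra_tokens)  # 126976 bytes/token, ~152K extra tokens
```

With the real head dimensions from config.json the constant changes, but the shape of the tradeoff (GB of headroom divided by bytes per token) is the same.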

Performance (Single NVIDIA DGX Spark — GB10, 128 GB)

TODO: Benchmark pending — model just quantized. Will update with llama-benchy results.

Expected: similar or slightly faster than 172B NVFP4 (27–29 tok/s) due to smaller model footprint.
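For context on those numbers, single-batch decode on a Spark is roughly memory-bandwidth-bound. A crude ceiling, assuming DGX Spark's quoted ~273 GB/s unified memory bandwidth (a spec figure, verify for your unit) and ignoring KV-cache reads and all overheads, can be sketched as:

```python
# Crude bandwidth-bound decode ceiling; real throughput lands well below it.
bw = 273e9             # assumed DGX Spark memory bandwidth, bytes/s (spec figure)
active_params = 10e9   # 10B active parameters per token
bytes_per_param = 0.5  # NVFP4 ~= 4 bits per weight
ceiling = bw / (active_params * bytes_per_param)
print(round(ceiling, 1))  # ~54.6 tok/s theoretical upper bound
```

Observed 27-29 tok/s on the 172B quant being roughly half this ceiling is consistent with kernel and scheduling overheads eating the rest.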

Quantization Details

  • Method: Post-training quantization via LLM Compressor (llmcompressor 0.10.0)
  • Scheme: NVFP4 (compressed-tensors format)
  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Calibration Samples: 64
  • Max Sequence Length: 2048 tokens
  • Ignore List: lm_head, model.embed_tokens, re:.*block_sparse_moe\.gate$
  • Environment: LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1
  • Hardware Used: NVIDIA DGX Spark (CPU offloading + 300GB swap)
  • Total Quantization Time: 4.7 hours (281 minutes)
    • Quant pipeline model load: 50 seconds (27 BF16 shards into CPU RAM — this is llmcompressor load, NOT vLLM inference)
    • Calibration forward passes + weight calibration (28,892 weights): ~2+ hours (swap-dominated)
    • Model compression: 28,892 iterations in ~60 minutes (highly variable 1–16 it/s due to swap I/O)
    • Model save: 17 shards to disk
    • Bottleneck: swap I/O throughout (260GB model on 128GB RAM + 300GB swap)
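The bullet points above amount to the following oneshot-style configuration. This is a summary of the recipe as a plain dict; the exact keyword names in llmcompressor 0.10.0 may differ, so treat it as a sketch of the settings, not the verbatim script:

```python
# Recipe summary as a plain dict -- NOT a verbatim llmcompressor call.
oneshot_args = {
    "model": "/workspace/input_model",            # dequantized BF16 shards
    "dataset": "HuggingFaceH4/ultrachat_200k",
    "splits": {"calibration": "train_sft"},
    "num_calibration_samples": 64,
    "max_seq_length": 2048,
    "scheme": "NVFP4",                            # compressed-tensors NVFP4
    "ignore": [
        "lm_head",
        "model.embed_tokens",
        "re:.*block_sparse_moe\\.gate$",          # router gates stay unquantized
    ],
    "output_dir": "/workspace/output_model",
}
```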

Quantization Pipeline

The source model on HuggingFace is labeled BF16 but actually contains float8_e4m3fn weights with per-block weight_scale_inv scales (block size [128, 128]). A dequantization step was therefore required before NVFP4 quantization:

  1. Download: cerebras/MiniMax-M2.5-REAP-139B-A10B (131GB, 27 shards — FP8)
  2. Dequant FP8 → BF16: Block-wise dequantization (multiply by scale_inv), output 260GB / 27 shards
  3. Quantize BF16 → NVFP4: LLM Compressor oneshot with GB10-optimized ignore list
  4. Output: 75GB / 17 shards (compressed-tensors format)
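The dequantization in step 2 can be sketched as follows; float32 ndarrays stand in for the actual FP8/BF16 dtypes here, and the real script walks the safetensors shards rather than toy arrays:

```python
import numpy as np

# Block-wise dequantization: each [128, 128] block of the FP8 weight is
# multiplied by its corresponding scalar from weight_scale_inv.
def dequantize_blockwise(weight, scale_inv, block=128):
    out = weight.astype(np.float32).copy()
    for i in range(scale_inv.shape[0]):
        for j in range(scale_inv.shape[1]):
            out[i * block:(i + 1) * block,
                j * block:(j + 1) * block] *= scale_inv[i, j]
    return out

# Toy example: a 256x256 weight carries a 2x2 grid of block scales.
w = np.ones((256, 256), dtype=np.float32)
s = np.array([[2.0, 0.5], [1.0, 4.0]], dtype=np.float32)
dq = dequantize_blockwise(w, s)
print(dq[0, 0], dq[0, 255], dq[255, 0])  # 2.0 0.5 1.0
```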

Ignore-list choice — attention stays quantized

This quant keeps all Linear layers quantized, including self_attn (q/k/v projections). Other public NVFP4 variants (e.g. lukealonso/MiniMax-M2.7-NVFP4) keep attention in BF16, which leaves ~47% of per-token decode bandwidth on the table. I picked the aggressive path because NVFP4 calibration appeared to handle attention weight distributions cleanly on this architecture, but it is a different design choice, and the BF16-attention variants may be the right pick for deployers who prefer that safety margin.

Container Setup for Quantization

# Image: avarok/dgx-vllm-nvfp4-kernel:v23 (has llmcompressor + deps)
# Override entrypoint since default launches vLLM server

docker run -d --name minimax-139b-quant \
  --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-BF16-real:/workspace/input_model \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:/workspace/output_model \
  -v /opt/huggingface/models/quantize-minimax-139b.py:/workspace/quantize.py \
  -e LLMCOMPRESSOR_MOE_CALIBRATE_ALL_EXPERTS=1 \
  --entrypoint bash \
  avarok/dgx-vllm-nvfp4-kernel:v23 \
  -c "pip install --upgrade transformers && python /workspace/quantize.py"

Important: The --entrypoint bash override is required because the default entrypoint launches vLLM. The pip install --upgrade transformers is needed because the image ships an older transformers that doesn't support MiniMax M2 architecture.

Swap Configuration

The 260GB BF16 model exceeds 128GB physical RAM. A 300GB swap file was created:

sudo fallocate -l 300G /opt/huggingface/swapfile
sudo chmod 600 /opt/huggingface/swapfile
sudo mkswap /opt/huggingface/swapfile
sudo swapon /opt/huggingface/swapfile

This causes significant I/O stalls during compression (speed drops from 16 it/s to 1 it/s when paging), but the process completes successfully.

Running on a Single DGX Spark

Docker image: avarok/dgx-vllm-nvfp4-kernel:v23 (vLLM 0.16.0-rc2, CUDA 13.0, SM 12.1)

Download the model:

huggingface-cli download saricles/MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10 \
  --local-dir /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4

Launch:

docker run -d --name minimax-139b --gpus all --ipc=host \
  -v /opt/huggingface/models/MiniMax-M2.5-REAP-139B-NVFP4:/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -p 8000:8000 \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e MODEL=/models/MiniMax-M2.5-REAP-139B-NVFP4 \
  -e PORT=8000 \
  -e MAX_MODEL_LEN=131072 \
  -e GPU_MEMORY_UTIL=0.93 \
  -e "VLLM_EXTRA_ARGS=--trust-remote-code --kv-cache-dtype fp8 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think" \
  avarok/dgx-vllm-nvfp4-kernel:v23

Note: With 75 GB of model weights (vs 93 GB for the 172B), MAX_MODEL_LEN can likely go higher; 131072 should be achievable. Benchmarks will confirm exact limits.

Test it:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5-REAP-139B-NVFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "max_tokens": 512
  }'

Environment Variables

  • VLLM_NVFP4_GEMM_BACKEND=marlin: use Marlin kernels for FP4 GEMM (FlashInfer JIT fails on Spark SM121a)
  • VLLM_TEST_FORCE_FP8_MARLIN=1: required for Marlin backend activation
  • VLLM_USE_FLASHINFER_MOE_FP4=0: disable FlashInfer for MoE FP4 (its JIT ninja build crashes)
  • VLLM_MARLIN_USE_ATOMIC_ADD=1: atomic adds for Marlin (stability on GB10)
  • GPU_MEMORY_UTIL=0.93: 0.95 OOMs on Spark; 0.93 is the safe max
  • --kv-cache-dtype fp8: FP8 KV cache saves memory and enables larger context
  • --attention-backend flashinfer: FlashInfer for attention (not MoE) works fine

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}

Comparison with other public M2.5 REAP NVFP4 quants

  • This repo (139B REAP, NVFP4): all Linear layers quantized incl. attention, 75 GB, tok/s TBD
  • saricles/MiniMax-M2.5-REAP-172B-A10B-NVFP4-GB10 (172B REAP, NVFP4): all Linear layers quantized incl. attention, 93 GB, 28 tok/s on a single Spark
  • lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4 (139B, NVFP4): expert MLPs only, attention kept in BF16, 79 GB, ~16 tok/s on a single Spark

The repos follow different ignore-list philosophies: lukealonso keeps attention in BF16 (safer, more conservative), while the GB10-targeted variants here quantize it (smaller and faster on this hardware, slightly more aggressive). Use whichever matches the tradeoff you prefer.

Related Models

Acknowledgments
