---
base_model:
  - XiaomiMiMo/MiMo-V2.5
---

# MiMo-V2.5-NVFP4

**IMPORTANT:** You must use the Docker image below, since it contains many custom kernels written specifically for this model.

**Updated 5/4/26** - New calibration data, new Docker image, and full 2x RTX PRO 6000 support.

## Model Description

MiMo-V2.5-NVFP4 is an NVFP4-quantized version of XiaomiMiMo/MiMo-V2.5.

This is a multi-modal model, supporting text, images, audio and video. This quantization carefully preserves those capabilities.

## What's quantized

Only the non-shared MoE expert MLP projections are quantized to NVFP4. Attention weights are kept in BF16, as are the dense MLPs (layers 0-3) and the shared experts. Since the routed expert weights constitute the vast majority of parameters in an MoE architecture, this still yields significant memory savings.
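
As a back-of-the-envelope sketch of why quantizing only the experts still pays off (the parameter counts below are hypothetical placeholders, not MiMo-V2.5's real sizes; NVFP4 is modeled as 4-bit values plus one FP8 scale per 16-element block):

```python
# Rough memory-savings estimate for expert-only NVFP4 quantization.
# All parameter counts are HYPOTHETICAL placeholders chosen to show
# the shape of the arithmetic, not the real MiMo-V2.5 sizes.
BF16_BYTES = 2.0
NVFP4_BYTES = 0.5 + 1.0 / 16  # 4-bit value + one FP8 scale per 16 elems

expert_params = 200e9  # assumed: routed-expert MLP weights
other_params = 30e9    # assumed: attention, dense MLPs, shared experts

before = (expert_params + other_params) * BF16_BYTES
after = expert_params * NVFP4_BYTES + other_params * BF16_BYTES

print(f"{before / 1e9:.0f} GB -> {after / 1e9:.0f} GB "
      f"({1 - after / before:.1%} smaller)")
```

Because the experts dominate the parameter count, shrinking only them still cuts total weight memory by well over half in this toy example.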

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate for the uneven expert coverage this produces, calibration was run on a much larger number of samples than is typical, ensuring broad expert coverage through natural routing alone.
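
As a rough illustration of the idea (toy numbers and a random NumPy gate standing in for the real router; nothing here is MiMo-specific), natural top-k routing feeds each expert only the tokens the router actually sends it, and with enough samples every expert still gets covered:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # toy values, far smaller than a real MoE
TOP_K = 2
NUM_TOKENS = 10_000
HIDDEN = 16

# Hypothetical router: a random linear gate producing per-expert logits.
tokens = rng.standard_normal((NUM_TOKENS, HIDDEN))
gate = rng.standard_normal((HIDDEN, NUM_EXPERTS))
logits = tokens @ gate

# Natural top-k routing: each token reaches only the K experts the
# router selects, mirroring inference-time token distributions.
topk = np.argsort(logits, axis=1)[:, -TOP_K:]

# Collect per-expert calibration statistics (here: max |activation|,
# a stand-in for the amax used to derive quantization scales).
amax = np.zeros(NUM_EXPERTS)
counts = np.zeros(NUM_EXPERTS, dtype=int)
for e in range(NUM_EXPERTS):
    mask = (topk == e).any(axis=1)
    counts[e] = mask.sum()
    if counts[e]:
        amax[e] = np.abs(tokens[mask]).max()

print("tokens per expert:", counts)
print("all experts covered:", bool((counts > 0).all()))
```

Forcing all experts to activate would instead give every expert identical input statistics, which is exactly what this calibration scheme avoids.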

## Calibration dataset

Six calibration passes were run:

  1. Coding — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
  2. Broad — Large-scale diverse samples drawn from WildChat-NonToxic and LMSYS-Chat covering real user conversations across a wide range of topics and languages.
  3. Deep — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns.
  4. Image — Image question-answering prompts, with the input images drawn from a large collection of public, high quality image datasets.
  5. Audio — A medium-sized dataset consisting mostly of speech.
  6. Video — Diverse set of video question-answering prompts, with a wide variety of input videos of different durations and resolutions.

## Requirements

The NVFP4 variant of this model is currently only supported on the RTX PRO 6000 Blackwell (SM120), due to the large number of custom kernels that had to be written to support it.

- Minimum: 2x RTX PRO 6000 Blackwell 96GB
- Recommended: 4x RTX PRO 6000 Blackwell 96GB

## Community Testing

Note: you will want to modify these commands to bind-mount your Hugging Face cache into the container; otherwise the model is re-downloaded on every run.
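
For example, flags like these (inserted before the image name in either command below) persist the cache across runs. The container-side path is an assumption about where the image keeps its Hugging Face cache; adjust as needed:

```shell
# Sketch: bind-mount the host HF cache into the container.
# The container path here is an assumption; point HF_HOME at
# wherever you actually mount the cache.
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-e HF_HOME=/root/.cache/huggingface \
```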

4x RTX6K:

```shell
docker run --rm -it \
    --name sglang-mimo-v25 \
    --gpus '"device=0,1,2,3"' \
    --ipc=host \
    --network host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e OMP_NUM_THREADS=16 \
    -e SAFETENSORS_FAST_GPU=1 \
    -e CUTE_DSL_ARCH="sm_120a" \
    docker.io/lukealonso/sglang-cuda13-b12x \
    python -m sglang.launch_server \
      --model-path lukealonso/MiMo-V2.5-NVFP4 \
      --served-model-name "MiMo-V2.5" \
      --tp-size 4 \
      --page-size 64 \
      --host 0.0.0.0 \
      --port 8000 \
      --kv-cache-dtype fp8_e4m3 \
      --mem-fraction-static 0.85 \
      --swa-full-tokens-ratio 0.3 \
      --chunked-prefill-size 8192 \
      --speculative-algorithm EAGLE \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4 \
      --enable-pcie-oneshot-allreduce \
      --enable-multi-layer-eagle \
      --reasoning-parser mimo \
      --tool-call-parser mimo \
      --max-running-requests 8 \
      --moe-runner-backend b12x \
      --attention-backend b12x \
      --mm-attention-backend b12x \
      --fp4-gemm-backend b12x
```

2x RTX6K:

```shell
docker run --rm -it \
    --name sglang-mimo-v25 \
    --gpus '"device=0,1"' \
    --ipc=host \
    --network host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e OMP_NUM_THREADS=16 \
    -e SAFETENSORS_FAST_GPU=1 \
    -e CUTE_DSL_ARCH="sm_120a" \
    docker.io/lukealonso/sglang-cuda13-b12x \
    python -m sglang.launch_server \
      --model-path lukealonso/MiMo-V2.5-NVFP4 \
      --served-model-name "MiMo-V2.5" \
      --tp-size 2 \
      --page-size 64 \
      --host 0.0.0.0 \
      --port 8000 \
      --kv-cache-dtype fp8_e4m3 \
      --mem-fraction-static 0.95 \
      --swa-full-tokens-ratio 0.3 \
      --chunked-prefill-size 2048 \
      --enable-pcie-oneshot-allreduce \
      --enable-multi-layer-eagle \
      --reasoning-parser mimo \
      --tool-call-parser mimo \
      --context-length 131072 \
      --cuda-graph-max-bs 4 \
      --max-running-requests 4 \
      --speculative-algorithm EAGLE \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4 \
      --moe-runner-backend b12x \
      --attention-backend b12x \
      --mm-attention-backend b12x \
      --fp4-gemm-backend b12x \
      --fp8-gemm-backend flashinfer_cutlass
```
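
Once either server is up, it exposes SGLang's OpenAI-compatible API on the host/port above. A quick smoke test might look like the following sketch; the URL and model name match the launch commands, but the prompt and token limit are arbitrary:

```python
import json
import urllib.request

def build_chat_request(prompt: str, host: str = "http://localhost:8000"):
    """Build an OpenAI-compatible chat request for the server above."""
    payload = {
        "model": "MiMo-V2.5",  # matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires a running server; prints the model's reply.
    with urllib.request.urlopen(build_chat_request("Say hello.")) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```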