---
base_model:
- XiaomiMiMo/MiMo-V2.5
---

## IMPORTANT: You *must* use the Docker image below, since it contains many custom kernels written specifically for this model ##
|
|
| ## Model Description |
|
|
| **MiMo-V2.5-NVFP4** is an NVFP4-quantized version of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5). |
|
|
| This is a multi-modal model, supporting text, images, audio and video. This quantization carefully preserves those capabilities. |
|
|
| ### What's quantized |
|
|
Only the *non-shared* MoE expert MLP projections are quantized to NVFP4; attention weights, the dense MLPs (layers 0-3), and the shared experts remain in BF16. Since the MoE expert weights constitute the vast majority of parameters in an MoE architecture, this still yields significant memory savings.
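
To see the split for yourself, you can read each shard's safetensors header, which records every tensor's dtype without loading any weights. A minimal sketch; the local checkpoint directory name and the `.experts.` substring in tensor names are assumptions (HF-style MoE naming), so adjust them to the actual checkpoint:

```
import json
import struct
from collections import Counter
from pathlib import Path

def tensor_dtypes(path):
    """Map tensor name -> dtype, read from the safetensors JSON header
    (an 8-byte little-endian length followed by the header itself)."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}

# Tally dtypes for expert vs. non-expert tensors across all shards.
expert, other = Counter(), Counter()
for shard in sorted(Path("MiMo-V2.5-NVFP4").glob("*.safetensors")):
    for name, dtype in tensor_dtypes(shard).items():
        (expert if ".experts." in name else other)[dtype] += 1

print("expert tensors:", dict(expert))  # quantized: typically packed U8 weights plus scales
print("other tensors:", dict(other))    # expected to stay BF16
```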
|
|
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on far more samples than is typical, ensuring broad expert coverage through natural routing alone.
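
As an illustration of what "broad expert coverage" means in practice, here is a minimal sketch of counting expert activations with PyTorch forward hooks during a calibration pass. This is not the actual calibration code; the module paths, output shapes, top-k value, and coverage threshold are all hypothetical:

```
import torch
from collections import Counter

# Assumes `model` is the unquantized model already loaded in PyTorch.
TOP_K = 8  # hypothetical; read the real top-k from the model config

expert_counts = Counter()

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Hypothetical: the gate outputs router logits [num_tokens, num_experts].
        logits = output[0] if isinstance(output, tuple) else output
        top = torch.topk(logits, k=TOP_K, dim=-1).indices
        expert_counts.update((layer_idx, int(e)) for e in top.flatten())
    return hook

# Hypothetical module path; inspect model.named_modules() for the real gates.
for name, module in model.named_modules():
    if name.endswith("mlp.gate"):
        layer_idx = int(name.split(".")[2])  # assumes "...layers.<i>.mlp.gate"
        module.register_forward_hook(make_hook(layer_idx))

# ...run the calibration batches through the model, then check coverage:
rare = [key for key, n in expert_counts.items() if n < 1024]
print(f"{len(rare)} (layer, expert) pairs saw <1024 calibration tokens")
```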
|
|
| ### Calibration dataset |
|
|
| Six calibration passes were run: |
|
|
| 1. **Coding** — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts. |
| 2. **Broad** — Large-scale diverse samples drawn from WildChat-NonToxic and LMSYS-Chat covering real user conversations across a wide range of topics and languages. |
| 3. **Deep** — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns. |
4. **Image** — Image question-answering prompts, with the input images drawn from a large collection of public, high-quality image datasets.
5. **Audio** — A medium-sized dataset consisting mostly of speech.
| 6. **Video** — Diverse set of video question-answering prompts, with a wide variety of input videos of different durations and resolutions. |
|
|
| ### Requirements |
|
|
The NVFP4 variant of this model is currently only supported on the RTX PRO 6000 Blackwell (SM120), due to the large number of custom kernels that had to be written to support it.
|
|
Minimum: 2x RTX PRO 6000 Blackwell 96GB (future memory optimizations will let it fit more comfortably; for now you'll have to reduce the model's sequence length and batch size to make it fit)
|
|
| Recommended: 4x RTX PRO 6000 Blackwell 96GB |
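
To squeeze onto the 2-GPU minimum, the launch command in the next section can be trimmed along these lines (illustrative, untuned values):

```
--tp-size 2 \
--context-length 32768 \
--max-running-requests 2 \
--mem-fraction-static 0.80
```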
|
|
| ### Community Testing |
|
|
Note: You will of course want to modify this command to bind-mount your Hugging Face cache, or you'll re-download the model each time.
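
For example, adding a mount like this to the `docker run` flags (assuming the default Hugging Face cache location on the host) reuses previously downloaded weights:

```
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
```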
|
|
| ``` |
| docker run --rm -it \ |
| --name sglang-mimo-v25 \ |
| --gpus '"device=0,1,2,3"' \ |
| --ipc=host \ |
| --network host \ |
| --ulimit memlock=-1 \ |
| --ulimit stack=67108864 \ |
| -e OMP_NUM_THREADS=16 \ |
| -e SAFETENSORS_FAST_GPU=1 \ |
| -e CUTE_DSL_ARCH="sm_120a" \ |
| docker.io/lukealonso/sglang-cuda13-b12x \ |
| python -m sglang.launch_server \ |
| --model-path lukealonso/MiMo-V2.5-NVFP4 \ |
| --served-model-name "MiMo-V2.5" \ |
| --tp-size 4 \ |
| --page-size 64 \ |
| --host 0.0.0.0 \ |
| --port 8000 \ |
| --enforce-piecewise-cuda-graph \ |
| --kv-cache-dtype fp8_e4m3 \ |
| --mem-fraction-static 0.85 \ |
| --chunked-prefill-size 8192 \ |
| --speculative-algorithm EAGLE \ |
| --speculative-num-steps 3 \ |
| --speculative-eagle-topk 1 \ |
| --speculative-num-draft-tokens 4 \ |
| --enable-pcie-oneshot-allreduce \ |
| --enable-multi-layer-eagle \ |
| --reasoning-parser mimo \ |
| --tool-call-parser mimo \ |
| --quantization modelopt_fp4 \ |
| --max-running-requests 8 \ |
| --moe-runner-backend b12x \ |
| --attention-backend b12x \ |
| --mm-attention-backend b12x \ |
| --fp4-gemm-backend b12x |
| |
| ``` |
|
|
|
|
|
|