---
base_model:
- XiaomiMiMo/MiMo-V2.5
---

## IMPORTANT: You *must* use the Docker image below, since it contains many custom kernels written specifically for this model ##
|
|
| ## Model Description |
|
|
| **MiMo-V2.5-NVFP4** is an NVFP4-quantized version of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5). |
|
|
| This is a multi-modal model, supporting text, images, audio and video. This quantization carefully preserves those capabilities. |
|
|
| ### What's quantized |
|
|
Only the *non-shared* MoE expert MLP projections are quantized to NVFP4; attention weights, the dense MLPs (layers 0-3), and the shared experts remain in BF16. Since the MoE expert weights constitute the vast majority of parameters in an MoE architecture, this still yields significant memory savings.
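
To see the split for yourself, you can read each shard's safetensors header, which records every tensor's dtype without loading any weights. A minimal sketch; the local checkpoint directory name and the `.experts.` substring in tensor names are assumptions (HF-style MoE naming), so adjust them to the actual checkpoint:

```
import json
import struct
from collections import Counter
from pathlib import Path

def tensor_dtypes(path):
    """Map tensor name -> dtype, read from the safetensors JSON header
    (an 8-byte little-endian length followed by the header itself)."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}

# Tally dtypes for expert vs. non-expert tensors across all shards.
expert, other = Counter(), Counter()
for shard in sorted(Path("MiMo-V2.5-NVFP4").glob("*.safetensors")):
    for name, dtype in tensor_dtypes(shard).items():
        (expert if ".experts." in name else other)[dtype] += 1

print("expert tensors:", dict(expert))  # quantized: typically packed U8 weights plus scales
print("other tensors:", dict(other))    # expected to stay BF16
```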
|
|
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on far more samples than is typical, ensuring broad expert coverage through natural routing alone.
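
As an illustration of what "broad expert coverage" means in practice, here is a minimal sketch of counting expert activations with PyTorch forward hooks during a calibration pass. This is not the actual calibration code; the module paths, output shapes, top-k value, and coverage threshold are all hypothetical:

```
import torch
from collections import Counter

# Assumes `model` is the unquantized model already loaded in PyTorch.
TOP_K = 8  # hypothetical; read the real top-k from the model config

expert_counts = Counter()

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Hypothetical: the gate outputs router logits [num_tokens, num_experts].
        logits = output[0] if isinstance(output, tuple) else output
        top = torch.topk(logits, k=TOP_K, dim=-1).indices
        expert_counts.update((layer_idx, int(e)) for e in top.flatten())
    return hook

# Hypothetical module path; inspect model.named_modules() for the real gates.
for name, module in model.named_modules():
    if name.endswith("mlp.gate"):
        layer_idx = int(name.split(".")[2])  # assumes "...layers.<i>.mlp.gate"
        module.register_forward_hook(make_hook(layer_idx))

# ...run the calibration batches through the model, then check coverage:
rare = [key for key, n in expert_counts.items() if n < 1024]
print(f"{len(rare)} (layer, expert) pairs saw <1024 calibration tokens")
```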
|
|
| ### Calibration dataset |
|
|
| Six calibration passes were run: |
|
|
| 1. **Coding** — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts. |
| 2. **Broad** — Large-scale diverse samples drawn from WildChat-NonToxic and LMSYS-Chat covering real user conversations across a wide range of topics and languages. |
| 3. **Deep** — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns. |
4. **Image** — Image question-answering prompts, with the input images drawn from a large collection of public, high-quality image datasets.
5. **Audio** — A medium-sized dataset consisting mostly of speech.
| 6. **Video** — Diverse set of video question-answering prompts, with a wide variety of input videos of different durations and resolutions. |
|
|
| ### Requirements |
|
|
The NVFP4 variant of this model is currently only supported on the RTX PRO 6000 Blackwell (SM120), due to the large number of custom kernels that had to be written to support it.
|
|
Minimum: 2x RTX PRO 6000 Blackwell 96GB (future memory optimizations will let it fit more comfortably; for now you'll have to reduce the model's sequence length and batch size to make it fit)
|
|
| Recommended: 4x RTX PRO 6000 Blackwell 96GB |
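
To squeeze onto the 2-GPU minimum, the launch command in the next section can be trimmed along these lines (illustrative, untuned values):

```
--tp-size 2 \
--context-length 32768 \
--max-running-requests 2 \
--mem-fraction-static 0.80
```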
|
|
| ### Community Testing |
|
|
Note: You will of course want to modify this command to bind-mount your Hugging Face cache, or you'll re-download the model each time.
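
For example, adding a mount like this to the `docker run` flags (assuming the default Hugging Face cache location on the host) reuses previously downloaded weights:

```
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
```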
|
|
| ``` |
| docker run --rm -it \ |
| --name sglang-mimo-v25 \ |
| --gpus '"device=0,1,2,3"' \ |
| --ipc=host \ |
| --network host \ |
| --ulimit memlock=-1 \ |
| --ulimit stack=67108864 \ |
| -e OMP_NUM_THREADS=16 \ |
| -e SAFETENSORS_FAST_GPU=1 \ |
| -e CUTE_DSL_ARCH="sm_120a" \ |
| docker.io/lukealonso/sglang-cuda13-b12x \ |
| python -m sglang.launch_server \ |
| --model-path lukealonso/MiMo-V2.5-NVFP4 \ |
| --served-model-name "MiMo-V2.5" \ |
| --tp-size 4 \ |
| --page-size 64 \ |
| --host 0.0.0.0 \ |
| --port 8000 \ |
| --enforce-piecewise-cuda-graph \ |
| --kv-cache-dtype fp8_e4m3 \ |
| --mem-fraction-static 0.85 \ |
| --chunked-prefill-size 8192 \ |
| --speculative-algorithm EAGLE \ |
| --speculative-num-steps 3 \ |
| --speculative-eagle-topk 1 \ |
| --speculative-num-draft-tokens 4 \ |
| --enable-pcie-oneshot-allreduce \ |
| --enable-multi-layer-eagle \ |
| --reasoning-parser mimo \ |
| --tool-call-parser mimo \ |
| --quantization modelopt_fp4 \ |
| --max-running-requests 8 \ |
| --moe-runner-backend b12x \ |
| --attention-backend b12x \ |
| --mm-attention-backend b12x \ |
| --fp4-gemm-backend b12x |
| |
| ``` |
|
|
|
|
|
|