MiMo V2.5 on 4-node Grace-Blackwell DGX Spark: 5 SGLang PRs and 2 doc notes


Hello MiMo team! This is a friend from California. 😀

Thanks for the MiMo V2.5 release!

Wanted to share 5 SGLang fixes from a recent bring-up plus two small documentation suggestions.

Where I deployed it

A cluster of 4 GB10 (Grace-Blackwell, sm_121a) DGX Sparks. SGLang dev-cu13-mimo-v2.5 image, native torch.distributed NCCL across all 4 nodes (no Ray). TP=4, FP8 weights and FP8 KV (fp8_e4m3), --attention-backend triton, --mem-fraction-static 0.83. Serving the full omni stack: text, audio, video, and the OpenAI-compatible transcription endpoint.

5 SGLang PRs from the bring-up

These all hit the multimodal path on Blackwell. Anyone serving MiMo V2.5 on GB10 will likely need the full set.

  1. #24493 [Multimodal] mimo_v2: coerce list-shaped input_text in process_mm_data_async. The transcription adapter passes input_text as list[str] (decoded segments) or list[int] (token-ID fallback), but the processor handed it straight to AUDIO_TOKEN_REGEX.search, which expects a str. Every call to /v1/audio/transcriptions returned a 500 until this was patched.

  2. #24497 [Models] mimo_audio: fall back to SDPA varlen on non-Hopper CUDA. The audio encoder hard-imports flash_attn_varlen_func which requires Hopper (sm_90). On Blackwell (sm_121) and Ada (sm_89), the import fails and the encoder is dead. PR adds a PyTorch SDPA varlen fallback for non-Hopper CUDA.

  3. #24501 [Entrypoints] transcription_adapters: register MiMoV2 adapter. The transcription endpoint had no adapter registered for MiMoV2ForCausalLM, so the request handler crashed on unsupported keyword arguments. Adds a minimal adapter that forwards compatible SamplingParams and skips Whisper-style timestamp parsing (MiMo's tokenizer doesn't emit those tokens).

  4. #24503 [VLM] video_decoder: make pin_memory() best-effort. On Grace-Blackwell unified memory with TP=4 and high mem_fraction_static, pin_memory() competes for the same pool the model uses for weights and KV cache, OOMs deterministically, and every video request fails. Wraps the call in a CPU-tensor guard plus a narrow RuntimeError catch and falls back to the unpinned tensor on failure.

  5. #24504 [VLM] video_decoder: handle decord 3.3+ Tensor return from get_batch. As of decord 3.3, VideoReader.get_batch() returns a torch.Tensor directly instead of an NDArray, so the unconditional .asnumpy() raises AttributeError. The PR sniffs the return type and only calls .asnumpy() on the legacy NDArray path.

None of these break the Hopper, Ampere, or Ada paths: on discrete-GPU systems where pinning normally succeeds and FA3 is available, the behavior is unchanged. Rough, illustrative sketches of several of these fixes follow.
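
For #24493, a minimal sketch of the coercion idea, assuming a tokenizer with a standard decode() method. The helper name and the join behavior are mine for illustration, not the actual SGLang code:

# Illustrative only: coerce list-shaped input_text back to a plain str
# before it reaches AUDIO_TOKEN_REGEX.search. Not the PR's exact code.
def coerce_input_text(input_text, tokenizer):
    if isinstance(input_text, str):
        return input_text                        # already what the regex expects
    if isinstance(input_text, list):
        if all(isinstance(x, str) for x in input_text):
            return "".join(input_text)           # decoded transcript segments
        if all(isinstance(x, int) for x in input_text):
            return tokenizer.decode(input_text)  # token-ID fallback path
    raise TypeError(f"unsupported input_text type: {type(input_text)}")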
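
For #24497, the fallback idea is to emulate varlen attention with PyTorch SDPA by iterating over the cu_seqlens boundaries. This is a simplified self-attention sketch for packed (total_tokens, heads, head_dim) inputs; the actual PR wires this into the encoder's real call sites:

import torch
import torch.nn.functional as F

# Simplified varlen self-attention via SDPA (the fallback idea in #24497).
# q/k/v are packed as (total_tokens, heads, head_dim); cu_seqlens marks
# sequence boundaries, mirroring flash_attn_varlen_func's layout.
def sdpa_varlen(q, k, v, cu_seqlens):
    out = torch.empty_like(q)
    bounds = cu_seqlens.tolist()
    for s, e in zip(bounds[:-1], bounds[1:]):
        # (tokens, heads, dim) -> (1, heads, tokens, dim) for SDPA
        qi, ki, vi = (x[s:e].transpose(0, 1).unsqueeze(0) for x in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        out[s:e] = oi.squeeze(0).transpose(0, 1)
    return out

def flash_varlen_available():
    # Only take the flash_attn path where both the import and the GPU allow it.
    try:
        from flash_attn import flash_attn_varlen_func  # noqa: F401
    except ImportError:
        return False
    major, _ = torch.cuda.get_device_capability()
    return major == 9  # Hopper; elsewhere fall back to sdpa_varlen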
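
For #24503, the best-effort pinning pattern stripped of SGLang context; maybe_pin is a name I made up:

import torch

def maybe_pin(t: torch.Tensor) -> torch.Tensor:
    # Only CPU tensors can be pinned, and on unified memory (GB10) the pinned
    # pool can already be exhausted by weights/KV cache; fall back quietly.
    if t.is_cuda or t.is_pinned():
        return t
    try:
        return t.pin_memory()
    except RuntimeError:
        return t  # unpinned tensor still works, just without async H2D copies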
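
And for #24504, the return-type sniff in isolation; frames_to_numpy is an illustrative name:

import numpy as np
import torch

def frames_to_numpy(batch) -> np.ndarray:
    # decord >= 3.3 can hand back a torch.Tensor from VideoReader.get_batch();
    # older versions return an NDArray that needs .asnumpy().
    if isinstance(batch, torch.Tensor):
        return batch.cpu().numpy()
    return batch.asnumpy()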

Two documentation suggestions

These are small but tripped me up during bring-up.

1. MTP serving recipe on the V2.5 card

The V2.5 model card describes MTP architecturally ("Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding"), but the SGLang launch example on the V2.5 card doesn't include any of the speculative-decoding flags. The V2.5-Pro card does include them, and so does the V2-Flash card. The V2.5 card just references sgl-project/sglang#23811 without listing the flags it enables.

The flag block that's present on the V2.5-Pro card but missing from V2.5:

SGLANG_ENABLE_SPEC_V2=1
--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
--enable-multi-layer-eagle

2. Reasoning parser name on the V2.5 card vs the SGLang cookbook

The V2.5 model card says --reasoning-parser qwen3, and the SGLang cookbook for MiMo V2.5 says --reasoning-parser mimo. Both work because they're literal aliases of Qwen3Detector in python/sglang/srt/parser/reasoning_parser.py, but the inconsistency was confusing. Aligning both pages on one name (probably mimo, since that already matches the tool parser) would help. I checked the tool-call side too: both pages say --tool-call-parser mimo, which is correct because mimo and qwen3_coder are different classes in SGLang and only mimo matches V2.5's chat template output.

Thanks again!

PS:
Adding the actual launch sequence for anyone wanting to reproduce this on GB10 / DGX Spark. This is what we're running across all 4 ranks.

Container image: lmsysorg/sglang:dev-cu13-mimo-v2.5

Common env (every rank):

SGLANG_ENABLE_SPEC_V2=True
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
TORCH_CUDA_ARCH_LIST=12.1a            # GB10 sm_121a
OMP_NUM_THREADS=4
MKL_NUM_THREADS=4
TORCH_NUM_THREADS=4

# RDMA / NCCL: adjust interface names to your fabric
NCCL_IGNORE_CPU_AFFINITY=1
NCCL_SOCKET_IFNAME=enP2p1s0f0np0
NCCL_IB_HCA=roceP2p1s0f0
NCCL_IB_DISABLE=0
NCCL_CUMEM_ENABLE=0
GLOO_SOCKET_IFNAME=enP2p1s0f0np0
TP_SOCKET_IFNAME=enP2p1s0f0np0

Per-rank env:

VLLM_HOST_IP=<this rank's RDMA IP>    # e.g. 192.168.100.10/.11/.12/.13

Launch (run on every rank, --node-rank set 0..3, head is rank 0):

python3 -m sglang.launch_server \
  --trust-remote-code \
  --model-path /models/mimo-v2.5 \
  --tp 4 \
  --nnodes 4 \
  --node-rank <0|1|2|3> \
  --dist-init-addr <head-RDMA-IP>:29500 \
  --attention-backend triton \
  --mm-attention-backend triton_attn \
  --moe-runner-backend auto \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.83 \
  --max-running-requests 24 \
  --chunked-prefill-size 8192 \
  --max-prefill-tokens 8192 \
  --swa-full-tokens-ratio 0.3 \
  --reasoning-parser qwen3 \
  --tool-call-parser mimo \
  --schedule-policy lpm \
  --host 0.0.0.0 \
  --port 30000 
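
Once all four ranks are up, a quick smoke test from the head node. The base URL, served model name, and sample.wav below are placeholders; check GET /v1/models for the name your server actually registers:

import requests

BASE = "http://<head-RDMA-IP>:30000"
MODEL = "/models/mimo-v2.5"  # placeholder; confirm via GET {BASE}/v1/models

# Text path
r = requests.post(f"{BASE}/v1/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
})
print(r.json()["choices"][0]["message"]["content"])

# Transcription path (the endpoint that #24493 / #24501 unblock)
with open("sample.wav", "rb") as f:
    r = requests.post(f"{BASE}/v1/audio/transcriptions",
                      files={"file": ("sample.wav", f, "audio/wav")},
                      data={"model": MODEL})
print(r.json())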
