Working configuration for Nvidia Blackwell

#4
by luismiguelsaez - opened

Hi folks!

The working vLLM configuration posted by the author doesn't work on dual RTX 6000 Pro, so I'm leaving here what worked for me:

CUDA_VISIBLE_DEVICES=0,1 \
SAFETENSORS_FAST_GPU=1 \
NCCL_P2P_DISABLE=1 \
NCCL_DEBUG=INFO \
VLLM_LOGGING_LEVEL=INFO \
vllm serve lukealonso/MiniMax-M2.7-NVFP4 \
--trust-remote-code \
--enable_expert_parallel \
--tensor-parallel-size 2 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--disable-custom-all-reduce \
--kv-cache-dtype fp8 \
--max-num-seqs 2
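Once the server is up, a quick smoke test is to POST an OpenAI-style chat request. A minimal sketch (model name and port are taken from the serve command above; the prompt is just an example):

```shell
# Build an OpenAI-style chat payload for the vLLM server above.
# Model name and port come from the serve command; adjust if you changed them.
payload=$(cat <<'EOF'
{"model": "lukealonso/MiniMax-M2.7-NVFP4",
 "messages": [{"role": "user", "content": "Say hello."}],
 "max_tokens": 32}
EOF
)
echo "$payload"
# Once the server is listening:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$payload"
```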

Hope it's useful for someone!

Looks similar to the vLLM config I settled on. I'm also running 2x RTX PRO 6000 Blackwell. I found the performance slightly slower than 2.5 with a similar setup on the same hardware; see the thread I posted here yesterday. Another user posted a nice SGLang docker-compose that is BLAZING FAST.

Thanks, I'll have a look at the SGLang compose YAML. Regarding the configuration I used, I couldn't make it work without --disable-custom-all-reduce and the NCCL variables; it got stuck during initialization otherwise.

Did you do the BIOS fix to let the cards talk to each other over PCIe? If you don't, that's when you tend to get hangs during init. See here: https://www.reddit.com/r/LocalLLaMA/comments/1on7kol/troubleshooting_multigpu_with_2_rtx_pro_6000/
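One quick check is `nvidia-smi topo -m`: if the GPU0↔GPU1 entry reads SYS, peer traffic crosses the host bridge, which is where these init hangs tend to show up. A sketch of what to look for (the sample line below is illustrative, not real output):

```shell
# Illustrative line from `nvidia-smi topo -m` (replace with your real output).
# Column 3 is the GPU0<->GPU1 link type: PIX/PXB = direct PCIe path,
# SYS = through the host bridge (where P2P hangs are common).
sample='GPU0     X    SYS   0-31   0   N/A'
link=$(echo "$sample" | awk '{print $3}')
if [ "$link" = "SYS" ]; then
  echo "GPU0<->GPU1 via host bridge (SYS): P2P may hang; consider NCCL_P2P_DISABLE=1"
else
  echo "GPU0<->GPU1 link: $link"
fi
```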

@luismiguelsaez

> Thanks, I'll have a look at the SGLang compose YAML. Regarding the configuration I used, I couldn't make it work without --disable-custom-all-reduce and the NCCL variables; it got stuck during initialization otherwise.

I was fighting the same issue (a freeze during warm-up) until I removed all of the NCCL variables (they worked well for Qwen3.5-120B-A10B).
Make sure to remove what isn't needed, and it should work even without --disable-custom-all-reduce:

docker run --rm \
  --name minimax-m2.7 \
  --ipc=host \
  --shm-size=32g \
  --runtime nvidia \
  --gpus device=all \
  -p 8000:8000 \
  -v /mnt/hfhub:/root/.cache/huggingface/hub \
  -e OMP_NUM_THREADS=16 \
  -e SGLANG_ENABLE_SPEC_V2=True \
  voipmonitor/sglang:cu130 \
  python3 -m sglang.launch_server \
    --model-path lukealonso/MiniMax-M2.7-NVFP4 \
    --served-model-name minimax-m2.7 \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --sleep-on-idle \
    --enable-torch-compile \
    --reasoning-parser minimax \
    --tool-call-parser minimax-m2 \
    --tensor-parallel-size 2 \
    --quantization modelopt_fp4 \
    --kv-cache-dtype bf16 \
    --mem-fraction-static 0.93 \
    --context-length 131072 \
    --max-running-requests 1 \
    --attention-backend flashinfer \
    --fp4-gemm-backend b12x \
    --moe-runner-backend b12x \
    --enable-pcie-oneshot-allreduce
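For context on `--mem-fraction-static 0.93`: it sets the fraction of each GPU's VRAM that SGLang pre-reserves for model weights and KV cache, leaving the rest for activations and CUDA graphs. Back-of-envelope for a 96 GB RTX PRO 6000 (integer GB, just a sketch):

```shell
# Rough arithmetic behind --mem-fraction-static 0.93 on a 96 GB card:
# this fraction of VRAM is pre-reserved for weights + KV cache.
vram_gb=96
frac_pct=93
reserved=$(( vram_gb * frac_pct / 100 ))
echo "${reserved} GB reserved per GPU"
```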

That's a great suggestion, thanks! But also ... maybe they are now auto-discovering the GPU topology, which is unpredictable when using Docker image tags that change over time; that's why I like to pin to more specific version tags 😄
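On pinning: beyond version tags, you can pin to an immutable digest so the image can never change underneath you. A sketch (the digest below is a placeholder; the real one can be read with `docker inspect --format '{{index .RepoDigests 0}}' voipmonitor/sglang:cu130`):

```shell
# Build an immutable image reference from repo + digest
# (the digest value here is a placeholder, not a real digest).
image="voipmonitor/sglang"
digest="sha256:PASTE_REAL_DIGEST_HERE"
pinned="${image}@${digest}"
echo "$pinned"
# Then: docker run ... "$pinned" ...
```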

Yes. Just reporting back what helped me :). The performance is great. Not as great as Qwen122B with MTP, but still good:

| Metric | avg | min | max | p99 | p90 | p50 | std |
|---|---|---|---|---|---|---|---|
| Time to First Token (ms) | 620.80 | 148.28 | 1,149.04 | 1,114.62 | 989.63 | 624.48 | 273.44 |
| Time to Second Token (ms) | 6.78 | 5.60 | 7.29 | 7.28 | 7.18 | 6.97 | 0.43 |
| Time to First Output Token (ms) | 3,086.46 | 1,464.16 | 7,021.31 | 6,642.71 | 5,361.52 | 2,714.12 | 1,437.47 |
| Request Latency (ms) | 22,294.26 | 15,178.43 | 29,996.89 | 29,834.95 | 28,472.66 | 22,197.37 | 4,540.13 |
| Inter Token Latency (ms) | 10.59 | 7.34 | 14.09 | 14.03 | 13.43 | 10.54 | 2.09 |
| Output Token Throughput Per User (tokens/sec/user) | 98.34 | 70.96 | 136.19 | 135.81 | 128.59 | 94.90 | 20.05 |
| Output Sequence Length (tokens) | 2,048.00 | 2,048.00 | 2,048.00 | 2,048.00 | 2,048.00 | 2,048.00 | 0.00 |
| Input Sequence Length (tokens) | 39,487.87 | 1,062.00 | 81,200.00 | 80,325.07 | 72,627.90 | 39,000.00 | 24,454.61 |
| Output Token Throughput (tokens/sec) | 88.01 | N/A | N/A | N/A | N/A | N/A | N/A |
| Request Throughput (requests/sec) | 0.04 | N/A | N/A | N/A | N/A | N/A | N/A |
| Request Count (requests) | 30.00 | N/A | N/A | N/A | N/A | N/A | N/A |
aiperf profile --model 'minimax-m2.7-nvfp4-sgl-128k-p1' \
  --tokenizer 'MiniMaxAI/MiniMax-M2.7' --tokenizer-trust-remote-code \
  --url 'http://localhost:8080' --endpoint-type 'chat' --endpoint '/v1/chat/completions' \
  --streaming --concurrency 1 --conversation-num 1 --conversation-turn-mean 30 \
  --conversation-turn-stddev 0 --conversation-turn-delay-mean 1000 \
  --conversation-turn-delay-stddev 0 --synthetic-input-tokens-mean 1024 \
  --synthetic-input-tokens-stddev 0 --output-tokens-mean 2048 \
  --num-dataset-entries 30 --warmup-request-count 1 --random-seed 42 \
  --connection-reuse-strategy 'sticky-user-sessions' --extra-inputs 'min_tokens:2048' \
  --use-legacy-max-tokens --use-server-token-count

(2x RTX PRO 6000 @ 450W)

Looks like solid performance, especially compared to Qwen3.5 122b, which is faster but a far less intelligent model, at least according to my real-life usage tests.

Btw, didn't know about that aiperf tool, looks great.
