How to run on vLLM for 4xSM120
by zenmagnets
Good NVFP4 quant! Here is a configuration that works for me:
```shell
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_USE_FLASHINFER_MOE_FP16=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
vllm serve lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 4 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
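Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default port 8000 and the model name from the command above:

```python
# Minimal sanity check against vLLM's OpenAI-compatible chat endpoint.
# Assumptions: server listening on the default port 8000, model name
# matching the repo id served above.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # vLLM default
MODEL = "lukealonso/Qwen3.5-397B-A17B-NVFP4"


def build_request(prompt: str) -> dict:
    """Build a chat-completions payload for the served model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.0,
    }


def send(payload: dict) -> dict:
    """POST the payload to the server and return the parsed response."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_request("Say hello in five words.")
print(json.dumps(payload, indent=2))
# To actually query the server:
#   print(send(payload)["choices"][0]["message"]["content"])
```

The same check works with `curl` against `/v1/chat/completions` if you prefer not to use Python.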
Tested on 4x RTX 6000 Blackwell with:
- vLLM: 0.17.0rc1.dev204+g04b67d8f6
- PyTorch: 2.10.0+cu130
- CUDA runtime: 13.0
- FlashInfer: 0.6.4
- transformers: 4.57.6
- safetensors: 0.7.0
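For rough capacity planning: `--gpu-memory-utilization 0.85` tells vLLM to budget 85% of each GPU's memory for weights plus KV cache. Assuming 96 GB per RTX 6000 Blackwell card (an assumption; confirm with `nvidia-smi`), the back-of-the-envelope total is:

```python
# Total memory budget implied by the flags above.
# Assumption: 96 GB per card (verify with nvidia-smi).
GPUS = 4
GB_PER_GPU = 96
UTILIZATION = 0.85  # matches --gpu-memory-utilization 0.85

budget_gb = GPUS * GB_PER_GPU * UTILIZATION
print(f"vLLM budget across the TP group: {budget_gb:.1f} GB")  # 326.4 GB
```

Whatever is left after the NVFP4 weights fit in that budget goes to the fp8 KV cache, which is what bounds `--max-model-len` and `--max-num-seqs` in practice.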