How to run on vLLM for 4xSM120

#1 by zenmagnets

Good NVFP4 Quant!

  # FlashInfer backends for the NVFP4 GEMM and MoE paths on SM120:
  VLLM_USE_FLASHINFER_MOE_FP4=1 \
  VLLM_USE_FLASHINFER_MOE_FP16=1 \
  VLLM_FLASHINFER_MOE_BACKEND=throughput \
  VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  vllm serve lukealonso/Qwen3.5-397B-A17B-NVFP4 \
    --tensor-parallel-size 4 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
    --max-model-len 262144 \
    --max-num-seqs 64 \
    --max-num-batched-tokens 4096 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
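
Once the server is up, you can sanity-check it with an OpenAI-compatible chat request. This is a minimal sketch: it assumes the default port 8000 and that the served model name matches the repo id (both are vLLM defaults unless overridden):

  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "lukealonso/Qwen3.5-397B-A17B-NVFP4",
      "messages": [{"role": "user", "content": "Say hello in one sentence."}],
      "max_tokens": 64
    }'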

Tested on 4x RTX 6000 Blackwell (SM120) with:

  • vLLM: 0.17.0rc1.dev204+g04b67d8f6
  • PyTorch: 2.10.0+cu130
  • CUDA runtime: 13.0
  • FlashInfer: 0.6.4
  • transformers: 4.57.6
  • safetensors: 0.7.0
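
For reference, a rough way to approximate this stack (an assumption, not the exact setup: the vLLM nightly wheel index should land close to the dev build listed above, but matching that exact commit may require building vLLM from source):

  # Assumption: nightly wheels are near the dev version above; build from
  # source if you need the exact commit.
  pip install -U --pre vllm --extra-index-url https://wheels.vllm.ai/nightly
  # FlashInfer kernels used by the env vars in the serve command.
  pip install -U flashinfer-python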
