vllm (SM70) V100 support

#5
by FayeQuant - opened

(SM70) V100 cannot run this model.

I made it work.

  1. First, install vLLM from source as an editable install.
  2. Second, apply this bugfix:

    https://github.com/vllm-project/vllm/pull/36026

  3. Finally, launch the server with this script:
# --mm-encoder-attn-backend TORCH_SDPA is very important for multimodal on V100.
# --gpu-memory-utilization: try 0.5 the first time, then increase based on free VRAM.
# (The notes are kept outside the command: an inline "# comment" after a
#  trailing backslash would break the shell line continuation.)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3000 \
vllm serve "/home/myname/models/Qwen3.5-35B-A3B-GPTQ-Int4" \
  --host 0.0.0.0 \
  --port 8018 \
  --served-model-name "Qwen3.5-35B-A3B-GPTQ-Int4" \
  --mm-encoder-attn-backend TORCH_SDPA \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.79 \
  --tensor-parallel-size 4
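Once the server is up, any OpenAI-compatible client can talk to it on port 8018. A minimal sketch of the request body you would POST to `http://<host>:8018/v1/chat/completions` (the model name and port come from the serve command above; the helper function itself is hypothetical, not part of vLLM):

```python
import json

# Hypothetical helper: builds the JSON body for a chat-completions
# request against the OpenAI-compatible endpoint that `vllm serve` exposes.
def build_chat_request(model, messages, max_tokens=512, temperature=0.7):
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request(
    "Qwen3.5-35B-A3B-GPTQ-Int4",  # must match --served-model-name
    [{"role": "user", "content": "Hello!"}],
)
print(json.dumps(payload, indent=2))
```

Send it with any HTTP client, or point the `openai` Python SDK at `base_url="http://<host>:8018/v1"`.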

(SM70) V100 cannot run this model.

It can, on a 32 GB V100, but you have to build vLLM from source and apply the bugfix linked above.
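As a rough sanity check on `--gpu-memory-utilization`: vLLM pre-allocates that fraction of each GPU's VRAM for weights plus KV cache, so on 4x 32 GB V100s the 0.79 setting above works out as follows (back-of-envelope arithmetic only; exact accounting is up to the runtime):

```python
# Illustrative budget math for --gpu-memory-utilization 0.79 on 4x V100 32G.
GIB = 1024**3
per_gpu_vram = 32 * GIB
utilization = 0.79        # value from the serve command above
tensor_parallel = 4       # --tensor-parallel-size 4

budget_per_gpu = per_gpu_vram * utilization
total_budget = budget_per_gpu * tensor_parallel
print(f"per-GPU budget:  {budget_per_gpu / GIB:.2f} GiB")   # 25.28 GiB
print(f"total (TP=4):    {total_budget / GIB:.2f} GiB")     # 101.12 GiB
```

Starting at 0.5 first (16 GiB per GPU) leaves headroom for the multimodal encoder and CUDA context; raise it once you see how much VRAM is actually free.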
