vllm (SM70) V100 support
#5
by FayeQuant - opened
(SM70) V100 cannot run it
I made it work.
- First, install vLLM from source in editable mode.
- Second, apply this bugfix:
- Finally, start the server with this script:
# Note: --mm-encoder-attn-backend TORCH_SDPA is very important for multimodal on V100.
# Note: for --gpu-memory-utilization, try 0.5 the first time, then increase based on free VRAM.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3000 \
vllm serve "/home/myname/models/Qwen3.5-35B-A3B-GPTQ-Int4" \
  --host 0.0.0.0 \
  --port 8018 \
  --served-model-name "Qwen3.5-35B-A3B-GPTQ-Int4" \
  --mm-encoder-attn-backend TORCH_SDPA \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.79 \
  --tensor-parallel-size 4
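For the first step, an editable install from source typically looks like the sketch below. This is an assumption about the exact commands; check the official vLLM build-from-source instructions for your CUDA toolkit version, since compiling the kernels for SM70 can take a while.

```shell
# Sketch: build vLLM from source in editable mode (needed here because
# prebuilt wheels may not target the V100's SM70 architecture).
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .   # editable install; compiles the CUDA kernels locally
```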
(SM70) V100 cannot run it
It can, on a 32 GB V100, but you have to build vLLM from source and fix a bug.
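Once the server is up, you can talk to it over the OpenAI-compatible API that `vllm serve` exposes. Below is a minimal client sketch, assuming the host/port and `--served-model-name` from the serve command above; `build_chat_request` is a hypothetical helper added here for illustration.

```python
# Minimal client sketch for the vLLM server started above (port 8018).
import json
import urllib.request

BASE_URL = "http://localhost:8018/v1"  # matches --port 8018 in the serve script


def build_chat_request(prompt: str) -> dict:
    # The "model" field must match --served-model-name from the serve command.
    return {
        "model": "Qwen3.5-35B-A3B-GPTQ-Int4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }


def chat(prompt: str) -> str:
    # POST to the OpenAI-compatible chat completions endpoint.
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call `chat("...")` only after the server has finished loading; with `VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=3000` set as above, startup on four V100s can be slow.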