Poor performance in vLLM
#3
by sinebubble - opened
Is anyone else running this model in vLLM? I'm getting 0.3 tok/s for prompt processing and a soul-crushing 3.7 tok/s in generation... with 384 GB of VRAM.
--tensor-parallel-size 8
--max-model-len 131072
--served-model-name Qwen3.5-397B-A17B-GPTQ-Int4
--enable-prefix-caching
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--quantization moe_wna16
--max-num-batched-tokens 4096
--gpu-memory-utilization 0.85
--enforce-eager
You need to disable thinking when you make the request. The reported tok/s doesn't count thinking tokens.
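For Qwen3-family chat templates, thinking can usually be switched off per request by passing `chat_template_kwargs` through vLLM's OpenAI-compatible endpoint. A rough sketch (the endpoint URL and model name here match the flags in the original post; the exact kwarg name depends on the model's chat template, so treat `enable_thinking` as an assumption to verify against the template):

```shell
# Per-request thinking disable via vLLM's OpenAI-compatible API (sketch).
# "chat_template_kwargs" is forwarded to the chat template at render time.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-397B-A17B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```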
I disabled thinking but it had minimal impact on performance.
Prompt: 0.6 tok/s (20 tokens)
Generation: 3.9 tok/s (135 tokens)
Total time: 34.85s (155 total tokens)
Remove the --enforce-eager bit and give --enable-expert-parallel a try.
@raulalonsoctic that was the trick. I'm getting much better performance.
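For anyone landing here later, the working invocation would look roughly like the following. This is a sketch assembled from the flags in the original post, with `--enforce-eager` dropped and `--enable-expert-parallel` added per the suggestion above; the model path is a placeholder:

```shell
# Sketch: original flags minus --enforce-eager, plus expert parallelism.
# <path-to-model> is a placeholder, not from the thread.
vllm serve <path-to-model> \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --quantization moe_wna16 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85
```

Dropping `--enforce-eager` matters here because it re-enables CUDA graph capture, which is a large win for decode throughput; expert parallelism spreads the MoE experts across the 8 GPUs instead of replicating them.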
$ ./llm_benchmark.py --ip 10.250.11.75
BASE_URL: http://10.250.11.75:8000
MODEL: Qwen3.5-397B-A17B-AWQ
=== Prefill Test (long prompt, 1 token) ===
HTTP Status: 200
Response headers: {'date': 'Thu, 19 Mar 2026 22:32:16 GMT', 'server': 'uvicorn', 'content-length': '589', 'content-type': 'application/json'}
Prompt tokens: 3244
Total time: 68.12s
Prefill speed (estimate): 47.6 tok/s
=== Decode Test (short prompt, 256 tokens) ===
HTTP Status: 200
Prompt tokens: 55
Completion tokens: 256
Total time: 3.96s
Decode speed: 64.7 tok/s
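For clarity, the speeds the script reports are just token counts divided by wall time. Reproducing them from the numbers printed above:

```shell
# prefill estimate: 3244 prompt tokens processed in 68.12 s
awk 'BEGIN{printf "prefill: %.1f tok/s\n", 3244/68.12}'
# decode: 256 completion tokens generated in 3.96 s
awk 'BEGIN{printf "decode:  %.1f tok/s\n", 256/3.96}'
```

The decode figure comes out as 64.6 here rather than the 64.7 shown, presumably because the 3.96 s printed is itself rounded and the script divides by the unrounded elapsed time.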