Poor performance in vLLM

#3
by sinebubble - opened

Is anyone else running this model in vLLM? I'm getting 0.3 tok/s for prompting and a soul-crushing 3.7 tok/s in generation... with 384 GB of VRAM.

  --tensor-parallel-size 8
  --max-model-len 131072
  --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4
  --enable-prefix-caching
  --enable-auto-tool-choice
  --tool-call-parser qwen3_coder
  --reasoning-parser qwen3
  --quantization moe_wna16
  --max-num-batched-tokens 4096
  --gpu-memory-utilization 0.85
  --enforce-eager

You need to disable thinking when you make the request. tok/s doesn't count thinking tokens.
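For anyone unsure how to do that: a minimal sketch of a request with thinking disabled, assuming the server exposes vLLM's OpenAI-compatible /v1/chat/completions endpoint and the model's chat template honors a Qwen-style `enable_thinking` switch (the host/port and prompt here are placeholders):

```python
import json

# vLLM forwards `chat_template_kwargs` to the chat template when it renders
# the prompt, so this is where the thinking switch goes (not in `extra_body`
# of the raw HTTP payload -- that's only needed with the OpenAI client lib).
payload = {
    "model": "Qwen3.5-397B-A17B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
# POST `body` to http://<host>:8000/v1/chat/completions with requests/httpx.
```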

I disabled thinking but it had minimal impact on performance.

Prompt:     0.6 tok/s  (20 tokens)
Generation: 3.9 tok/s  (135 tokens)
Total time: 34.85s  (155 total tokens)

Remove the --enforce-eager flag and give --enable-expert-parallel a try.
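For reference, a sketch of the adjusted launch based on the flags above (the model path is a placeholder): dropping --enforce-eager lets vLLM capture CUDA graphs for decode, and --enable-expert-parallel shards the MoE experts across GPUs instead of replicating them.

```shell
vllm serve <model-path> \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --served-model-name Qwen3.5-397B-A17B-GPTQ-Int4 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --quantization moe_wna16 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.85
```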

@raulalonsoctic that was the trick. I'm getting much better performance.

$ ./llm_benchmark.py --ip 10.250.11.75
BASE_URL: http://10.250.11.75:8000
MODEL: Qwen3.5-397B-A17B-AWQ

=== Prefill Test (long prompt, 1 token) ===
HTTP Status: 200
Response headers: {'date': 'Thu, 19 Mar 2026 22:32:16 GMT', 'server': 'uvicorn', 'content-length': '589', 'content-type': 'application/json'}
Prompt tokens: 3244
Total time: 68.12s
Prefill speed (estimate): 47.6 tok/s

=== Decode Test (short prompt, 256 tokens) ===
HTTP Status: 200
Prompt tokens: 55
Completion tokens: 256
Total time: 3.96s
Decode speed: 64.7 tok/s
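As a sanity check on the reported figures, the speeds fall straight out of the raw numbers (the prefill figure is an estimate, as labeled: prompt tokens over total request time; the script presumably times more precisely, which accounts for any last-digit difference):

```python
# Prefill: 3244 prompt tokens processed in 68.12 s total.
prefill_tps = 3244 / 68.12   # ~47.6 tok/s

# Decode: 256 completion tokens generated in 3.96 s total.
decode_tps = 256 / 3.96      # ~64.6 tok/s

print(f"prefill ~{prefill_tps:.1f} tok/s, decode ~{decode_tps:.1f} tok/s")
```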
