about speed
How do I use vllm bench to evaluate this model's TTFT/TPOT? When I use vllm to evaluate TTFT/TPOT, the results are the same as Qwen3.
Hi,
Thanks for posting!
FlashHead is not correctly applied when you use vllm bench; we have seen this issue before. If you follow the example and run inference with the LLM class, you'll see it being applied, along with the speedups. We are working on a new package to support vllm bench, which should ship very soon. I'll notify you once it's ready!
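For reference, here is a minimal sketch of the LLM-class path where FlashHead does get applied. The model name is from this thread; the prompt and sampling settings are illustrative, not from the model card:

```python
# Offline inference via vLLM's LLM class (the path where FlashHead applies).
# Requires a GPU and the flash-head plugin installed; settings are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="embedl/Qwen3-0.6B-FlashHead", max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What do TTFT and TPOT measure?"], params)
print(outputs[0].outputs[0].text)
```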
Update on this as well: the flash-head plugin I mentioned in the other thread can be used here too, with one caveat.
When we evaluate end-to-end latency we do something like:
```bash
vllm bench latency \
  --model embedl/Qwen3-0.6B-FlashHead \
  --batch-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.75 \
  --output-json flashhead.json
```
Note the low batch size! FlashHead's speedup is most pronounced at batch_size < 10, which is the typical real-time / on-device use case. At higher batch sizes the relative gain shrinks.
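For interpreting the numbers: TTFT is the time until the first generated token, and TPOT is the average time per token after that. A quick sketch of the usual decomposition of end-to-end latency (this is the standard definition, not anything specific to the plugin):

```python
def tpot(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average time per output token after the first (standard definition)."""
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# Example: 2.0 s end-to-end, 0.5 s TTFT, 16 generated tokens -> 0.1 s/token.
print(tpot(2.0, 0.5, 16))
```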
For a baseline comparison, set FLASHHEAD_ENABLED=0 to disable the plugin without uninstalling it:
```bash
FLASHHEAD_ENABLED=0 vllm bench latency \
  --model embedl/Qwen3-0.6B-FlashHead ...
```