about speed
How do I use vllm bench to evaluate this model's TTFT/TPOT? When I use vllm to evaluate TTFT/TPOT, the results are the same as Qwen3.
Hi,
Thanks for posting!
FlashHead is not correctly applied when you use vllm bench; we have seen this issue before. If you follow the example and run inference with the LLM class, you'll see it being applied, along with the speedups. We are working on a new package to support vllm bench, which should ship very soon. I'll notify you once it's ready!
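For reference, here is a minimal sketch of the LLM-class path where FlashHead does get applied. The model name is from this thread; the prompt and sampling settings are illustrative, not from the model card:

```python
# Offline inference via vLLM's LLM class (the path where FlashHead applies).
# Requires a GPU and the flash-head plugin installed; settings are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="embedl/Qwen3-0.6B-FlashHead", max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What do TTFT and TPOT measure?"], params)
print(outputs[0].outputs[0].text)
```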
Update on this as well: the flash-head plugin I mentioned in the other thread can be used here too, with one caveat.
When we evaluate end-to-end latency we do something like:
```bash
vllm bench latency \
  --model embedl/Qwen3-0.6B-FlashHead \
  --batch-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.75 \
  --output-json flashhead.json
```
Note the low batch size! FlashHead's speedup is most pronounced at batch_size < 10, which is the typical real-time / on-device use case. At higher batch sizes the relative gain shrinks.
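For interpreting the numbers: TTFT is the time until the first generated token, and TPOT is the average time per token after that. A quick sketch of the usual decomposition of end-to-end latency (this is the standard definition, not anything specific to the plugin):

```python
def tpot(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average time per output token after the first (standard definition)."""
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# Example: 2.0 s end-to-end, 0.5 s TTFT, 16 generated tokens -> 0.1 s/token.
print(tpot(2.0, 0.5, 16))
```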
For a baseline comparison, set FLASHHEAD_ENABLED=0 to disable the plugin without uninstalling it:
```bash
FLASHHEAD_ENABLED=0 vllm bench latency \
  --model embedl/Qwen3-0.6B-FlashHead ...
```