All the weights in model.safetensors should be used right? When I use this eagle model, the draft acceptance rate is very low.
by Lucas199 - opened
Can you provide a reproduction command?
Environment:
- Python 3.12
- transformers 4.32.0
- vllm main
Command:
vllm serve /workspace/models/Qwen2.5-VL-7B-Instruct \
  --port 5580 --host 0.0.0.0 \
  --max-num-seqs 128 --dtype bfloat16 --max-model-len=32768 \
  --no-enable-prefix-caching --trust-remote-code -tp 4 \
  --speculative-config '{"method": "eagle3", "model": "/workspace/models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 4, "max_model_len": 8192}' \
  --num-lookahead-slots=3 \
  --allowed-local-media-path /workspace/ \
  --gpu-memory-utilization=0.93
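On the original question (whether every tensor in model.safetensors is used): a quick sanity check is to diff the checkpoint's key list against the parameter names the draft model actually loads — any checkpoint key with no consumer was silently dropped. A minimal sketch with made-up key names (in practice you would read the keys via `safetensors.safe_open(...).keys()` and compare against `model.state_dict().keys()`):

```python
def diff_keys(checkpoint_keys, loaded_keys):
    """Return (checkpoint tensors nobody loaded, params left uninitialized)."""
    ckpt, loaded = set(checkpoint_keys), set(loaded_keys)
    return sorted(ckpt - loaded), sorted(loaded - ckpt)

# Toy name sets standing in for the real safetensors / state_dict key lists.
checkpoint_keys = ["fc.weight", "midlayer.self_attn.q_proj.weight", "norm.weight"]
loaded_keys = ["fc.weight", "midlayer.self_attn.q_proj.weight"]

unused, missing = diff_keys(checkpoint_keys, loaded_keys)
# A non-empty "unused" list means some draft weights were silently dropped,
# which would directly explain a low acceptance rate.
print("unused checkpoint tensors:", unused)
print("uninitialized params:", missing)
```

If the unused list is non-empty for the EAGLE3 draft checkpoint, the low acceptance rate could simply be uninitialized or dropped draft weights rather than a sampling/config issue.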
I opened a pull request to vllm: https://github.com/vllm-project/vllm/pull/22872
I'm using sglang for evaluation, and my commands are as follows:
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-VL-7B-Instruct \
--chat-template qwen2-vl \
--speculative-draft-model-path Rayzl/qwen2.5-vl-7b-eagle3-sgl \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 4 \
--speculative-eagle-topk 6 \
--speculative-num-draft-tokens 24 \
--trust-remote-code \
--chunked-prefill-size -1 \
--cuda-graph-max-bs 1 \
--tp 1 \
--mem-fraction-static 0.7 \
--host 0.0.0.0 \
--port 8080
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100
The results show that the average accept length is very poor.
(with eagle3)
Average Latency: 165.544 s
Average Output throughput: 39.542 token/s
Average Accept length: 1.128
(without eagle3)
Average Latency: 110.203 s
Average Output throughput: 58.765 token/s
Average Accept length: 1.000
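For context on how bad these numbers are: accept length counts tokens committed per verification step, including the bonus token, so accept length minus one is the average number of draft tokens accepted per step. A quick calculation from the figures above:

```python
# Numbers copied from the MMStar run above (100 questions).
with_eagle3 = {"throughput": 39.542, "accept_len": 1.128}
baseline = {"throughput": 58.765, "accept_len": 1.000}

# Accepted draft tokens per verification step (accept length minus the
# bonus token). Near zero means the draft almost never agrees with the
# target model.
extra = with_eagle3["accept_len"] - 1.0

# Relative throughput: anything below 1.0 means speculation is a net loss,
# since the verify steps add overhead without committing extra tokens.
ratio = with_eagle3["throughput"] / baseline["throughput"]

print(f"accepted draft tokens/step: {extra:.3f}")
print(f"throughput ratio vs baseline: {ratio:.3f}")
```

Roughly 0.13 accepted draft tokens per step with about a 33% throughput loss, i.e. the draft model is contributing almost nothing, which is consistent with either wrong/unused draft weights or a hidden-state mismatch between the draft and target.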
Is there something wrong with my setup?