All the weights in model.safetensors should be used right? When I use this eagle model, the draft acceptance rate is very low.
by Lucas199 - opened
Can you provide a reproduction command?
Environment:
- Python 3.12
- transformers 4.32.0
- vllm main
Command:
vllm serve /workspace/models/Qwen2.5-VL-7B-Instruct \
  --port 5580 --host 0.0.0.0 \
  --max-num-seqs 128 --dtype bfloat16 --max-model-len=32768 \
  --no-enable-prefix-caching --trust-remote-code -tp 4 \
  --speculative-config '{"method": "eagle3", "model": "/workspace/models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 4, "max_model_len": 8192}' \
  --num-lookahead-slots=3 \
  --allowed-local-media-path /workspace/ \
  --gpu-memory-utilization=0.93
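On the original question (whether every tensor in model.safetensors is used): a quick sanity check is to diff the checkpoint's key list against the parameter names the draft model actually loads — any checkpoint key with no consumer was silently dropped. A minimal sketch with made-up key names (in practice you would read the keys via `safetensors.safe_open(...).keys()` and compare against `model.state_dict().keys()`):

```python
def diff_keys(checkpoint_keys, loaded_keys):
    """Return (checkpoint tensors nobody loaded, params left uninitialized)."""
    ckpt, loaded = set(checkpoint_keys), set(loaded_keys)
    return sorted(ckpt - loaded), sorted(loaded - ckpt)

# Toy name sets standing in for the real safetensors / state_dict key lists.
checkpoint_keys = ["fc.weight", "midlayer.self_attn.q_proj.weight", "norm.weight"]
loaded_keys = ["fc.weight", "midlayer.self_attn.q_proj.weight"]

unused, missing = diff_keys(checkpoint_keys, loaded_keys)
# A non-empty "unused" list means some draft weights were silently dropped,
# which would directly explain a low acceptance rate.
print("unused checkpoint tensors:", unused)
print("uninitialized params:", missing)
```

If the unused list is non-empty for the EAGLE3 draft checkpoint, the low acceptance rate could simply be uninitialized or dropped draft weights rather than a sampling/config issue.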
I opened a pull request to vllm: https://github.com/vllm-project/vllm/pull/22872
I'm using sglang for evaluation, and my commands are as follows:
python -m sglang.launch_server \
--model-path Qwen/Qwen2.5-VL-7B-Instruct \
--chat-template qwen2-vl \
--speculative-draft-model-path Rayzl/qwen2.5-vl-7b-eagle3-sgl \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 4 \
--speculative-eagle-topk 6 \
--speculative-num-draft-tokens 24 \
--trust-remote-code \
--chunked-prefill-size -1 \
--cuda-graph-max-bs 1 \
--tp 1 \
--mem-fraction-static 0.7 \
--host 0.0.0.0 \
--port 8080
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100
The results show that the average accept length is very poor.
(with eagle3)
Average Latency: 165.544 s
Average Output throughput: 39.542 token/s
Average Accept length: 1.128
(without eagle3)
Average Latency: 110.203 s
Average Output throughput: 58.765 token/s
Average Accept length: 1.000
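For context on how bad these numbers are: accept length counts tokens committed per verification step, including the bonus token, so accept length minus one is the average number of draft tokens accepted per step. A quick calculation from the figures above:

```python
# Numbers copied from the MMStar run above (100 questions).
with_eagle3 = {"throughput": 39.542, "accept_len": 1.128}
baseline = {"throughput": 58.765, "accept_len": 1.000}

# Accepted draft tokens per verification step (accept length minus the
# bonus token). Near zero means the draft almost never agrees with the
# target model.
extra = with_eagle3["accept_len"] - 1.0

# Relative throughput: anything below 1.0 means speculation is a net loss,
# since the verify steps add overhead without committing extra tokens.
ratio = with_eagle3["throughput"] / baseline["throughput"]

print(f"accepted draft tokens/step: {extra:.3f}")
print(f"throughput ratio vs baseline: {ratio:.3f}")
```

Roughly 0.13 accepted draft tokens per step with about a 33% throughput loss, i.e. the draft model is contributing almost nothing, which is consistent with either wrong/unused draft weights or a hidden-state mismatch between the draft and target.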
Is there something wrong with my setup?