All the weights in model.safetensors should be used, right? When I use this EAGLE model, the draft acceptance rate is very low.

#1
by Lucas199 - opened

All the weights in model.safetensors should be used, right? When I use this EAGLE model, the draft acceptance rate is very low.
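
One way to check which weights the checkpoint actually contains is to read the safetensors header directly. The sketch below relies only on the published file layout (an 8-byte little-endian header length followed by a JSON header) and the Python standard library. The helper names `read_safetensors_header`, `list_tensors`, and `unused_tensors` are hypothetical and are not part of vLLM, SGLang, or the safetensors package.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def list_tensors(path):
    """List (name, dtype, shape) for every tensor stored in the file."""
    header = read_safetensors_header(path)
    return sorted(
        (name, info["dtype"], tuple(info["shape"])) for name, info in header.items()
    )


def unused_tensors(checkpoint_names, loaded_names):
    """Names present in the checkpoint but absent from the set that was loaded."""
    return sorted(set(checkpoint_names) - set(loaded_names))
```

Calling `list_tensors("model.safetensors")` on the draft checkpoint gives the full set of tensor names. Comparing it with whatever names the serving framework reports as loaded (how to obtain that list depends on the framework and is not shown here) would reveal any weights that are silently skipped.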

Can you provide the command to reproduce this?

Environment:

  • Python 3.12
  • transformers 4.32.0
  • vllm main

Command:

vllm serve /workspace/models/Qwen2.5-VL-7B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=32768 \
    --no-enable-prefix-caching --trust-remote-code -tp 4 \
    --speculative-config '{"method": "eagle3", "model": "/workspace/models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 4, "max_model_len": 8192}' \
    --num-lookahead-slots=3 \
    --allowed-local-media-path /workspace/ \
    --gpu-memory-utilization=0.93
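
Because the speculative config is a JSON string embedded in a shell command, quoting mistakes are easy to make. As a minimal sketch, the JSON can be generated from a Python dict and shell-quoted with the standard library. The keys and values are copied from the command above without checking them against vLLM; the `build_serve_command` helper is hypothetical, and only a subset of the flags is reproduced.

```python
import json
import shlex

# Keys and values copied verbatim from the command in this post.
speculative_config = {
    "method": "eagle3",
    "model": "/workspace/models/qwen2.5-vl-7b-eagle3-sgl",
    "prefill_token_shift": False,
    "num_speculative_tokens": 3,
    "draft_tensor_parallel_size": 4,
    "max_model_len": 8192,
}


def build_serve_command(target_model, spec_config):
    """Return a shell-safe command string (hypothetical helper, subset of flags)."""
    spec_json = json.dumps(spec_config)  # emits lowercase true/false as JSON requires
    args = ["vllm", "serve", target_model, "--speculative-config", spec_json]
    # shlex.quote wraps any argument containing spaces or quotes in single quotes.
    return " ".join(shlex.quote(a) for a in args)


command = build_serve_command(
    "/workspace/models/Qwen2.5-VL-7B-Instruct", speculative_config
)
print(command)
```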

I opened a pull request to vLLM: https://github.com/vllm-project/vllm/pull/22872

Maybe you can try the SGLang PR.

I'm using sglang for evaluation, and my commands are as follows:

python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-VL-7B-Instruct \
    --chat-template qwen2-vl \
    --speculative-draft-model-path Rayzl/qwen2.5-vl-7b-eagle3-sgl \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 6 \
    --speculative-num-draft-tokens 24 \
    --trust-remote-code \
    --chunked-prefill-size -1 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 \
    --port 8080
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100

The results show that the token acceptance length is very poor.

(with eagle3)

Average Latency: 165.544 s
Average Output throughput: 39.542 token/s
Average Accept length: 1.128

(without eagle3)

Average Latency: 110.203 s
Average Output throughput: 58.765 token/s
Average Accept length: 1.000
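
To put the two runs side by side, here is a small sketch that uses only the numbers reported above. It assumes that the accept length counts the one token the target model emits at every verification step, which is consistent with the 1.000 baseline; under that assumption, the excess over 1.0 is the average number of draft tokens accepted per step. The `compare` helper is hypothetical.

```python
# Numbers copied from the two runs reported above.
with_eagle3 = {"latency_s": 165.544, "throughput_tok_s": 39.542, "accept_length": 1.128}
without_eagle3 = {"latency_s": 110.203, "throughput_tok_s": 58.765, "accept_length": 1.000}


def compare(spec, base):
    """Summarize a speculative run relative to a baseline run."""
    return {
        # Values below 1.0 mean speculation made generation slower.
        "throughput_ratio": spec["throughput_tok_s"] / base["throughput_tok_s"],
        # Values above 1.0 mean the speculative run took longer.
        "latency_ratio": spec["latency_s"] / base["latency_s"],
        # Assumption: one token per step always comes from the target model.
        "accepted_draft_tokens_per_step": spec["accept_length"] - 1.0,
    }


summary = compare(with_eagle3, without_eagle3)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

With these figures, throughput with eagle3 is roughly 0.67 times the baseline, and only about 0.13 draft tokens are accepted per verification step on average, so the cost of drafting outweighs the gain.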

Is there something wrong with my setup?
