## Introduction

We trained a Qwen2.5-VL-7B EAGLE3 draft model with [SpecForge](https://github.com/sgl-project/SpecForge) on 30k samples randomly selected from FreedomIntelligence/ALLaVA-4V.

## Usage

Run inference with SGLang and benchmark on MMStar.

Start the server:

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --speculative-draft Rayzl/qwen2.5-vl-7b-eagle3-sgl \
  --trust-remote-code \
  --chat-template qwen2-vl \
  --chunked-prefill-size -1 \
  --cuda-graph-max-bs 1 \
  --speculative-algo EAGLE3 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 6 \
  --speculative-num-draft-tokens 24 \
  --tp 1 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 --port 8080
```
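Before running the benchmark, a quick sanity check can be sent through the server's OpenAI-compatible chat endpoint. This is a minimal sketch: the image URL and question are placeholders, and the helper name `build_chat_request` is our own, not part of SGLang.

```python
import json

def build_chat_request(question: str, image_url: str) -> dict:
    """Build an OpenAI-style chat payload for the Qwen2.5-VL server.

    Hypothetical helper; the image URL below is a placeholder.
    """
    return {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = build_chat_request("Describe this image.", "https://example.com/cat.jpg")
print(json.dumps(payload, indent=2))
# Send it with e.g.:
#   requests.post("http://0.0.0.0:8080/v1/chat/completions", json=payload)
```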

Run the benchmark:

```shell
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100
```

## Speedup

- **With EAGLE3**
  - Latency: 34.241 s
  - Output throughput: 181.069 token/s
  - Accept length: 3.219
- **Without EAGLE3**
  - Latency: 54.813 s
  - Output throughput: 121.230 token/s
  - Accept length: 1.000

End-to-end speedup: ~1.5× in output throughput (~1.6× in latency).
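The reported speedup can be checked directly from the numbers above: the latency ratio is about 1.6× and the throughput ratio about 1.5×, consistent with the EAGLE3 run accepting ~3.2 tokens per verification step.

```python
# Numbers reported in the benchmark results above
latency_base, latency_eagle = 54.813, 34.241       # seconds
tput_base, tput_eagle = 121.230, 181.069           # token/s

latency_speedup = latency_base / latency_eagle     # ~1.60x
throughput_speedup = tput_eagle / tput_base        # ~1.49x

print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"throughput speedup: {throughput_speedup:.2f}x")
```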

## Training

Follow the SpecForge guide for training a Qwen2.5-VL EAGLE3 draft model.

Model size: 0.9B params (Safetensors; tensor types: I64, BF16, BOOL).