## Introduction

We trained a Qwen2.5-VL-7B EAGLE3 draft model with [SpecForge](https://github.com/sgl-project/SpecForge) on 30k samples randomly selected from FreedomIntelligence/ALLaVA-4V.

## Usage

Run inference with SGLang and benchmark on MMStar.

Start the server:

```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --speculative-draft Rayzl/qwen2.5-vl-7b-eagle3-sgl \
  --trust-remote-code \
  --chat-template qwen2-vl \
  --chunked-prefill-size -1 \
  --cuda-graph-max-bs 1 \
  --speculative-algo EAGLE3 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 6 \
  --speculative-num-draft-tokens 24 \
  --tp 1 \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 --port 8080
```
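Before running the benchmark, a quick sanity check can be sent through the server's OpenAI-compatible chat endpoint. This is a minimal sketch: the image URL and question are placeholders, and the helper name `build_chat_request` is our own, not part of SGLang.

```python
import json

def build_chat_request(question: str, image_url: str) -> dict:
    """Build an OpenAI-style chat payload for the Qwen2.5-VL server.

    Hypothetical helper; the image URL below is a placeholder.
    """
    return {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = build_chat_request("Describe this image.", "https://example.com/cat.jpg")
print(json.dumps(payload, indent=2))
# Send it with e.g.:
#   requests.post("http://0.0.0.0:8080/v1/chat/completions", json=payload)
```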

Run the benchmark:

```shell
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100
```

## Speedup

- **With EAGLE3**
  - Latency: 34.241 s
  - Output throughput: 181.069 token/s
  - Accept length: 3.219
- **Without EAGLE3**
  - Latency: 54.813 s
  - Output throughput: 121.230 token/s
  - Accept length: 1.000

End-to-end speedup: ~1.5× in output throughput (~1.6× in latency).
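The reported speedup can be checked directly from the numbers above: the latency ratio is about 1.6× and the throughput ratio about 1.5×, consistent with the EAGLE3 run accepting ~3.2 tokens per verification step.

```python
# Numbers reported in the benchmark results above
latency_base, latency_eagle = 54.813, 34.241       # seconds
tput_base, tput_eagle = 121.230, 181.069           # token/s

latency_speedup = latency_base / latency_eagle     # ~1.60x
throughput_speedup = tput_eagle / tput_base        # ~1.49x

print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"throughput speedup: {throughput_speedup:.2f}x")
```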

## Training

Follow the SpecForge guide for training a Qwen2.5-VL EAGLE3 draft model.

Model size: 0.9B params (Safetensors; tensor types: I64, BF16, BOOL).