Paper: [EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees](https://arxiv.org/abs/2406.16858)
This is a conversion of lmsys/Qwen3-235B-A22B-EAGLE3 to the vLLM speculators format, for use with vLLM's Eagle3 speculative decoding implementation:
```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    speculative_config={
        "method": "eagle3",
        "model": "nm-testing/Qwen3-235B-A22B-EAGLE3-converted-speculators-lmsys",
        "num_speculative_tokens": 3,
    },
    tensor_parallel_size=2,
)
```
Or via command line:
```shell
python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 2 \
    --model-dir "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8" \
    --eagle-dir "nm-testing/Qwen3-235B-A22B-EAGLE3-converted-speculators-lmsys" \
    --num-spec-tokens 3
```
The original Eagle3 config format has been converted to the vLLM speculators format with the following changes:
- Renamed the architecture from `LlamaForCausalLMEagle3` to `Eagle3Speculator`
- Restructured the configuration into `transformer_layer_config` and `speculators_config` sections
- Moved `eagle_config.eagle_aux_hidden_state_layer_ids` to the top-level `eagle_aux_hidden_state_layer_ids`

The repository contains:

- `config.json`: model configuration in speculators format
- `model.safetensors`: model weights (unchanged from the original)

If you use this model, please cite the original Eagle3 paper and the LMSYS team:
```bibtex
@article{li2024eagle,
  title={EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2406.16858},
  year={2024}
}
```
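As an illustration of the config changes described above, here is a minimal Python sketch of the rewrite (this is not the official conversion script; the example values and the exact set of fields moved into `transformer_layer_config` and `speculators_config` are assumptions):

```python
# Illustrative sketch of the Eagle3 -> speculators config rewrite described
# above. Field names inside transformer_layer_config/speculators_config and
# the example values are assumptions, not the exact converted config.json.

def convert_eagle3_config(original: dict) -> dict:
    """Rewrite an original Eagle3 config dict into the speculators layout."""
    converted = dict(original)

    # Architecture renamed from LlamaForCausalLMEagle3 to Eagle3Speculator.
    converted["architectures"] = ["Eagle3Speculator"]

    # eagle_config.eagle_aux_hidden_state_layer_ids is hoisted to top level.
    eagle_cfg = converted.pop("eagle_config", {})
    if "eagle_aux_hidden_state_layer_ids" in eagle_cfg:
        converted["eagle_aux_hidden_state_layer_ids"] = eagle_cfg[
            "eagle_aux_hidden_state_layer_ids"
        ]

    # Transformer hyperparameters are nested under transformer_layer_config;
    # speculator-level settings go under speculators_config.
    converted["transformer_layer_config"] = {
        k: converted.pop(k)
        for k in ("hidden_size", "num_attention_heads", "num_hidden_layers")
        if k in converted
    }
    converted["speculators_config"] = {"speculators_model_type": "eagle3"}
    return converted


# Hypothetical original config for illustration only.
original = {
    "architectures": ["LlamaForCausalLMEagle3"],
    "hidden_size": 4096,
    "num_attention_heads": 64,
    "num_hidden_layers": 1,
    "eagle_config": {"eagle_aux_hidden_state_layer_ids": [1, 23, 44]},
}
print(convert_eagle3_config(original)["architectures"])  # ['Eagle3Speculator']
```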
License: same as the original model, lmsys/Qwen3-235B-A22B-EAGLE3.