Eagle3-Qwen3-4B-Instruct-2507-zh
Introduction
Eagle3-Qwen3-4B-Instruct-2507-zh provides retrained Eagle-3 draft weights for the open-source Qwen3-4B-Instruct-2507 model. Used with the Eagle-3 speculative decoding algorithm, it accelerates both Chinese and English text generation during the decoding phase of large-language-model inference. Because it was trained on a mixed Chinese/English dataset, it is tuned for a high Chinese acceptance rate and is suitable for inference workloads that involve both languages.
Note: The configuration files in the project follow the SGLang framework. If you intend to use other frameworks like vLLM, please modify the relevant configurations.
Training Configuration
We used the open-source codebase SpecForge to train and obtain Eagle-3 weights compatible with the Qwen3-4B-Instruct-2507 (non-thinking mode), providing Chinese/English text acceleration capabilities.
- Dataset: A training set with a 1:1 ratio of English to Chinese questions (68K samples in total) was constructed by sampling 32K examples each from the ShareGPT-68K (English) and ShareGPT-Chinese-English-90k (Chinese) datasets. The prompts are taken from the original datasets, while the responses are regenerated by the Qwen3-4B-Instruct-2507 model.
- Training environment: Training ran on 2 NVIDIA RTX 4090 GPUs with 24GB of VRAM each and took approximately 3 days.
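The 1:1 dataset construction described above can be sketched as follows. This is a minimal illustration with invented function names, not code from the SpecForge pipeline; the real pipeline additionally regenerates each response with the target model:

```python
import random

def build_mixed_dataset(en_pool, zh_pool, per_lang=32_000, seed=0):
    """Sample an equal number of prompts from each language pool and shuffle.

    `en_pool` / `zh_pool` are lists of prompt strings; for the real run,
    per_lang=32_000 yields the 68K-sample 1:1 training set.
    """
    rng = random.Random(seed)
    mixed = rng.sample(en_pool, per_lang) + rng.sample(zh_pool, per_lang)
    rng.shuffle(mixed)
    return mixed

# Tiny demo with per_lang=2 instead of 32K
en = [f"en-{i}" for i in range(10)]
zh = [f"zh-{i}" for i in range(10)]
data = build_mixed_dataset(en, zh, per_lang=2)
print(len(data))  # 4
```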
Inference Launch Command
To launch the EAGLE-3 speculative decoding service with SGLang:

```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B-Instruct-2507 \
  --speculative-algo EAGLE3 \
  --speculative-draft Zjcxy-SmartAI/Eagle3-Qwen3-4B-Instruct-2507-zh \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --dtype float16 \
  --port 30000 \
  --mem-fraction 0.7 \
  --cuda-graph-max-bs 16 \
  --cuda-graph-bs {1,2,3,4}
```
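Once the server is up, it exposes SGLang's OpenAI-compatible API on the chosen port. A minimal client sketch (the helper names are ours; the `/v1/chat/completions` path follows the OpenAI-compatible convention SGLang implements, and the `query` call requires a running server):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen3-4B-Instruct-2507") -> dict:
    """Build an OpenAI-style chat-completions payload for the SGLang server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def query(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST to the server's OpenAI-compatible endpoint and return the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Print the payload only; call query(...) against a live server.
    payload = build_chat_request("用中文介绍一下投机解码。")
    print(json.dumps(payload, ensure_ascii=False, indent=2))
```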
To launch the original model service with SGLang (the baseline for the comparative experiments):

```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B-Instruct-2507 \
  --port 30000 \
  --mem-fraction 0.7 \
  --cuda-graph-max-bs 16 \
  --cuda-graph-bs {1,2,3,4}
```
Performance Evaluation
We conducted performance testing on a single NVIDIA RTX 4090 GPU using the multi-turn dialogue MT-bench dataset, in both its English and Chinese versions. The Chinese version was produced by translating the dataset with the official Tongyi Qianwen (Qwen) service and then manually revising the output. The test results are presented below:
| Dataset | Parallelism | Throughput per request (tokens/s) | Accept length | Inference speedup |
|---|---|---|---|---|
| MT-bench-zh | 1 | 211.18 | 3.33 | 2.08x |
| MT-bench-zh | 4 | 171.92 | 3.33 | 1.94x |
| MT-bench-en | 1 | 208.06 | 3.27 | 2.05x |
| MT-bench-en | 4 | 169.37 | 3.27 | 1.91x |
The original model performance is as follows:
| Dataset | Parallelism | Throughput per request (tokens/s) |
|---|---|---|
| MT-bench-zh | 1 | 101.3 |
| MT-bench-zh | 4 | 88.62 |
| MT-bench-en | 1 | 101.32 |
| MT-bench-en | 4 | 88.51 |
Comparing the two sets of results, we draw the following conclusion:
- With only 68K training samples, the model performs evenly on the Chinese and English benchmarks, delivering close to a 2x improvement in inference throughput and demonstrating stable acceleration for both Chinese and English text tasks.
Note:
- Calculation Formula: Inference Speedup = Optimized throughput (tokens/s) / Baseline throughput (tokens/s)
- Through extensive testing, the selected service parameters achieve the best inference speed in this setup, at the cost of a slightly shorter accept length.
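Applying the speedup formula above to the reported throughputs reproduces the table values:

```python
def speedup(optimized_tps: float, baseline_tps: float) -> float:
    """Inference speedup = optimized throughput / baseline throughput."""
    return round(optimized_tps / baseline_tps, 2)

# Eagle-3 throughput vs. original-model throughput from the tables above
print(speedup(211.18, 101.30))  # 2.08  (MT-bench-zh, parallelism 1)
print(speedup(169.37, 88.51))   # 1.91  (MT-bench-en, parallelism 4)
```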
Relevant Links
Qwen3-4B-Instruct-2507 open-source weights: https://www.modelscope.cn/models/Qwen/Qwen3-4B-Instruct-2507
EAGLE open-source repository: https://github.com/SafeAILab/EAGLE
SpecForge framework: https://github.com/sgl-project/SpecForge