Eagle3-Qwen3-4B-Instruct-2507-zh

Introduction

Eagle3-Qwen3-4B-Instruct-2507-zh is an EAGLE-3 draft model trained for the open-source Qwen3-4B-Instruct-2507. It works with the EAGLE-3 speculative decoding algorithm to accelerate the decoding phase of large language model inference. Trained on a mixed Chinese and English dataset, it is optimized for Chinese acceptance rates and is suitable for inference tasks involving both Chinese and English text.

Note: The configuration files in this project are written for the SGLang framework. If you intend to use another framework such as vLLM, adjust the relevant configuration accordingly.

Training Configuration

We used the open-source codebase SpecForge to train and obtain Eagle-3 weights compatible with the Qwen3-4B-Instruct-2507 (non-thinking mode), providing Chinese/English text acceleration capabilities.

  • Dataset: A training set with a 1:1 ratio of English and Chinese questions (68K samples in total) was constructed by selecting 32K samples each from the ShareGPT-68K (English) and ShareGPT-Chinese-English-90k (Chinese) datasets. The prompts are taken from the original datasets, while the outputs are regenerated by the Qwen3-4B-Instruct-2507 model.
  • Training environment: 2 NVIDIA RTX 4090 GPUs (24GB VRAM each); the total training duration was approximately 3 days.
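The dataset construction described above can be sketched as follows. This is a minimal illustration, not the SpecForge pipeline itself; the field names (`prompt`, `output`) and the regeneration step are assumptions, and in practice the `output` field would be filled by running the prompts through Qwen3-4B-Instruct-2507.

```python
import random

def build_mixed_dataset(en_samples, zh_samples, per_lang=32_000, seed=0):
    """Sample an equal number of prompts from each language pool.

    Outputs are left as None placeholders: in the actual pipeline they
    are regenerated by the target model (Qwen3-4B-Instruct-2507) rather
    than copied from the source datasets.
    """
    rng = random.Random(seed)
    picked = (rng.sample(en_samples, min(per_lang, len(en_samples)))
              + rng.sample(zh_samples, min(per_lang, len(zh_samples))))
    rng.shuffle(picked)
    return [{"prompt": s["prompt"], "output": None} for s in picked]

# Toy illustration: 3 samples per language instead of 32K.
en = [{"prompt": f"en-{i}"} for i in range(10)]
zh = [{"prompt": f"zh-{i}"} for i in range(10)]
mixed = build_mixed_dataset(en, zh, per_lang=3)
print(len(mixed))  # 6
```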

Inference Launch Command

To launch the EAGLE-3 algorithm service using SGLang, use the following command:

python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-4B-Instruct-2507 \
--speculative-algo EAGLE3 \
--speculative-draft Zjcxy-SmartAI/Eagle3-Qwen3-4B-Instruct-2507-zh \
--speculative-num-steps 5 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
--dtype float16 \
--port 30000 \
--mem-fraction 0.7 \
--cuda-graph-max-bs 16 \
--cuda-graph-bs {1,2,3,4}
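Once the server is up, it can be queried through SGLang's OpenAI-compatible chat completions endpoint. Below is a minimal client sketch (standard library only) assuming the port from the command above; any OpenAI-style client would work equally well.

```python
import json
import urllib.request

# Request against the SGLang server launched above (port 30000).
# Speculative decoding is transparent to the client: the request is the
# same as for the plain model.
payload = {
    "model": "Qwen/Qwen3-4B-Instruct-2507",
    "messages": [{"role": "user", "content": "Introduce Hangzhou in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        reply = json.loads(resp.read())["choices"][0]["message"]["content"]
except OSError:
    reply = "server not reachable; start it with the launch command above"
print(reply)
```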

To launch the original model service (for comparative experiments) using SGLang, use the following command:

python3 -m sglang.launch_server \
--model-path Qwen/Qwen3-4B-Instruct-2507 \
--port 30000 \
--mem-fraction 0.7 \
--cuda-graph-max-bs 16 \
--cuda-graph-bs {1,2,3,4}

Performance Evaluation

We conducted performance testing on a single NVIDIA RTX 4090 GPU, using the multi-turn dialogue MT-bench dataset in both its English and Chinese versions. The Chinese version was first translated via the official Tongyi Qianwen service and then revised manually. The test results are presented below:

Dataset      Parallel  Throughput/parallel (tokens/s)  Accept length  Inference speedup
MT-bench-zh  1         211.18                          3.33           2.08x
MT-bench-zh  4         171.92                          3.33           1.94x
MT-bench-en  1         208.06                          3.27           2.05x
MT-bench-en  4         169.37                          3.27           1.91x

The original model performance is as follows:

Dataset      Parallel  Throughput/parallel (tokens/s)
MT-bench-zh  1         101.30
MT-bench-zh  4         88.62
MT-bench-en  1         101.32
MT-bench-en  4         88.51

Through comparison, we have drawn the following conclusions:

  • With a 68K-sample training set, the model performs evenly on the Chinese and English benchmarks, achieving close to a 2x improvement in inference throughput. This demonstrates stable acceleration for both Chinese and English text tasks.

Note:

  • Calculation Formula: Inference Speedup = Optimized throughput (tokens/s) / Baseline throughput (tokens/s)
  • Through extensive testing, the selected service parameters achieve the best inference speed on this setup, at the cost of a slightly shorter accept length.
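The speedup formula above can be checked directly against the two throughput tables:

```python
# Speedup = optimized throughput / baseline throughput, using the
# figures reported in the tables above.
pairs = {
    "MT-bench-zh, parallel=1": (211.18, 101.30),
    "MT-bench-zh, parallel=4": (171.92, 88.62),
    "MT-bench-en, parallel=1": (208.06, 101.32),
    "MT-bench-en, parallel=4": (169.37, 88.51),
}
speedups = {name: round(opt / base, 2) for name, (opt, base) in pairs.items()}
for name, s in speedups.items():
    print(f"{name}: {s}x")  # 2.08x, 1.94x, 2.05x, 1.91x
```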

Relevant Links

Qwen3-4B-Instruct-2507 open-source weights: https://www.modelscope.cn/models/Qwen/Qwen3-4B-Instruct-2507

EAGLE open-source repository: https://github.com/SafeAILab/EAGLE

SpecForge framework: https://github.com/sgl-project/SpecForge
