# EAGLE3 Draft Head — MiniMax-M2.5

A lightweight EAGLE3 draft head for MiniMax-M2.5 (229B MoE, ~10B active parameters). Trained with SpecForge on 8x H200 GPUs using the EAGLE-3 Training-Time Test (TTT) objective.
Blog post: 2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5
## Usage

### SGLang (GPU)
Requires our SGLang fork for MiniMax-M2.5 Eagle3 support + FP8 dtype fixes.
B=1 server (wide tree — optimal for single-user, real-time requests):

```shell
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 8 \
  --speculative-eagle-topk 4 \
  --quantization fp8 \
  --tp 4 \
  --port 30000
```
B=32 server (narrow tree — optimal for batch workloads):

```shell
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --quantization fp8 \
  --tp 4 \
  --port 30002
```
Important: Use different speculative configs for B=1 vs B=32. A wider tree (topk=4) exploits idle GPU compute at low batch; a narrow tree (topk=1) minimizes MoE expert dispatch overhead at high batch.
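To see why the two trees behave so differently, it helps to count candidate nodes. A minimal sketch, assuming the draft tree expands `topk` children per node at each step and is then pruned to the best `draft_tokens` candidates before verification (the pruning detail is an assumption about SGLang's Eagle implementation, not taken from this card):

```python
def max_tree_nodes(steps: int, topk: int) -> int:
    # Upper bound on candidate draft nodes before the draft_tokens cap:
    # each step expands every frontier node into topk children.
    return sum(topk ** d for d in range(1, steps + 1))

# B=1 wide tree: steps=3, topk=4 -> up to 4 + 16 + 64 = 84 candidates,
# pruned down to the best 8 (draft_tokens=8). Lots of speculative work,
# affordable only when the GPU is otherwise idle.
print(max_tree_nodes(3, 4))  # 84

# B=32 narrow tree: steps=5, topk=1 -> a single chain of 5 drafts,
# so MoE expert dispatch overhead stays minimal per request.
print(max_tree_nodes(5, 1))  # 5
```

The wide tree verifies many alternatives per step, while the narrow tree is effectively chain drafting; at high batch the verification FLOPs are no longer free, which is why the narrow configuration wins there.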
### Python Client
```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
## Training Details
| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=4, DP=2) |
| Dataset | 20K regenerated samples (target-model responses at temp=0.8) |
| Pre-training | 9 epochs on 54K mixed data (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Fine-tuning | 6 epochs on 20K regenerated data |
| Learning rate | 2e-5 (final stage) |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
## Training Method
EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 1, 30, 58 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
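The fusion of the three auxiliary hidden states can be sketched as concatenation followed by a learned down-projection into the draft head's hidden size. All shapes below are illustrative (the draft hidden size of 3072 comes from the architecture table; the target model's true hidden size may differ, and the real fusion weight is learned, not random):

```python
import numpy as np

# Sketch of EAGLE3-style feature fusion under assumed shapes: hidden
# states captured from three target layers (early/middle/late) are
# concatenated and projected down to the draft head's hidden size.
rng = np.random.default_rng(0)
target_hidden, draft_hidden, seq = 3072, 3072, 4  # hypothetical dims

h_early, h_mid, h_late = (
    rng.standard_normal((seq, target_hidden)) for _ in range(3)
)
# Hypothetical learned fusion weight: [3 * target_hidden, draft_hidden]
W_proj = rng.standard_normal((3 * target_hidden, draft_hidden)) * 0.01

fused = np.concatenate([h_early, h_mid, h_late], axis=-1) @ W_proj
print(fused.shape)  # (4, 3072)
```

The single draft layer then attends over these fused features plus the previously drafted tokens, which is what makes a one-layer head competitive with the full 229B target for short horizons.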
## Performance

### Training Accuracy (base checkpoint, before regenerated-data fine-tuning)
| Position | Accuracy |
|---|---|
| acc_0 | 0.820 |
| acc_1 | 0.809 |
| acc_2 | 0.781 |
| acc_3 | 0.789 |
| acc_4 | 0.777 |
| acc_5 | 0.761 |
| acc_6 | 0.730 |
The released model was fine-tuned for 6 additional epochs on 20K regenerated samples from the target model. The fine-tuned accuracy is expected to be equal to or higher than these base values.
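These per-position accuracies translate into an expected accepted length. A back-of-envelope estimate, under the simplifying assumptions that position-k acceptance requires all earlier positions to be accepted and that the table's accuracies multiply independently:

```python
# Per-position draft accuracies (acc_0..acc_6) from the table above.
acc = [0.820, 0.809, 0.781, 0.789, 0.777, 0.761, 0.730]

expected = 0.0
prefix = 1.0
for a in acc:
    prefix *= a          # P(positions 0..k are all accepted)
    expected += prefix   # each fully accepted prefix yields one more token

print(round(expected, 2))  # ~3.15 expected accepted tokens per step
```

Roughly three extra tokens accepted per verification step is consistent with the ~2x end-to-end speedups reported below, once draft and verification overheads are accounted for.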
### Inference Benchmarks (B=1, temp=0, TP=4)
With draft_tokens=8 (best B=1 config):
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 109.3 | 230.6 | 2.11x |
| MT-Bench | 109.9 | 195.6 | 1.78x |
| SWEBench-Verified | 109.6 | 191.8 | 1.75x |
| Aider | 109.9 | 186.8 | 1.70x |
Config: steps=3, topk=4, draft_tokens=8. 8x H200 (TP=4).
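The speedup column is simply the throughput ratio; recomputing it from the table above confirms the reported figures:

```python
# (baseline tok/s, EAGLE3 tok/s) per dataset, from the table above.
results = {
    "HumanEval":         (109.3, 230.6),
    "MT-Bench":          (109.9, 195.6),
    "SWEBench-Verified": (109.6, 191.8),
    "Aider":             (109.9, 186.8),
}
for name, (base, eagle) in results.items():
    print(f"{name}: {eagle / base:.2f}x")  # matches the Speedup column
```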
With draft_tokens=6 (verified 2026-04-12):
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 109.6 | 177.0 | 1.61x |
| Terminal-Bench | 108.9 | 160.8 | 1.48x |
| MT-Bench | 109.0 | 146.8 | 1.35x |
| SWEBench-Verified | 109.1 | 123.1 | 1.13x |
Config: steps=3, topk=4, draft_tokens=6. 4x H200 (TP=4). Server-side Prometheus metrics.
## Model Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 3072 |
| Num hidden layers | 1 |
| Num attention heads | 24 (8 KV heads) |
| Intermediate size | 8192 |
| Auxiliary layers | [1, 30, 58] |
| Vocab size | 200064 (target) / 32000 (draft) |
| Checkpoint size | ~464 MB |
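The checkpoint size is consistent with the architecture. A rough parameter count implied by the file size, assuming the ~464 MB checkpoint stores bfloat16 weights (2 bytes per parameter) and MB means MiB:

```python
# Back-of-envelope: checkpoint bytes / bytes-per-bf16-param.
checkpoint_bytes = 464 * 1024**2
approx_params = checkpoint_bytes / 2
print(f"~{approx_params / 1e6:.0f}M parameters")  # ~243M
```

Around a quarter-billion parameters for the draft head, versus ~10B active in the target per token, is why the drafting passes are nearly free relative to verification.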
## Limitations
- TP=4 only. TP=8 fails due to an FP8 block size constraint (`intermediate_size / 8 = 192`, not divisible by `block_n=128`).
- Temperature sensitivity. Best performance at temp=0 (greedy). At temp=0.7, B=1 speedup drops to 1.27-1.80x and some B=32 datasets regress below baseline.
- Coding-focused benchmarks. All benchmarks use coding-oriented datasets (HumanEval, SWEBench, Aider). Conversational workloads may show different patterns.
- SPEC_V2 incompatible. The overlap scheduler (`SGLANG_ENABLE_SPEC_V2=true`) is not supported — standard (non-overlapped) speculation only.
- Requires SGLang fork. Upstream SGLang does not yet include the FP8 dtype patches needed for Eagle3 on this model.
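The TP=8 failure is a simple divisibility check. A sketch of the constraint, assuming a per-expert intermediate size of 1536 (inferred from `1536 / 8 = 192` in the limitation above; the exact dimension is an assumption) and FP8 block quantization with `block_n=128`:

```python
# FP8 block quantization requires each tensor-parallel shard of the
# intermediate dimension to be a multiple of the quantization block.
BLOCK_N = 128
intermediate_size = 1536  # hypothetical per-expert dimension

for tp in (4, 8):
    shard = intermediate_size // tp
    ok = shard % BLOCK_N == 0
    print(f"TP={tp}: shard={shard}, divisible by {BLOCK_N}: {ok}")
# TP=4 gives shards of 384 (3 blocks); TP=8 gives 192, which leaves a
# 64-wide remainder and cannot be block-quantized.
```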
## License
This draft head is released under Apache 2.0, matching the MiniMax-M2.5 license.
## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```