2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5
Speculative Decoding in 60 Seconds
If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.
LLM inference is memory-bandwidth bound, not compute-bound. Your GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass — the same cost as generating one token normally.
The output is mathematically identical to what the target model would produce without speculation. This is a guarantee from the accept/reject algorithm, not an approximation.
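The guarantee comes from a rejection-sampling rule. A minimal sketch, using toy dict-based distributions and a helper name of our own (this is not SGLang's implementation):

```python
import random

def verify_draft(p_target, q_draft, token):
    """Accept a drafted token with probability min(1, p/q); on reject,
    resample from the residual max(0, p - q), renormalized.
    Distributions are dicts mapping token -> probability."""
    p = p_target.get(token, 0.0)
    q = max(q_draft.get(token, 0.0), 1e-9)  # guard against division by zero
    if random.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the renormalized residual distribution
    residual = {t: max(0.0, p_target.get(t, 0.0) - q_draft.get(t, 0.0))
                for t in p_target}
    z = sum(residual.values())
    if z == 0:
        return token, False
    resampled = random.choices(list(residual), weights=list(residual.values()))[0]
    return resampled, False
```

Because acceptance plus residual resampling reproduces the target distribution exactly, speculation changes latency, never outputs.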
EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. The draft head is tiny (~464 MB for MiniMax) and co-deploys on the same GPU.
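The three-layer conditioning can be pictured as simple concatenation. A toy sketch (our own helper name; the aux layer indices match the configuration table below):

```python
def draft_input(hidden_states, aux_layers=(1, 30, 58)):
    """EAGLE3-style feature fusion, sketched: concatenate the target
    model's hidden states from early, middle, and late layers into a
    single input vector for the draft head."""
    return [x for layer in aux_layers for x in hidden_states[layer]]
```

In the real model this is a tensor concatenation followed by a learned projection down to the draft head's hidden size; the sketch only shows the wiring.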
For the full algorithm walkthrough, accept/reject rule, and math behind the speedup curve, see our previous post on EAGLE3 for GLM-4.7-Flash.
Results
We are releasing thoughtworks/MiniMax-M2.5-Eagle3 — an EAGLE3 draft head for the MiniMax-M2.5 Mixture-of-Experts model.
B=1: Up to 2.11x Throughput
Single-user (B=1), temperature 0, FP8, TP=4, server-side Prometheus metrics. Tree config: steps=3, topk=4, draft_tokens=8.
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 109.3 | 230.6 | 2.11x |
| MT-Bench | 109.9 | 195.6 | 1.78x |
| SWEBench-Verified | 109.6 | 191.8 | 1.75x |
| SWEBench-Multilingual | 109.7 | 170.0 | 1.55x |
| SWEBench-Pro | 110.0 | 183.7 | 1.67x |
| Terminal-Bench | 109.8 | 174.6 | 1.59x |
| Aider | 109.9 | 186.8 | 1.70x |
Hardware: 8x NVIDIA H200 144GB, TP=4 per server, native FP8 quantization. Draft head: ~464 MB, co-deployed on the same GPUs.
B=32: Parity Across All Datasets
The conventional wisdom is that MoE models regress under speculative decoding at larger batch sizes, because tree verification activates extra experts. At temperature 0, that regression disappears entirely. Tree config: steps=5, topk=1, draft_tokens=6.
| Dataset | Ratio vs Baseline |
|---|---|
| HumanEval | 1.14x |
| MT-Bench | 1.15x |
| SWEBench-Verified | 1.06x |
| SWEBench-Multilingual | 1.02x |
| SWEBench-Pro | 1.07x |
| Terminal-Bench | 1.22x |
| Aider | 1.06x |
The Mixed Inference Config
The key finding: use different tree shapes for different batch sizes.
At B=1, the GPU is mostly idle — a wider tree (topk=4, branching into 4 candidates per step) exploits that spare compute to propose more tokens in parallel. At B=32, the GPU is saturated — a wider tree triggers more MoE expert dispatches per verification step, creating overhead. A narrow chain (topk=1) minimizes this overhead while still getting the benefit of speculative decoding.
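The size gap between the two shapes is easy to quantify. A simplified sketch of the full-tree node count (our own helper, not SGLang's internals):

```python
def tree_nodes(steps: int, topk: int) -> int:
    """Upper bound on drafted candidates: topk branches per step,
    so topk + topk^2 + ... + topk^steps nodes in a full tree."""
    return sum(topk ** s for s in range(1, steps + 1))

# Wide tree (B=1):    steps=3, topk=4 -> 84 candidate nodes
# Narrow chain (B=32): steps=5, topk=1 -> 5 candidate nodes
```

Our understanding is that the draft-token budget (8 and 6 in the two configs) caps how many of these nodes are actually verified, but the wide tree still spreads verification across far more distinct expert sets per step.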
In production, this means running two server pools: one for real-time single-user requests (wide tree), one for batch workloads (narrow tree). Both use the same draft model checkpoint.
Configuration
| Parameter | Value |
|---|---|
| Target model | MiniMaxAI/MiniMax-M2.5 (229B total, ~10B active) |
| Architecture | MoE: 256 experts, top-8 routing, sigmoid scoring, 62 layers |
| Draft head | 1 layer, hidden_size=3072, aux layers [1, 30, 58] |
| Hardware | 8x H200 144GB, TP=4 per server |
| Training data | 20K regenerated samples (target-model responses at temp=0.8) |
| Training | 9 epochs original data + 6 epochs regenerated, LR=2e-5 |
| SGLang version | v0.5.6 (tails-mpt/sglang) |
Why Temperature Changes Everything
The EAGLE-3 paper reports 15-22% better speedups at temperature 0 on dense models. For MoE, the effect is far more dramatic.
Using the same draft model (Exp C) at both temperatures:
| Dataset | temp=0.7 | temp=0 |
|---|---|---|
| HumanEval | 1.12x | 1.33x |
| MT-Bench | 1.08x | 1.30x |
| SWEBench-Verified | 1.12x | 1.24x |
| SWEBench-Multilingual | 1.07x | 1.20x |
| SWEBench-Pro | 1.07x | 1.21x |
| Terminal-Bench | 1.14x | 1.29x |
| Aider | 1.13x | 1.34x |
At temperature 0 (greedy decoding), the draft model's job is easier — there is one right answer, not a distribution over plausible continuations. Acceptance rates go up, fewer tokens are rejected, and fewer distinct expert unions are activated during MoE verification. The "MoE verification wall" at B=32 essentially disappears: it goes from 0.84-1.07x (regression) at temp=0.7 to 0.99-1.05x (parity) at temp=0.
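Under greedy decoding the accept/reject rule collapses to an exact-match check, which is why acceptance climbs. A sketch with toy logit lists (our own function name):

```python
def greedy_accept(target_logits, draft_tokens):
    """At temperature 0 a drafted token is kept iff it equals the
    target argmax at that position. Returns the accepted prefix length."""
    n = 0
    for logits, tok in zip(target_logits, draft_tokens):
        if max(range(len(logits)), key=logits.__getitem__) != tok:
            break  # first mismatch ends the accepted prefix
        n += 1
    return n
```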
For coding use cases — MiniMax-M2.5's primary market — temperature 0 is the production setting. All our previous experiments at temp=0.7 (the benchmark script default) understated real-world performance.
Training Progression
We trained seven experimental checkpoints (Exp A through Exp G), progressively improving the draft head.
Training accuracy across the first three experiments (original data, same prompts):
| Position | Exp A (3 ep) | Exp B (6 ep) | Exp C (9 ep) |
|---|---|---|---|
| acc_0 | 0.778 | 0.813 | 0.820 |
| acc_3 | 0.678 | 0.749 | 0.789 |
| acc_6 | 0.626 | 0.693 | 0.730 |
Exp D (12 epochs) showed diminishing returns — accept rates slightly declined, suggesting overfitting. Exp C (9 epochs) is the sweet spot for original data.
The breakthrough came from regenerated training data: instead of generic assistant responses, we generated 20K new samples using MiniMax-M2.5 itself (temp=0.8). Continuing from Exp C on this regenerated data:
- Exp E (+3 epochs, LR=5e-5): B=32 improved from 0.99-1.05x to 0.99-1.15x
- Exp F (+3 more epochs, LR=2e-5): best overall — 1.19-1.45x B=1, 1.00-1.14x B=32 (uniform config)
The mixed config then pushed Exp F's B=1 from 1.45x to 2.11x.
Negative result: Exp G used teacher temperature T=2.0 to soften the training target distribution. It regressed at temp=0 (1.33-1.77x vs Exp F's 1.55-2.11x) because softer targets dilute the argmax accuracy that greedy decoding depends on. Exp F remains the best checkpoint.
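The Exp G effect is just the shape of a temperature-scaled softmax: as T grows, probability mass flattens and the top-token margin shrinks. A quick illustration with toy logits:

```python
import math

def soften(logits, T=1.0):
    """Softmax of logits/T: higher T flattens the distribution,
    shrinking the argmax margin that greedy acceptance depends on."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits [2.0, 0.0], the top-token probability drops from ~0.88 at T=1 to ~0.73 at T=2 — mass a greedy-trained draft head can no longer commit to.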
Engineering Notes
FP8 Fix
Eagle3 + FP8 target models caused dtype mismatches in SGLang's CUDA graph replay. We patched four files:
- logits_processor.py: bfloat16 cast in auxiliary hidden-state path
- eagle_worker.py: idle input dtype → bfloat16 (not target model dtype)
- eagle_info.py: same idle input dtype fix
- llama.py: cast shared embedding/lm_head to bfloat16 in set_embed()
These patches enable native FP8 serving with Eagle3 for any model — not MiniMax-specific.
TP=4 constraint
TP=8 fails for MiniMax-M2.5 because intermediate_size / 8 = 1536 / 8 = 192, which is not divisible by the FP8 block size (block_n=128). TP=4 works cleanly and uses 4 GPUs per server, leaving 4 GPUs for a second server (enabling the dual B=1/B=32 config on a single 8-GPU node).
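The constraint is pure arithmetic; a sketch of the divisibility check (block_n=128 from the FP8 block-quantization scheme, intermediate_size=1536 from the model config):

```python
def fp8_tp_ok(intermediate_size: int, tp: int, block_n: int = 128) -> bool:
    """Each tensor-parallel shard's width must be a whole multiple
    of the FP8 quantization block size."""
    if intermediate_size % tp != 0:
        return False
    return (intermediate_size // tp) % block_n == 0

# MiniMax-M2.5: intermediate_size = 1536
# TP=8 -> shard width 192, not a multiple of 128 -> fails
# TP=4 -> shard width 384 = 3 * 128 -> ok
```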
SPEC_V2 overlap scheduler
SGLANG_ENABLE_SPEC_V2=true (the overlap scheduler that pipelines draft and verification) crashes with NotImplementedError in update_verify_buffers_to_fill_after_draft. The attention backend is incompatible with V2's buffer management. This is a known limitation — standard (non-overlapped) speculation works.
Caveats and Limitations
B=32 gains are modest. The 1.02-1.22x range at temp=0 is parity-to-slight-improvement, not a dramatic win. The real value is that it doesn't regress — speculative decoding is "free" at batch for this model.
temp=0.7 results are weaker. B=1 drops from 1.55-2.11x to 1.27-1.80x, and some B=32 datasets regress below 1.0x. If your use case requires sampling, expect lower gains.
All benchmarks are on coding-focused datasets (HumanEval, SWEBench variants, Aider, Terminal-Bench, MT-Bench). Conversational or general-knowledge workloads may show different acceptance patterns.
The parameter sweep was HumanEval-only. The 6-config sweep that identified topk=4 as optimal for B=1 was run on a single dataset. The full 7-dataset mixed-config benchmarks validate the finding, but the optimal config could differ for workloads with very different token distributions.
How to Use
MiniMax-M2.5 EAGLE3 support requires our SGLang fork. We recommend running two servers — one per batch regime:
B=1 server (wide tree — real-time, single-user requests)
```shell
pip install git+https://github.com/tails-mpt/sglang.git

python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 8 \
  --speculative-eagle-topk 4 \
  --dtype fp8 \
  --tp 4 \
  --port 30000
```
B=32 server (narrow tree — batch workloads)
```shell
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --dtype fp8 \
  --tp 4 \
  --port 30002
```
Query
```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
Links
- Draft model: thoughtworks/MiniMax-M2.5-Eagle3
- Target model: MiniMaxAI/MiniMax-M2.5
- Previous EAGLE3 posts: GLM-4.7-Flash (deep dive) | Gemma-4-31B (hybrid attention)
- SGLang fork: github.com/tails-mpt/sglang
- SpecForge fork (training): github.com/tails-mpt/SpecForge
- SpecJAX (TPU training): github.com/tails-mpt/SpecJAX
- EAGLE3 paper: arXiv:2503.01840
Citation
@inproceedings{li2025eagle3,
title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}