2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5

Community Article Published April 9, 2026

At Thoughtworks, we build inference optimization tools for production LLM deployments. MiniMax-M2.5 has 229 billion parameters but activates only 10 billion per forward pass — 256 experts, top-8 routing. We trained an EAGLE3 draft head for it and measured up to a 2.11x single-user throughput speedup on H200 GPUs. But we only got there after discovering that the optimal speculative tree shape depends on your batch size, and that temperature changes everything for MoE models.


Speculative Decoding in 60 Seconds

If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.

LLM inference is memory-bandwidth bound, not compute-bound. Your GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass — the same cost as generating one token normally.

The output is mathematically identical to what the target model would produce without speculation. This is a guarantee from the accept/reject algorithm, not an approximation.
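At temperature 0 the accept/reject rule degenerates to exact argmax matching, which is easy to sketch. Below is a simplified chain-shaped (non-tree) illustration, assuming a `target_argmax` list produced by the single verification pass; the function name and inputs are ours, for illustration only:

```python
def verify_greedy(draft_tokens, target_argmax):
    """Greedy (temperature 0) verification for a draft chain.

    `target_argmax[i]` is the target model's greedy pick at position i,
    all obtained from one verification forward pass. The accepted output
    is exactly what the target would have generated on its own.
    """
    accepted = []
    for drafted, correct in zip(draft_tokens, target_argmax):
        if drafted == correct:
            accepted.append(drafted)   # draft agreed: a "free" token
        else:
            accepted.append(correct)   # first mismatch: take the target's token, stop
            break
    else:
        # every draft accepted; the verification pass also yields one
        # bonus token after the last drafted position
        accepted.append(target_argmax[len(draft_tokens)])
    return accepted
```

Each verification pass thus emits between 1 and k+1 tokens for a length-k draft, and never emits a token the target model would not have chosen itself.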

EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. The draft head is tiny (~464 MB for MiniMax) and co-deploys on the same GPU.

For the full algorithm walkthrough, accept/reject rule, and math behind the speedup curve, see our previous post on EAGLE3 for GLM-4.7-Flash.


Results

We are releasing thoughtworks/MiniMax-M2.5-Eagle3 — an EAGLE3 draft head for the MiniMax-M2.5 Mixture-of-Experts model.

B=1: Up to 2.11x Throughput

Single-user (B=1), temperature 0, FP8, TP=4, server-side Prometheus metrics. Tree config: steps=3, topk=4, draft_tokens=8.

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 109.3 | 230.6 | 2.11x |
| MT-Bench | 109.9 | 195.6 | 1.78x |
| SWEBench-Verified | 109.6 | 191.8 | 1.75x |
| SWEBench-Multilingual | 109.7 | 170.0 | 1.55x |
| SWEBench-Pro | 110.0 | 183.7 | 1.67x |
| Terminal-Bench | 109.8 | 174.6 | 1.59x |
| Aider | 109.9 | 186.8 | 1.70x |

[Figure: 01-speedup-bars (B=1 speedup by dataset)]

Hardware: 8x NVIDIA H200 144GB, TP=4 per server, native FP8 quantization. Draft head: ~464 MB, co-deployed on the same GPUs.

B=32: Parity or Better Across All Datasets

The conventional wisdom is that MoE models regress under speculative decoding at larger batch sizes because tree verification activates extra experts. At temperature 0, that regression disappears entirely. Tree config: steps=5, topk=1, draft_tokens=6.

| Dataset | Ratio vs Baseline |
|---|---|
| HumanEval | 1.14x |
| MT-Bench | 1.15x |
| SWEBench-Verified | 1.06x |
| SWEBench-Multilingual | 1.02x |
| SWEBench-Pro | 1.07x |
| Terminal-Bench | 1.22x |
| Aider | 1.06x |

[Figure: 02-batch-parity (B=32 throughput ratios)]

The Mixed Inference Config

The key finding: use different tree shapes for different batch sizes.

At B=1, the GPU is mostly idle — a wider tree (topk=4, branching into 4 candidates per step) exploits that spare compute to propose more tokens in parallel. At B=32, the GPU is saturated — a wider tree triggers more MoE expert dispatches per verification step, creating overhead. A narrow chain (topk=1) minimizes this overhead while still getting the benefit of speculative decoding.
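To make the asymmetry concrete, here is a rough sketch of the candidate counts involved. This is an upper-bound illustration only, not SGLang's actual tree-construction algorithm, which keeps just the highest-cumulative-probability paths:

```python
def full_tree_nodes(steps: int, topk: int) -> int:
    """Upper bound on candidate nodes in a depth-`steps` tree where every
    node branches into `topk` children. The real EAGLE3 tree prunes this
    down to the best paths, capped by --speculative-num-draft-tokens."""
    return sum(topk ** d for d in range(1, steps + 1))

# Wide B=1 config (steps=3, topk=4): 4 + 16 + 64 = 84 possible candidate
# positions, pruned to the 8 best (draft_tokens=8).
wide = min(full_tree_nodes(3, 4), 8)

# Narrow B=32 config (steps=5, topk=1): a single chain of 5 drafted
# tokens, comfortably under the draft_tokens=6 cap.
narrow = min(full_tree_nodes(5, 1), 6)
```

The wide tree spends spare B=1 compute exploring 4-way branches; the narrow chain keeps per-request verification work (and therefore expert dispatch) minimal when the GPU is already saturated.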

[Figure: 03-tree-shapes (wide vs. narrow speculative trees)]

In production, this means running two server pools: one for real-time single-user requests (wide tree), one for batch workloads (narrow tree). Both use the same draft model checkpoint.
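A front-end dispatcher for this dual-pool setup can be as simple as routing on current concurrency. The ports match the launch commands in the How to Use section; the function name and threshold here are illustrative, not part of the release:

```python
# Hypothetical front-end routing for the dual-pool deployment:
# interactive traffic goes to the wide-tree server, batch jobs to the
# narrow-tree server. Both pools serve the same draft checkpoint.
WIDE_TREE_URL = "http://localhost:30000/v1/chat/completions"
NARROW_TREE_URL = "http://localhost:30002/v1/chat/completions"

def pick_backend(concurrent_requests: int, threshold: int = 4) -> str:
    """Route low-concurrency (latency-sensitive) traffic to the wide-tree
    pool and high-concurrency batch traffic to the narrow-tree pool."""
    if concurrent_requests <= threshold:
        return WIDE_TREE_URL
    return NARROW_TREE_URL
```

In practice the routing signal could equally be a request tag (interactive vs. batch) set by the caller rather than live concurrency.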

Configuration

| Parameter | Value |
|---|---|
| Target model | MiniMaxAI/MiniMax-M2.5 (229B total, ~10B active) |
| Architecture | MoE: 256 experts, top-8 routing, sigmoid scoring, 62 layers |
| Draft head | 1 layer, hidden_size=3072, aux layers [1, 30, 58] |
| Hardware | 8x H200 144GB, TP=4 per server |
| Training data | 20K regenerated samples (target-model responses at temp=0.8) |
| Training | 9 epochs original data + 6 epochs regenerated, LR=2e-5 |
| SGLang version | v0.5.6 (tails-mpt/sglang) |

Why Temperature Changes Everything

The EAGLE-3 paper reports 15-22% better speedups at temperature 0 on dense models. For MoE, the effect is far more dramatic.

Using the same draft model (Exp C) at both temperatures:

[Figure: 04-temperature (speedup at temp=0.7 vs. temp=0)]

| Dataset | temp=0.7 | temp=0 |
|---|---|---|
| HumanEval | 1.12x | 1.33x |
| MT-Bench | 1.08x | 1.30x |
| SWEBench-Verified | 1.12x | 1.24x |
| SWEBench-Multilingual | 1.07x | 1.20x |
| SWEBench-Pro | 1.07x | 1.21x |
| Terminal-Bench | 1.14x | 1.29x |
| Aider | 1.13x | 1.34x |

At temperature 0 (greedy decoding), the draft model's job is easier: there is one right answer, not a distribution over plausible continuations. Acceptance rates go up, fewer tokens are rejected, and the union of experts activated across the speculative tree during MoE verification shrinks. The "MoE verification wall" at B=32 essentially disappears, going from 0.84-1.07x (regression) at temp=0.7 to 0.99-1.05x (parity) at temp=0.
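The standard expected-length formula makes the mechanism concrete: with per-token acceptance rate α and a length-k draft chain, the number of tokens produced per target forward pass follows a geometric series. The α values below are illustrative, not measured acceptance rates from our runs:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens generated per target forward pass with a length-k
    draft chain and per-token acceptance rate alpha: the geometric-series
    result (1 - alpha^(k+1)) / (1 - alpha) from the speculative decoding
    literature."""
    if alpha == 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Lifting acceptance from 0.6 to 0.8 (the kind of jump greedy decoding
# produces) raises expected tokens per verification step substantially:
low = expected_tokens_per_step(0.6, 5)   # ~2.38 tokens per step
high = expected_tokens_per_step(0.8, 5)  # ~3.69 tokens per step
```

Because throughput scales roughly with tokens accepted per verification pass, a modest acceptance-rate gain compounds into a large speedup difference.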

For coding, MiniMax-M2.5's primary market, temperature 0 is the production setting. All our earlier experiments at temp=0.7 (the benchmark script default) therefore understated real-world performance.


Training Progression

We trained seven experimental checkpoints (Exp A through Exp G), progressively improving the draft head.

Training accuracy across the first three experiments (original data, same prompts; acc_k is the draft head's top-1 accuracy at draft position k):

| Position | Exp A (3 ep) | Exp B (6 ep) | Exp C (9 ep) |
|---|---|---|---|
| acc_0 | 0.778 | 0.813 | 0.820 |
| acc_3 | 0.678 | 0.749 | 0.789 |
| acc_6 | 0.626 | 0.693 | 0.730 |

Exp D (12 epochs) showed diminishing returns — accept rates slightly declined, suggesting overfitting. Exp C (9 epochs) is the sweet spot for original data.

The breakthrough came from regenerated training data: instead of generic assistant responses, we generated 20K new samples using MiniMax-M2.5 itself (temp=0.8). Continuing from Exp C on this regenerated data:

  • Exp E (+3 epochs, LR=5e-5): B=32 improved from 0.99-1.05x to 0.99-1.15x
  • Exp F (+3 more epochs, LR=2e-5): best overall — 1.19-1.45x B=1, 1.00-1.14x B=32 (uniform config)

The mixed config then pushed Exp F's B=1 from 1.45x to 2.11x.

Negative result: Exp G used teacher temperature T=2.0 to soften the training target distribution. It regressed at temp=0 (1.33-1.77x vs Exp F's 1.55-2.11x) because softer targets dilute the argmax accuracy that greedy decoding depends on. Exp F remains the best checkpoint.
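The Exp G failure mode is easy to reproduce in miniature: dividing logits by a teacher temperature T > 1 before the softmax flattens the target distribution, shrinking the probability mass on the argmax token that greedy verification needs the draft head to hit. A toy illustration (the logits are made up):

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T. Higher T flattens
    the distribution, moving mass away from the argmax token."""
    scaled = [x / T for x in logits]
    m = max(scaled)                             # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits: the top token dominates at T=1, but the T=2 teacher
# distribution spreads mass to alternatives, which is what diluted
# Exp G's argmax accuracy under greedy decoding.
logits = [4.0, 2.0, 1.0]
p_t1 = softmax_with_temperature(logits, 1.0)
p_t2 = softmax_with_temperature(logits, 2.0)
assert p_t1[0] > p_t2[0]  # argmax mass shrinks as T grows
```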


Engineering Notes

FP8 Fix

Running EAGLE3 with FP8 target models caused dtype mismatches in SGLang's CUDA graph replay. We patched four files:

  1. logits_processor.py: bfloat16 cast in auxiliary hidden-state path
  2. eagle_worker.py: idle input dtype → bfloat16 (not target model dtype)
  3. eagle_info.py: same idle input dtype fix
  4. llama.py: cast shared embedding/lm_head to bfloat16 in set_embed()

These patches enable native FP8 serving with EAGLE3 for any model — not MiniMax-specific.

TP=4 constraint

TP=8 fails for MiniMax-M2.5 because the expert intermediate_size of 1536 splits into 1536 / 8 = 192 per GPU, which is not divisible by the FP8 quantization block size (block_n=128). TP=4 splits it into 1536 / 4 = 384 = 3 x 128, which works cleanly and uses 4 GPUs per server, leaving 4 GPUs for a second server (enabling the dual B=1/B=32 config on a single 8-GPU node).
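The constraint reduces to a divisibility check, sketched here (the function and its name are ours, not SGLang's):

```python
def tp_compatible(intermediate_size: int, tp: int, block_n: int = 128) -> bool:
    """An FP8 weight shard is only valid when the per-GPU slice of the
    expert intermediate dimension is a whole number of quantization blocks."""
    if intermediate_size % tp != 0:
        return False
    per_gpu = intermediate_size // tp
    return per_gpu % block_n == 0

# MiniMax-M2.5: expert intermediate_size=1536, FP8 block_n=128
assert not tp_compatible(1536, 8)   # 1536/8 = 192, not a multiple of 128
assert tp_compatible(1536, 4)       # 1536/4 = 384 = 3 x 128
```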

SPEC_V2 overlap scheduler

SGLANG_ENABLE_SPEC_V2=true (the overlap scheduler that pipelines drafting and verification) crashes with a NotImplementedError in update_verify_buffers_to_fill_after_draft: the attention backend is incompatible with V2's buffer management. This is a known limitation; standard (non-overlapped) speculation works.


Caveats and Limitations

B=32 gains are modest. The 1.02-1.22x range at temp=0 is parity-to-slight-improvement, not a dramatic win. The real value is that it doesn't regress — speculative decoding is "free" at batch for this model.

temp=0.7 results are weaker. B=1 drops from 1.55-2.11x to 1.27-1.80x, and some B=32 datasets regress below 1.0x. If your use case requires sampling, expect lower gains.

All benchmarks are on coding-focused datasets (HumanEval, SWEBench variants, Aider, Terminal-Bench, MT-Bench). Conversational or general-knowledge workloads may show different acceptance patterns.

The parameter sweep was HumanEval-only. The 6-config sweep that identified topk=4 as optimal for B=1 was run on a single dataset. The full 7-dataset mixed-config benchmarks validate the finding, but the optimal config could differ for workloads with very different token distributions.


How to Use

MiniMax-M2.5 EAGLE3 support requires our SGLang fork. We recommend running two servers — one per batch regime:

B=1 server (wide tree — real-time, single-user requests)

pip install git+https://github.com/tails-mpt/sglang.git

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 \
    --speculative-eagle-topk 4 \
    --dtype fp8 \
    --tp 4 \
    --port 30000

B=32 server (narrow tree — batch workloads)

python -m sglang.launch_server \
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
    --speculative-num-steps 5 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 1 \
    --dtype fp8 \
    --tp 4 \
    --port 30002

Query

import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    }
)
print(response.json()["choices"][0]["message"]["content"])


Citation

@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
