# EAGLE3 Draft Head — MiniMax-M2.5

A lightweight EAGLE3 draft head for MiniMax-M2.5 (229B MoE, ~10B active parameters). Trained with SpecForge on 8x H200 GPUs using the EAGLE-3 Training-Time Test (TTT) objective.
Blog post: 2x Faster on a 229B MoE: EAGLE3 Speculative Decoding for MiniMax-M2.5
## Usage

### SGLang (GPU)
Requires our SGLang fork for MiniMax-M2.5 Eagle3 support + FP8 dtype fixes.
B=1 server (wide tree — optimal for single-user, real-time requests):

```shell
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 8 \
  --speculative-eagle-topk 4 \
  --quantization fp8 \
  --tp 4 \
  --port 30000
```
B=32 server (narrow tree — optimal for batch workloads):

```shell
python -m sglang.launch_server \
  --model-path MiniMaxAI/MiniMax-M2.5 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/MiniMax-M2.5-Eagle3 \
  --speculative-num-steps 5 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 1 \
  --quantization fp8 \
  --tp 4 \
  --port 30002
```
Important: Use different speculative configs for B=1 vs B=32. A wider tree (topk=4) exploits idle GPU compute at low batch; a narrow tree (topk=1) minimizes MoE expert dispatch overhead at high batch.
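To see why the two trees behave so differently, it helps to count candidate nodes. A minimal sketch, assuming the draft tree expands `topk` children per node at each step and is then pruned to the best `draft_tokens` candidates before verification (the pruning detail is an assumption about SGLang's Eagle implementation, not taken from this card):

```python
def max_tree_nodes(steps: int, topk: int) -> int:
    # Upper bound on candidate draft nodes before the draft_tokens cap:
    # each step expands every frontier node into topk children.
    return sum(topk ** d for d in range(1, steps + 1))

# B=1 wide tree: steps=3, topk=4 -> up to 4 + 16 + 64 = 84 candidates,
# pruned down to the best 8 (draft_tokens=8). Lots of speculative work,
# affordable only when the GPU is otherwise idle.
print(max_tree_nodes(3, 4))  # 84

# B=32 narrow tree: steps=5, topk=1 -> a single chain of 5 drafts,
# so MoE expert dispatch overhead stays minimal per request.
print(max_tree_nodes(5, 1))  # 5
```

The wide tree verifies many alternatives per step, while the narrow tree is effectively chain drafting; at high batch the verification FLOPs are no longer free, which is why the narrow configuration wins there.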
### Python Client
```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
## Training Details
| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 144GB (TP=4, DP=2) |
| Dataset | 20K regenerated samples (target-model responses at temp=0.8) |
| Pre-training | 9 epochs on 54K mixed data (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Fine-tuning | 6 epochs on 20K regenerated data |
| Learning rate | 2e-5 (final stage) |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 2048 |
| TTT (tree training tokens) | 7 |
| Precision | bfloat16 |
## Training Method
EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 1, 30, 58 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
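The fusion of the three auxiliary hidden states can be sketched as concatenation followed by a learned down-projection into the draft head's hidden size. All shapes below are illustrative (the draft hidden size of 3072 comes from the architecture table; the target model's true hidden size may differ, and the real fusion weight is learned, not random):

```python
import numpy as np

# Sketch of EAGLE3-style feature fusion under assumed shapes: hidden
# states captured from three target layers (early/middle/late) are
# concatenated and projected down to the draft head's hidden size.
rng = np.random.default_rng(0)
target_hidden, draft_hidden, seq = 3072, 3072, 4  # hypothetical dims

h_early, h_mid, h_late = (
    rng.standard_normal((seq, target_hidden)) for _ in range(3)
)
# Hypothetical learned fusion weight: [3 * target_hidden, draft_hidden]
W_proj = rng.standard_normal((3 * target_hidden, draft_hidden)) * 0.01

fused = np.concatenate([h_early, h_mid, h_late], axis=-1) @ W_proj
print(fused.shape)  # (4, 3072)
```

The single draft layer then attends over these fused features plus the previously drafted tokens, which is what makes a one-layer head competitive with the full 229B target for short horizons.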
## Performance

### Training Accuracy (base checkpoint, before regenerated-data fine-tuning)
| Position | Accuracy |
|---|---|
| acc_0 | 0.820 |
| acc_1 | 0.809 |
| acc_2 | 0.781 |
| acc_3 | 0.789 |
| acc_4 | 0.777 |
| acc_5 | 0.761 |
| acc_6 | 0.730 |
The released model was fine-tuned for 6 additional epochs on 20K regenerated samples from the target model. The fine-tuned accuracy is expected to be equal to or higher than these base values.
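These per-position accuracies translate into an expected accepted length. A back-of-envelope estimate, under the simplifying assumptions that position-k acceptance requires all earlier positions to be accepted and that the table's accuracies multiply independently:

```python
# Per-position draft accuracies (acc_0..acc_6) from the table above.
acc = [0.820, 0.809, 0.781, 0.789, 0.777, 0.761, 0.730]

expected = 0.0
prefix = 1.0
for a in acc:
    prefix *= a          # P(positions 0..k are all accepted)
    expected += prefix   # each fully accepted prefix yields one more token

print(round(expected, 2))  # ~3.15 expected accepted tokens per step
```

Roughly three extra tokens accepted per verification step is consistent with the ~2x end-to-end speedups reported below, once draft and verification overheads are accounted for.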
### Inference Benchmarks (B=1, temp=0, TP=4)
With draft_tokens=8 (best B=1 config):
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 109.3 | 230.6 | 2.11x |
| MT-Bench | 109.9 | 195.6 | 1.78x |
| SWEBench-Verified | 109.6 | 191.8 | 1.75x |
| Aider | 109.9 | 186.8 | 1.70x |
Config: steps=3, topk=4, draft_tokens=8. 8x H200 (TP=4).
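The speedup column is simply the throughput ratio; recomputing it from the table above confirms the reported figures:

```python
# (baseline tok/s, EAGLE3 tok/s) per dataset, from the table above.
results = {
    "HumanEval":         (109.3, 230.6),
    "MT-Bench":          (109.9, 195.6),
    "SWEBench-Verified": (109.6, 191.8),
    "Aider":             (109.9, 186.8),
}
for name, (base, eagle) in results.items():
    print(f"{name}: {eagle / base:.2f}x")  # matches the Speedup column
```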
With draft_tokens=6 (verified 2026-04-12):
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 109.6 | 177.0 | 1.61x |
| Terminal-Bench | 108.9 | 160.8 | 1.48x |
| MT-Bench | 109.0 | 146.8 | 1.35x |
| SWEBench-Verified | 109.1 | 123.1 | 1.13x |
Config: steps=3, topk=4, draft_tokens=6. 4x H200 (TP=4). Server-side Prometheus metrics.
## Model Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 3072 |
| Num hidden layers | 1 |
| Num attention heads | 24 (8 KV heads) |
| Intermediate size | 8192 |
| Auxiliary layers | [1, 30, 58] |
| Vocab size | 200064 (target) / 32000 (draft) |
| Checkpoint size | ~464 MB |
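The checkpoint size is consistent with the architecture. A rough parameter count implied by the file size, assuming the ~464 MB checkpoint stores bfloat16 weights (2 bytes per parameter) and MB means MiB:

```python
# Back-of-envelope: checkpoint bytes / bytes-per-bf16-param.
checkpoint_bytes = 464 * 1024**2
approx_params = checkpoint_bytes / 2
print(f"~{approx_params / 1e6:.0f}M parameters")  # ~243M
```

Around a quarter-billion parameters for the draft head, versus ~10B active in the target per token, is why the drafting passes are nearly free relative to verification.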
## Limitations
- TP=4 only. TP=8 fails due to an FP8 block size constraint (`intermediate_size / 8 = 192`, not divisible by `block_n=128`).
- Temperature sensitivity. Best performance at temp=0 (greedy). At temp=0.7, B=1 speedup drops to 1.27-1.80x and some B=32 datasets regress below baseline.
- Coding-focused benchmarks. All benchmarks use coding-oriented datasets (HumanEval, SWEBench, Aider). Conversational workloads may show different patterns.
- SPEC_V2 incompatible. The overlap scheduler (`SGLANG_ENABLE_SPEC_V2=true`) is not supported — standard (non-overlapped) speculation only.
- Requires SGLang fork. Upstream SGLang does not yet include the FP8 dtype patches needed for Eagle3 on this model.
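The TP=8 failure is a simple divisibility check. A sketch of the constraint, assuming a per-expert intermediate size of 1536 (inferred from `1536 / 8 = 192` in the limitation above; the exact dimension is an assumption) and FP8 block quantization with `block_n=128`:

```python
# FP8 block quantization requires each tensor-parallel shard of the
# intermediate dimension to be a multiple of the quantization block.
BLOCK_N = 128
intermediate_size = 1536  # hypothetical per-expert dimension

for tp in (4, 8):
    shard = intermediate_size // tp
    ok = shard % BLOCK_N == 0
    print(f"TP={tp}: shard={shard}, divisible by {BLOCK_N}: {ok}")
# TP=4 gives shards of 384 (3 blocks); TP=8 gives 192, which leaves a
# 64-wide remainder and cannot be block-quantized.
```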
## License
This draft head is released under Apache 2.0, matching the MiniMax-M2.5 license.
## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```