Google Released Gemma-4 Four Days Ago. We Already Made It 1.72× Faster.
Speculative Decoding in 60 Seconds
If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.
LLM inference is memory-bandwidth bound, not compute-bound. Your GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass — the same cost as generating one token normally.
The output is mathematically identical to what the target model would produce without speculation. This is a guarantee from the accept/reject algorithm, not an approximation.
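The guarantee comes from the accept/reject rule itself. A minimal single-token sketch in plain NumPy (the function name and distributions are ours, purely for illustration):

```python
import numpy as np

def accept_or_resample(p, q, x, rng):
    """Speculative decoding's accept/reject step for one drafted token.

    p: target-model distribution, q: draft-model distribution,
    x: token index proposed by the draft.
    Accept x with probability min(1, p[x] / q[x]); on rejection,
    resample from the residual max(p - q, 0), renormalized. The
    returned token is distributed exactly according to p.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target distribution (illustrative)
q = np.array([0.2, 0.5, 0.3])  # draft distribution (illustrative)
token, accepted = accept_or_resample(p, q, x=0, rng=rng)  # p[0] > q[0], so x=0 is always accepted
```

Verification applies this rule to each drafted position in sequence, so an accepted prefix costs one target forward pass instead of one pass per token.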
EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. This makes the draft much better at predicting what the target would say. The draft head is tiny (~277 MB) and co-deploys on the same GPU.
For a deeper dive on how speculative decoding works, the accept/reject rule, and the math behind the speedup curve, see our previous post on EAGLE3 for GLM-4.7-Flash.
Results
We are releasing thoughtworks/Gemma-4-31B-Eagle3 — to our knowledge, the first publicly available EAGLE3 draft head for the Gemma-4 architecture.
All benchmarks: single-user (B=1), temperature 0, CUDA graphs enabled, TP=2, server-side Prometheus metrics.
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 49.7 | 85.4 | 1.72× |
| HumanEval | 49.8 | 73.7 | 1.48× |
| SWEBench-Multilingual | 48.5 | 55.4 | 1.14× |
| SWEBench-Verified | 48.2 | 50.4 | 1.05× |
Hardware: 8× NVIDIA H200 141GB. Both baseline and EAGLE3 measured at TP=2 with CUDA graphs enabled — a fair apples-to-apples comparison. TP=2 is required because Gemma-4's 42 Q-heads are not divisible by 4. Draft head: 277 MB, co-deployed on the same GPUs.
Training acceptance rate: acc_0 = 0.75–0.82. Inference acceptance rates vary by dataset: MT-Bench (conversational) shows the highest speedup; SWEBench (code-heavy, less predictable token sequences) shows the lowest.
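A rough model of why acceptance drives this spread (a sketch of the standard geometric argument, not our measurement methodology): if each of k drafted tokens is accepted independently with probability α, one verification pass emits $(1 - \alpha^{k+1}) / (1 - \alpha)$ tokens on average.

```python
def expected_tokens_per_step(alpha, k):
    """Mean tokens emitted per target forward pass: the accepted prefix
    follows a truncated geometric distribution, plus the one token the
    target model always contributes (the bonus/correction token)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative values, not our measured acceptance rates:
expected_tokens_per_step(0.8, 3)  # ~2.95 tokens/step, conversational-style acceptance
expected_tokens_per_step(0.4, 3)  # ~1.62 tokens/step, code-style acceptance
```

Small drops in acceptance compound across the draft length, which is why code-heavy workloads land so much closer to baseline.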
Training Configuration
| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8× H200 (TP=4 for target model, DP=2) |
| Dataset | 54K mixed (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Epochs | 3 |
| Learning rate | 5e-5 |
| max_length | 1024 |
| TTT (tree training tokens) | 7 |
| Training time | ~117 minutes |
| Checkpoint size | 277 MB |
Why Gemma-4 Is Different
Most EAGLE3 work targets standard dense transformers (Llama, Qwen) or MoE models (GLM-4.7-Flash, MiniMax). Gemma-4-31B is neither — it's a hybrid-attention dense model with two fundamentally different layer types:
- 50 sliding-window layers: 16 KV heads, head_dim=256, window=1024
- 10 global attention layers: 4 KV heads, head_dim=512, V=K (no separate v_proj)
This means the KV cache cannot be a uniform tensor. Each layer type has different shapes, different head counts, and different memory requirements. Every inference engine that serves Gemma-4 must maintain two separate memory pools — and when EAGLE3's tree verification starts rapidly allocating and freeing cache entries across both pools simultaneously, things break in ways nobody anticipated.
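Concretely, with the layer counts and dims above, the two pools look roughly like this (an illustration of the shapes and the byte-count mismatch, not SGLang's actual pool code):

```python
import numpy as np

num_slots = 8  # toy number of token cache slots

# 50 sliding-window layers: 16 KV heads, head_dim=256, separate K and V.
swa_k = np.zeros((50, num_slots, 16, 256), dtype=np.float16)
swa_v = np.zeros_like(swa_k)

# 10 global layers: 4 KV heads, head_dim=512, V = K, so one tensor suffices.
global_k = np.zeros((10, num_slots, 4, 512), dtype=np.float16)

# Bytes per cached token differ between the pools, so no single uniform
# tensor can back both; each pool needs its own allocator and free list.
swa_bytes_per_token = 50 * 2 * 16 * 256 * 2    # layers * (K,V) * heads * dim * fp16 bytes
global_bytes_per_token = 10 * 1 * 4 * 512 * 2  # layers * (K only) * heads * dim * fp16 bytes
```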
We chose Gemma-4 specifically because it exercises a code path that no existing EAGLE3 deployment has handled. If speculative decoding can work here, it can work on the next generation of hybrid-attention architectures too.
What We Learned
The training sequence length sweet spot
We trained three models identical in every way except sequence length:
| Experiment | max_length | MT-Bench | HumanEval | SWEBench-Verified |
|---|---|---|---|---|
| Exp A-SGLang | 512 | 1.64× | 1.38× | 1.01× |
| Exp B-SGLang | 1024 | 1.72× | 1.48× | 1.05× |
| Exp C-SGLang | 2048 | 1.67× | 1.47× | 1.08× |
max_length=1024 is the sweet spot. Shorter sequences (512) give the draft less context to learn from. Longer sequences (2048) don't improve acceptance rates for typical benchmark prompt lengths and take 3× longer to train.
Always train with the backend you'll serve with
EAGLE3 draft heads learn from the target model's hidden-state distributions at specific layers. If you train with one backend (e.g., HuggingFace Transformers) and serve with another (e.g., SGLang), those hidden states can diverge significantly: we measured up to a 32% difference at the layer closest to the output. The result is a draft that looks great during training (acc_0 = 0.85–0.87) but achieves only ~13% acceptance at inference time. Retraining with `--target-model-backend sglang` fixed it immediately (acc_0 = 0.75–0.82, with real-world acceptance matching expectations). This applies to any EAGLE3 deployment, not just Gemma-4.
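A cheap preflight check for this failure mode (a sketch; the arrays below are placeholders standing in for hidden states dumped from each backend on identical prompts):

```python
import numpy as np

def relative_diff(h_a, h_b):
    """Relative L2 difference between hidden states captured from two
    backends for the same prompt at the same layer."""
    return np.linalg.norm(h_a - h_b) / np.linalg.norm(h_a)

# Placeholders: in practice, dump the aux hidden states the draft head
# trains on (early/middle/late layers) from each backend and compare.
h_train_backend = np.ones((4, 5376))
h_serve_backend = h_train_backend * 1.01
drift = relative_diff(h_train_backend, h_serve_backend)  # 0.01 here; we saw up to 0.32
```

If the drift at any tapped layer is far from zero, retrain against the serving backend before burning GPU hours on a mismatched draft.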
Three bugs in the serving stack
Getting Gemma-4 to run correctly in SGLang required fixing three issues that don't exist in standard transformer architectures:
Attention scaling = 1.0. Gemma-4 applies QK-normalization, so the standard $1/\sqrt{d}$ scaling factor is not used. SGLang was applying $256^{-0.5} = 0.0625$, shrinking attention outputs by 16× per layer — producing garbage output after 60 layers.
V = K in global layers. Global attention layers have `attention_k_eq_v = True` — no `v_proj` weights. The value tensor is a clone of the key tensor. No existing SGLang model needed per-layer conditional logic for this.

Partial RoPE for global layers. Global layers use `partial_rotary_factor = 0.25` — only 128 of 512 dimensions receive rotary position encoding. Standard RoPE implementations apply rotation to the full tensor.
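All three fixes reduce to small conditionals in the attention path. A schematic in NumPy (our illustration of the behaviors, not the actual SGLang patch):

```python
import numpy as np

HEAD_DIM_GLOBAL = 512
ROTARY_DIMS = int(HEAD_DIM_GLOBAL * 0.25)  # partial_rotary_factor = 0.25 -> 128

def attend(q, k, v, is_global):
    """Single-head attention with the Gemma-4-specific behaviors."""
    scores = q @ k.T  # Fix 1: scaling = 1.0; QK-norm replaces 1/sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    if is_global:
        v = k  # Fix 2: attention_k_eq_v, global layers carry no v_proj
    return w @ v

def apply_partial_rope(x, cos, sin):
    """Fix 3: rotate only the first 128 of 512 dims; pass the rest through."""
    rot, rest = x[..., :ROTARY_DIMS], x[..., ROTARY_DIMS:]
    x1, x2 = np.split(rot, 2, axis=-1)
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```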
The memory leak: SWAKVPool double-free
The hardest bug. EAGLE3's tree verification rapidly allocates and frees KV cache entries. With Gemma-4's dual memory pools, the alloc/free pattern during verification can trigger a double-free: the allocator frees an index already freed in a previous cycle, corrupting pool state and crashing with a CUDA device-side assert on the next allocation.
Compounding this: the pool-to-pool mapping was initialized with 0 — a valid index — so the filter swa_indices > 0 silently skipped freeing slot 0, causing the two pools to drift out of sync.
Fix: a double-free guard (check mapping before free), sentinel value changed from 0 to -1, and a boolean allocated mask to track pool state explicitly.
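The pattern generalizes beyond this pool. As a standalone sketch of all three fixes (a toy model, not the actual SWAKVPool code):

```python
import numpy as np

class PoolMapping:
    """Toy model of a cross-pool index mapping with the three fixes:
    a -1 sentinel, an explicit allocated mask, and an idempotent free."""

    def __init__(self, size):
        self.mapping = np.full(size, -1, dtype=np.int64)  # sentinel -1: 0 is a valid index
        self.allocated = np.zeros(size, dtype=bool)       # explicit per-slot pool state
        self.free_list = list(range(size))

    def alloc(self, full_index):
        swa_index = self.free_list.pop()
        self.mapping[full_index] = swa_index
        self.allocated[full_index] = True
        return swa_index

    def free(self, full_index):
        if not self.allocated[full_index]:  # double-free guard: freeing twice is a no-op
            return
        self.free_list.append(int(self.mapping[full_index]))
        self.mapping[full_index] = -1
        self.allocated[full_index] = False
```

With a 0 sentinel, a filter like `swa_indices > 0` silently leaks slot 0; the -1 sentinel plus the mask lets every slot, including 0, round-trip correctly.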
Caveats and Limitations
TP=2 only. Gemma-4-31B has 42 attention heads — not divisible by 4. EAGLE3's draft model inherits this constraint, so we serve at TP=2. A future draft with a TP=4-compatible head configuration (e.g., 32 heads × 168 head_dim = 5376) would unlock TP=4 serving and likely higher absolute throughput.
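The constraint is plain divisibility (a sketch; the head counts are the ones discussed above):

```python
def valid_tp_degrees(num_q_heads, candidates=(1, 2, 4, 8)):
    """Tensor-parallel degrees that split the Q-heads evenly across GPUs."""
    return [tp for tp in candidates if num_q_heads % tp == 0]

valid_tp_degrees(42)  # [1, 2]: why we serve at TP=2
valid_tp_degrees(32)  # [1, 2, 4, 8]: the proposed retrained head config
```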
SWEBench speedups are modest. Code generation produces less predictable token sequences than conversational text. The speedup drops from 1.72× (MT-Bench) to 1.05–1.14× (SWEBench variants). This is consistent across models — speculative decoding helps more on natural language than on code.
How to Use
Gemma-4 support requires our SGLang and SpecForge forks — the patches for hybrid attention, SWAKVPool fixes, and the Gemma-4 chat template are not yet upstream.
- SGLang fork: github.com/tails-mpt/sglang
- SpecForge fork: github.com/tails-mpt/SpecForge
Launch the server
pip install git+https://github.com/tails-mpt/sglang.git
python -m sglang.launch_server \
--model-path google/gemma-4-31B \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path thoughtworks/Gemma-4-31B-Eagle3 \
--speculative-num-steps 3 \
--speculative-num-draft-tokens 8 \
--speculative-eagle-topk 4 \
--attention-backend triton \
--tp 2 \
--trust-remote-code \
--port 30000
Note: --attention-backend triton is required — FlashInfer is incompatible with Gemma-4's head_dim=512 global layers.
Query
import requests
response = requests.post(
"http://localhost:30000/v1/chat/completions",
json={
"model": "default",
"messages": [{"role": "user", "content": "Write a Python function to find the longest common subsequence of two strings."}],
"max_tokens": 512,
}
)
print(response.json()["choices"][0]["message"]["content"])
What's Next
- TP=4-compatible draft head — retrain with a head configuration where Q-heads divide evenly by 4, enabling full tensor parallelism
- Regenerated training data — 5–10K samples generated by Gemma-4 itself, replacing generic assistant responses with on-distribution outputs
- Upstream patches — contribute the Gemma-4 hybrid attention fixes and SWAKVPool double-free guard back to SGLang
Links
- Draft model: thoughtworks/Gemma-4-31B-Eagle3
- Our previous EAGLE3 post (deeper spec decoding explainer): Speculative Decoding in Practice
- SGLang fork (Gemma-4 support): github.com/tails-mpt/sglang
- SpecForge fork (training): github.com/tails-mpt/SpecForge
- SpecJAX (TPU draft head training): github.com/tails-mpt/SpecJAX
- EAGLE3 paper: arXiv:2503.01840
Citation
@inproceedings{li2025eagle3,
title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}