Performance question
Hello! I use MiniMax-M2.5-AWQ-4bit (cyankiwi) with Aurora-Spec-Minimax-M2.5 draft model, but I'm experiencing very low speculative decoding performance:
- Spec Accept Rate: ~34-35%
- Avg Accept Length: ~1.6 tokens
- Expected: ~2.62 tokens (per model card)
My setup:
SGLang: dev (latest)
Model: MiniMax-M2.5-AWQ-4bit (200K vocab)
Draft: Aurora-Spec-Minimax-M2.5 (32K draft vocab)
Algorithm: EAGLE3
speculative-num-steps: 4
speculative-eagle-topk: 1
speculative-num-draft-tokens: 6
Draft attention: flashinfer
TP/EP: 8
dtype: bfloat16
kv-cache-dtype: fp8_e4m3
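For reference, the setup above corresponds roughly to a launch command like the following. The model paths and the exact backend flag are assumptions on my side; the speculative flags match what I listed:

```shell
python -m sglang.launch_server \
  --model-path cyankiwi/MiniMax-M2.5-AWQ-4bit \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path Aurora-Spec-Minimax-M2.5 \
  --speculative-num-steps 4 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --tp 8 --ep 8 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer
```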
Suspected root cause: vocab size mismatch
The Aurora draft model has draft_vocab_size: 32000 (32K tokens), while the target MiniMax model has vocab_size: 200064
(200K tokens). This means:
- The draft can only propose tokens from its 32K vocabulary
- Any token generated by the target that's NOT in the draft's vocab is auto-rejected
- With Cyrillic text (which uses many rare tokens), many tokens fall outside the 32K draft vocab, so low acceptance, perhaps?
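As a rough sanity check: if each drafted token is accepted independently with probability p ≈ 0.35, the expected accept length (one bonus token plus the accepted prefix of the draft) is 1 + p + p² + …, which lands near the observed ~1.6. This is a simplified i.i.d. model, not SGLang's exact accounting:

```python
def expected_accept_length(p: float, num_draft_tokens: int) -> float:
    """Expected tokens per verify step under a simple i.i.d. acceptance
    model: 1 bonus token plus a geometric accepted prefix of the draft."""
    return sum(p**i for i in range(num_draft_tokens + 1))

# With the observed ~35% accept rate and a 4-token draft:
print(round(expected_accept_length(0.35, 4), 2))  # 1.53, close to the reported ~1.6
```

So the reported accept rate and accept length are at least internally consistent; the question is why the per-token rate itself is so low.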
Questions:
- Is there a version of Aurora draft trained on full 200K vocabulary?
- Or a way to map tokens between vocabularies?
- Any other parameters that could improve acceptance?
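On the mapping question: EAGLE3-style reduced-vocab drafts typically ship a draft-to-target index table (often named d2t), so a draft logit index maps to a target token id without retraining on the full vocabulary. A toy sketch of the idea; the name, layout, and offset convention here are assumptions, not Aurora's actual tensors:

```python
# Toy draft->target vocab map. In EAGLE3-style implementations a 'd2t'
# table stores per-index offsets, so target_id = draft_id + d2t[draft_id].
# Sizes and values below are illustrative only.
draft_vocab_size = 8                  # stands in for the 32K draft vocab
d2t = [0, 0, 3, 5, 5, 9, 12, 12]      # hypothetical offset table

def draft_to_target(draft_id: int) -> int:
    """Map a draft-vocab token id to its target-vocab token id."""
    return draft_id + d2t[draft_id]

print(draft_to_target(2))  # draft index 2 -> target token id 5
```

Note that such a map only covers tokens the draft can propose; target tokens outside the draft vocab are still only reachable as the bonus token on each verify step.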
Thanks for any help!
This speculator is mainly released as a demo for the paper, where the key idea is that it adapts to online traffic rather than relying purely on a fixed offline match.
You can use the Aurora codebase to run serving and online training together so the draft model gradually adapts to your workload: https://github.com/togethercomputer/aurora
So while vocabulary mismatch may contribute to lower initial acceptance, the intended usage is to let the speculator continue training on your real traffic and improve there, rather than expecting the demo checkpoint to be fully optimized out of the box for every deployment.
Thanks! This sounds interesting, will do.