---
license: cc-by-4.0
datasets:
- Dogacel/nemotron-post-training-v2-gpt-oss-20b-regen
language:
- en
base_model:
- openai/gpt-oss-20b
---

# Model Card for specdrift-gpt-oss-20b-eagle3

EAGLE-3 drafter model for gpt-oss-20b. This model is released as part of the paper _Attention Drift: What Speculative Decoding Models Learn_.

It has two minor architectural differences from the original EAGLE: the drafter hidden state is captured *after* the norm, and an additional norm is injected before the FC layer.

## Model Details

### Model Sources

- **Repository:** [Dogacel/SpecDrift](https://github.com/Dogacel/SpecDrift)
- **Paper:** https://arxiv.org/abs/2605.09992

## Uses

We recommend serving the model with SGLang:

```
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server \
  --model-path openai/gpt-oss-20b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path "Dogacel/specdrift-gpt-oss-20b-eagle3" \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-draft-sliding-window 2048 \
  --port 30000 \
  --dp-size 1 --tp-size 1 \
  --max-running-requests 64 \
  --cuda-graph-max-bs 64 \
  --attention-backend fa3 \
  --trust-remote-code \
  --mem-fraction-static 0.9 \
  --dtype bfloat16
```

An example client request against the launched server is included at the end of this card.

## Training Details

### Training Data

This model is trained on the Nemotron Post Training V2 dataset, with answers regenerated using gpt-oss-20b. The dataset is publicly available at: https://huggingface.co/datasets/Dogacel/nemotron-post-training-v2-gpt-oss-20b-regen

### Training Procedure

We trained the model with [SpecForge](https://github.com/sgl-project/SpecForge) on 8xH200 GPUs in under 8 hours.

- **LR:** 1e-4 (warmup 0.2, cosine)
- **Epochs:** 2
- **Batch size:** 4 per GPU (effective 4x8 = 32)
- **Max length:** 4096
- **TTT:** 4

## Evaluation

Evaluation was run on MT-Bench with 80 prompts, max tokens 2048, and temperature 0.7.
Scripts are available at [SpecForge](https://github.com/sgl-project/SpecForge/pull/552).

### H100 @ BS=1 — Baseline vs Ours (1-3-1-4)

| Metric | Baseline | Ours (1-3-1-4) | Δ |
|---|---:|---:|---:|
| **Latency (s)** | 444.05 | **373.11** | −16.0% |
| **Throughput (tok/s)** | 304.93 | **371.90** | +22.0% |
| **Accept Length** | 1.000 | **2.347** | +134.7% |

### Per-Category Throughput (H100, BS=1)

| Category | Baseline → Ours (tok/s) | Δ | Accept Length |
|---|---:|---:|---:|
| Writing | 207.83 → 268.62 | +29.2% | 2.225 |
| Roleplay | 301.01 → 380.61 | +26.4% | 2.210 |
| Reasoning | 260.19 → 265.83 | +2.2% | 2.334 |
| Math | 170.41 → 190.53 | +11.8% | **2.894** |
| Coding | 427.36 → 487.45 | +14.1% | 2.672 |
| Extraction | 164.69 → 233.76 | **+41.9%** | 2.634 |
| STEM | 436.35 → 545.97 | +25.1% | 2.287 |
| Humanities | 471.61 → 602.40 | +27.7% | 2.112 |

At higher batch sizes, our evaluation shows performance that matches or slightly exceeds the baseline.

## Citation

**BibTeX:**

```bibtex
@misc{eldenk2026attentiondrift,
  title={Attention Drift: What Autoregressive Speculative Decoding Models Learn},
  author={Doğaç Eldenk and Payal Mohapatra and Yigitcan Comlek and Kaan Oktay and Hongyang Zhang and Stephen Xia},
  year={2026},
  eprint={2605.09992},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.09992},
}
```

## Acknowledgements

We would like to thank fal and Lambda for their support.
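
## Example Request

For reference, here is a minimal client sketch against the server launched in the Uses section. It assumes the server is running locally on port 30000 and uses SGLang's OpenAI-compatible endpoint; the prompt and sampling parameters are placeholders, not values from the paper.

```python
# Minimal client sketch (assumes the SGLang server from the Uses section is
# running on localhost:30000). SGLang exposes an OpenAI-compatible API, so the
# standard `openai` client can be used as-is.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # The served base model; the EAGLE-3 drafter is applied server-side.
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

Speculative decoding is transparent to the client: requests look identical to a non-speculative deployment, and the acceptance-length and throughput gains reported above are realized entirely on the server side.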