---
license: cc-by-4.0
datasets:
- Dogacel/nemotron-post-training-v2-gpt-oss-20b-regen
language:
- en
base_model:
- openai/gpt-oss-20b
---

# Model Card for specdrift-gpt-oss-20b-eagle3

EAGLE-3 drafter model for gpt-oss-20b, released as part of the paper _Attention Drift: What Autoregressive Speculative Decoding Models Learn_.
It has two minor architectural differences from the original EAGLE: the drafter's input hidden state is captured *after* the norm, and an additional norm is injected before the FC layer.
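
To make these two changes concrete, the sketch below shows where they sit in an EAGLE-style drafter input path. It is a minimal illustration with assumed layer types, names, and hidden size, not the actual implementation; see the repository linked below for the real code.

```python
import torch
import torch.nn as nn


class DrafterInputSketch(nn.Module):
    """Illustrative EAGLE-style drafter input path (hypothetical, not the released module)."""

    def __init__(self, hidden_size: int = 2880):  # hidden size chosen for illustration only
        super().__init__()
        self.target_norm = nn.RMSNorm(hidden_size)       # norm applied to the target model's hidden state
        self.pre_fc_norm = nn.RMSNorm(2 * hidden_size)   # additional norm injected before the FC
        self.fc = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, target_hidden: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # Difference 1: the drafter consumes the hidden state *after* the norm,
        # rather than the raw pre-norm residual stream.
        h = self.target_norm(target_hidden)
        x = torch.cat([token_embeds, h], dim=-1)
        # Difference 2: an extra norm is applied right before the FC projection.
        x = self.pre_fc_norm(x)
        return self.fc(x)  # this feeds the drafter decoder layer (omitted here)
```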

## Model Details

### Model Sources

- **Repository:** [Dogacel/SpecDrift](https://github.com/Dogacel/SpecDrift)
- **Paper:** https://arxiv.org/abs/2605.09992

## Uses

We recommend using SGLang to run the model:

```bash
export SGLANG_ENABLE_SPEC_V2=1

python -m sglang.launch_server \
    --model-path openai/gpt-oss-20b \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path "Dogacel/specdrift-gpt-oss-20b-eagle3" \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --speculative-draft-sliding-window 2048 \
    --port 30000 \
    --dp-size 1 --tp-size 1 \
    --max-running-requests 64 \
    --cuda-graph-max-bs 64 \
    --attention-backend fa3 \
    --trust-remote-code \
    --mem-fraction-static 0.9 --dtype bfloat16
```
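
Once the server is up, SGLang exposes an OpenAI-compatible API on the chosen port, so any OpenAI client can query it; drafting and verification happen transparently on the server side. The snippet below assumes the launch command above is running locally on port 30000.

```python
from openai import OpenAI

# Point the OpenAI client at the local SGLang server started above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # the target model; the EAGLE-3 drafter is applied server-side
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```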

## Training Details

### Training Data

This model was trained on the Nemotron Post-Training V2 dataset, with the answers regenerated using gpt-oss-20b.

The dataset is publicly available at https://huggingface.co/datasets/Dogacel/nemotron-post-training-v2-gpt-oss-20b-regen.
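
For a quick look at the data, it can be loaded with the `datasets` library. The split name below is an assumption; check the dataset card for the exact splits and column schema.

```python
from datasets import load_dataset

# Load the regenerated Nemotron post-training data (the "train" split is assumed here).
ds = load_dataset("Dogacel/nemotron-post-training-v2-gpt-oss-20b-regen", split="train")
print(ds)     # row count and column names
print(ds[0])  # one regenerated record
```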

### Training Procedure

We trained the model using [SpecForge](https://github.com/sgl-project/SpecForge) on 8×H200 GPUs in roughly 8 hours.

- **LR:** 1e-4 (warmup 0.2, cosine; see the schedule sketch after this list)
- **Epochs:** 2
- **Batch Size:** 4 per GPU (effective 4×8 = 32)
- **Max Length:** 4096
- **TTT (training-time test) steps:** 4
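
As a rough illustration of the schedule above, assuming the warmup value of 0.2 is a ratio of total training steps, the per-step learning rate would look like this (a sketch, not the actual training code):

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 1e-4, warmup_ratio: float = 0.2) -> float:
    """Linear warmup to peak_lr over the first warmup_ratio of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```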

## Evaluation

Evaluation was run on MT-Bench (80 prompts, max tokens 2048, temperature 0.7).

Evaluation scripts are available in [SpecForge PR #552](https://github.com/sgl-project/SpecForge/pull/552).

### H100 @ BS=1: Baseline vs Ours (1-3-1-4)

| Metric | Baseline | Ours (1-3-1-4) | Δ |
|---|---:|---:|---:|
| **Latency (s)** | 444.05 | **373.11** | −16.0% |
| **Throughput (tok/s)** | 304.93 | **371.90** | +22.0% |
| **Accept Length** | 1.000 | **2.347** | +134.7% |

### Per-Category Throughput (H100, BS=1)

| Category | Baseline → Ours | Δ | Accept Length |
|---|---:|---:|---:|
| Writing | 207.83 → 268.62 | +29.2% | 2.225 |
| Roleplay | 301.01 → 380.61 | +26.4% | 2.210 |
| Reasoning | 260.19 → 265.83 | +2.2% | 2.334 |
| Math | 170.41 → 190.53 | +11.8% | **2.894** |
| Coding | 427.36 → 487.45 | +14.1% | 2.672 |
| Extraction | 164.69 → 233.76 | **+41.9%** | 2.634 |
| STEM | 436.35 → 545.97 | +25.1% | 2.287 |
| Humanities | 471.61 → 602.40 | +27.7% | 2.112 |

Our evaluation at higher batch sizes shows that performance matches or slightly exceeds the baseline.

## Citation

**BibTeX:**

```bibtex
@misc{eldenk2026attentiondrift,
      title={Attention Drift: What Autoregressive Speculative Decoding Models Learn},
      author={Doğaç Eldenk and Payal Mohapatra and Yigitcan Comlek and Kaan Oktay and Hongyang Zhang and Stephen Xia},
      year={2026},
      eprint={2605.09992},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.09992},
}
```

## Acknowledgements

We would like to thank fal and Lambda for their support.