---
library_name: speculators
base_model:
- poolside/Laguna-XS.2
license: apache-2.0
tags:
- speculative-decoding
- dflash
- speculators
---

# poolside/Laguna-XS.2-speculator.dflash
|
|
This is a DFlash speculator model for [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2).
|
|
## Training Details
|
|
This model was trained with the [Speculators](https://github.com/vllm-project/speculators) library on a combination of [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). Responses were regenerated with Laguna-XS.2 (reasoning enabled).
|
|
## Model Specifications
|
|
| | |
|---|---|
| **Base Model** | poolside/Laguna-XS.2 |
| **Chat Template** | poolside/Laguna-XS.2 (use `/chat/completions` endpoint) |
| **Format** | Safetensors |
| **License** | Apache 2.0 |
| **Validation Hardware** | NVIDIA A100 |
|
|
## Deployment
|
|
```bash
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

# Deploy with speculative decoding
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --tool-call-parser poolside_v1 \
    --reasoning-parser poolside_v1 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": true}' \
    --speculative-config '{
        "model": "poolside/Laguna-XS.2-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'
```
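Once the server is running, it exposes an OpenAI-compatible API, and per the specifications above the chat template should be exercised through the `/chat/completions` endpoint. A minimal client-side sketch (the prompt is a placeholder, and the network call itself is shown only as a comment so the snippet stands alone; the speculative config is built in Python to avoid shell-quoting mistakes):

```python
import json

# The --speculative-config argument in the launch command above must be valid
# JSON; constructing it as a dict and serializing it guarantees that.
speculative_config = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}

# OpenAI-compatible request body for the /chat/completions endpoint.
# Reasoning is controlled server-side via --default-chat-template-kwargs.
request_body = {
    "model": "poolside/Laguna-XS.2",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 512,
}

payload = json.dumps(request_body)

# To actually send the request (assumes the server above is on localhost:8000):
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$PAYLOAD"
print(json.dumps(speculative_config))
```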
|
|
## Preliminary Evaluations
|
|
Per-position token acceptance rates across evaluation datasets, measured with reasoning enabled:
|
|
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---------|-------|-------|-------|-------|-------|-------|-------|------------|
| HumanEval | 74.0% | 48.6% | 29.9% | 17.7% | 9.9% | 5.1% | 2.4% | 2.876 |
| math_reasoning | 76.9% | 53.2% | 34.6% | 21.2% | 12.1% | 6.0% | 2.6% | 3.066 |
| qa | 68.5% | 41.8% | 24.8% | 14.7% | 8.4% | 4.6% | 2.2% | 2.650 |
| question | 70.6% | 44.1% | 26.2% | 15.0% | 8.4% | 4.5% | 2.3% | 2.711 |
| rag | 71.7% | 45.7% | 27.6% | 16.0% | 8.9% | 4.8% | 2.3% | 2.770 |
| summarization | 68.8% | 40.8% | 22.7% | 12.3% | 6.5% | 3.3% | 1.5% | 2.559 |
| translation | 70.8% | 44.3% | 25.0% | 13.0% | 6.5% | 3.1% | 1.2% | 2.639 |
| writing | 70.9% | 44.6% | 26.8% | 15.8% | 9.4% | 5.4% | 2.3% | 2.752 |
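The Avg Length column is consistent with one plus the sum of the per-position acceptance rates: each verification step always yields one token from the base model, and each draft position contributes a token with the probability that it is accepted. A quick check against the HumanEval row:

```python
# Per-position acceptance rates for the HumanEval row of the table above.
humaneval_rates = [0.740, 0.486, 0.299, 0.177, 0.099, 0.051, 0.024]

# Expected tokens per verification step: 1 guaranteed base-model token,
# plus one token per draft position, weighted by its acceptance rate.
avg_length = 1.0 + sum(humaneval_rates)

print(round(avg_length, 3))  # 2.876, matching the table
```

The same identity reproduces the other rows (e.g. math_reasoning: 1 + 2.066 = 3.066), so Avg Length here is the mean number of tokens generated per forward pass of the base model.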

## References

**Paper**: [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036)