---
library_name: speculators
base_model:
- poolside/Laguna-XS.2
license: apache-2.0
tags:
- speculative-decoding
- dflash
- speculators
---

# poolside/Laguna-XS.2-speculator.dflash
|
|
This is a DFlash speculator model for [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2).
|
|
## Training Details
|
|
This model was trained with the [Speculators](https://github.com/vllm-project/speculators) library on a combination of [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). Responses were regenerated with Laguna-XS.2 (reasoning enabled).
|
|
## Model Specifications
|
|
| | |
|---|---|
| **Base Model** | poolside/Laguna-XS.2 |
| **Chat Template** | poolside/Laguna-XS.2 (use `/chat/completions` endpoint) |
| **Format** | Safetensors |
| **License** | Apache 2.0 |
| **Validation Hardware** | NVIDIA A100 |
|
|
## Deployment
|
|
```bash
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

# Deploy with speculative decoding
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --tool-call-parser poolside_v1 \
    --reasoning-parser poolside_v1 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": true}' \
    --speculative-config '{
        "model": "poolside/Laguna-XS.2-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'
```
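Once the server is running, it exposes an OpenAI-compatible API, and per the specifications above the chat template should be exercised through the `/chat/completions` endpoint. A minimal client-side sketch (the prompt is a placeholder, and the network call itself is shown only as a comment so the snippet stands alone; the speculative config is built in Python to avoid shell-quoting mistakes):

```python
import json

# The --speculative-config argument in the launch command above must be valid
# JSON; constructing it as a dict and serializing it guarantees that.
speculative_config = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}

# OpenAI-compatible request body for the /chat/completions endpoint.
# Reasoning is controlled server-side via --default-chat-template-kwargs.
request_body = {
    "model": "poolside/Laguna-XS.2",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 512,
}

payload = json.dumps(request_body)

# To actually send the request (assumes the server above is on localhost:8000):
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$PAYLOAD"
print(json.dumps(speculative_config))
```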
|
|
## Preliminary Evaluations
|
|
Per-position token acceptance rates across evaluation datasets, measured with reasoning enabled:
|
|
| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---------|-------|-------|-------|-------|-------|-------|-------|------------|
| HumanEval | 74.0% | 48.6% | 29.9% | 17.7% | 9.9% | 5.1% | 2.4% | 2.876 |
| math_reasoning | 76.9% | 53.2% | 34.6% | 21.2% | 12.1% | 6.0% | 2.6% | 3.066 |
| qa | 68.5% | 41.8% | 24.8% | 14.7% | 8.4% | 4.6% | 2.2% | 2.650 |
| question | 70.6% | 44.1% | 26.2% | 15.0% | 8.4% | 4.5% | 2.3% | 2.711 |
| rag | 71.7% | 45.7% | 27.6% | 16.0% | 8.9% | 4.8% | 2.3% | 2.770 |
| summarization | 68.8% | 40.8% | 22.7% | 12.3% | 6.5% | 3.3% | 1.5% | 2.559 |
| translation | 70.8% | 44.3% | 25.0% | 13.0% | 6.5% | 3.1% | 1.2% | 2.639 |
| writing | 70.9% | 44.6% | 26.8% | 15.8% | 9.4% | 5.4% | 2.3% | 2.752 |
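The Avg Length column is consistent with one plus the sum of the per-position acceptance rates: each verification step always yields one token from the base model, and each draft position contributes a token with the probability that it is accepted. A quick check against the HumanEval row:

```python
# Per-position acceptance rates for the HumanEval row of the table above.
humaneval_rates = [0.740, 0.486, 0.299, 0.177, 0.099, 0.051, 0.024]

# Expected tokens per verification step: 1 guaranteed base-model token,
# plus one token per draft position, weighted by its acceptance rate.
avg_length = 1.0 + sum(humaneval_rates)

print(round(avg_length, 3))  # 2.876, matching the table
```

The same identity reproduces the other rows (e.g. math_reasoning: 1 + 2.066 = 3.066), so Avg Length here is the mean number of tokens generated per forward pass of the base model.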

## References

**Paper**: [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036)