---
library_name: speculators
base_model:
- poolside/Laguna-XS.2
license: apache-2.0
tags:
- speculative-decoding
- dflash
- speculators
---

# poolside/Laguna-XS.2-speculator.dflash

This is a DFlash speculator model for [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2).

## Training Details

This model was trained using the [Speculators](https://github.com/vllm-project/speculators) library on a combination of [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). Responses in both datasets were regenerated with Laguna-XS.2 (reasoning enabled).

## Model Specifications

| | |
|---|---|
| **Base Model** | poolside/Laguna-XS.2 |
| **Chat Template** | poolside/Laguna-XS.2 (use `/chat/completions` endpoint) |
| **Format** | Safetensors |
| **License** | Apache 2.0 |
| **Validation Hardware** | NVIDIA A100 |

## Deployment

```bash
# Install vLLM from the required PR
pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

# Deploy with speculative decoding
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --tool-call-parser poolside_v1 \
    --reasoning-parser poolside_v1 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": true}' \
    --speculative-config '{
        "model": "poolside/Laguna-XS.2-speculator.dflash",
        "num_speculative_tokens": 7,
        "method": "dflash"
    }'
```
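Once the server is up, requests go through the OpenAI-compatible `/chat/completions` endpoint, as noted in the specifications above. A minimal sketch of a request payload, assuming the default vLLM port 8000 (the prompt text is illustrative):

```python
import json

# Sketch of a /chat/completions request body for the server started above.
payload = {
    "model": "poolside/Laguna-XS.2",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    # Thinking is already on by default via --default-chat-template-kwargs;
    # it can also be set per request:
    "chat_template_kwargs": {"enable_thinking": True},
}
print(json.dumps(payload, indent=2))
# To send it (requires the `requests` package):
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```

Speculative decoding is transparent to clients: responses have the same shape as without the speculator, only latency changes.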

## Preliminary Evaluations

Per-position token acceptance rates across datasets (with reasoning enabled):

| Dataset | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg Length |
|---------|-------|-------|-------|-------|-------|-------|-------|------------|
| HumanEval | 74.0% | 48.6% | 29.9% | 17.7% | 9.9% | 5.1% | 2.4% | 2.876 |
| math_reasoning | 76.9% | 53.2% | 34.6% | 21.2% | 12.1% | 6.0% | 2.6% | 3.066 |
| qa | 68.5% | 41.8% | 24.8% | 14.7% | 8.4% | 4.6% | 2.2% | 2.650 |
| question | 70.6% | 44.1% | 26.2% | 15.0% | 8.4% | 4.5% | 2.3% | 2.711 |
| rag | 71.7% | 45.7% | 27.6% | 16.0% | 8.9% | 4.8% | 2.3% | 2.770 |
| summarization | 68.8% | 40.8% | 22.7% | 12.3% | 6.5% | 3.3% | 1.5% | 2.559 |
| translation | 70.8% | 44.3% | 25.0% | 13.0% | 6.5% | 3.1% | 1.2% | 2.639 |
| writing | 70.9% | 44.6% | 26.8% | 15.8% | 9.4% | 5.4% | 2.3% | 2.752 |
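
The **Avg Length** column is consistent with 1 plus the sum of the per-position acceptance rates, the extra token being the one the target model itself emits at each verification step. A quick check against three rows of the table:

```python
# Per-position acceptance rates from the table above, as fractions.
rates = {
    "HumanEval":      [0.740, 0.486, 0.299, 0.177, 0.099, 0.051, 0.024],
    "math_reasoning": [0.769, 0.532, 0.346, 0.212, 0.121, 0.060, 0.026],
    "summarization":  [0.688, 0.408, 0.227, 0.123, 0.065, 0.033, 0.015],
}

for name, r in rates.items():
    # Each step accepts some prefix of the 7 draft tokens, plus one token
    # from the target model's own forward pass, hence the +1.
    avg_len = 1 + sum(r)
    print(f"{name}: {avg_len:.3f}")
# → HumanEval: 2.876, math_reasoning: 3.066, summarization: 2.559
```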

## References

**Paper**: [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036)