---
license: apache-2.0
tags:
- dflash
- speculators
- gemma4
- redhat
name: RedHatAI/gemma-4-31B-it-speculator.dflash
---

# RedHatAI/gemma-4-31B-it-speculator.dflash

This is a preliminary (and subject to change) [DFlash](https://arxiv.org/abs/2602.06036) speculator model for [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).

It was trained using the [Speculators](https://github.com/vllm-project/speculators) library on a combination of the [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) dataset and the `train_sft` split of the [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset, with **responses regenerated by the gemma-4-31B-it model (no reasoning)**.

This model should be used with the [google/gemma-4-31b-it](https://huggingface.co/google/gemma-4-31B-it) chat template, specifically through the `/chat/completions` endpoint.

## Notes

This model was validated on NVIDIA H100 GPUs; validation on other hardware is pending.

We are continuing to train this model and will update it with additional evaluations and new weights in the future.

## Deployment

Deploy with vLLM (main/nightly) using the speculator as a draft model.

<details><summary>First, install the vLLM nightly build</summary>
  
```bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```

</details>

Then run:

```bash
vllm serve RedHatAI/gemma-4-31B-it-speculator.dflash --tensor-parallel-size 2
```
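Once the server is up, requests go through the OpenAI-compatible `/v1/chat/completions` endpoint, and the server applies the gemma-4-31b-it chat template to the `messages` list; speculative decoding with the draft model happens transparently on the server side. A minimal request-body sketch (the prompt and sampling parameters are illustrative, and the server is assumed to listen on vLLM's default port 8000):

```python
import json

# Minimal request body for vLLM's OpenAI-compatible /v1/chat/completions
# endpoint. The server applies the gemma-4-31b-it chat template to
# `messages`; no client-side templating is needed.
payload = {
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "messages": [
        {"role": "user", "content": "Write a haiku about fast inference."},
    ],
    "max_tokens": 128,
    "temperature": 0.0,
}

print(json.dumps(payload, indent=2))
```

The serialized payload can then be sent with any HTTP client, e.g. `curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json`.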

It can also be deployed with a quantized verifier for even better speedups:

```bash
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --speculative-config '{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
  }'
```

## Preliminary Evaluations

Evaluation command:
```bash
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf --tokenizer google/gemma-4-31B-it \
  --dataset-path "philschmid/mt-bench" --num-prompts 80 \
  --max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
  --hf-output-len 2048 \
  --temperature 0 --save-result --save-detailed
```

**Per-Position Acceptance Rate**

| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|:-------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:----------:|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41.0% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13.0% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7.0% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12.0% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15.0% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13.0% | 9.4% | 6.5% | 3.33 |
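As a sanity check on the table, the "Avg. Length" column appears consistent (up to rounding) with one plus the sum of the per-position acceptance rates: one guaranteed token from the verifier per step, plus the expected number of accepted draft tokens. A quick check against the HumanEval row:

```python
# Per-position acceptance rates taken from the HumanEval row above.
rates = [0.858, 0.721, 0.603, 0.504, 0.418, 0.343, 0.269, 0.196]

# Expected tokens emitted per verifier step: one guaranteed verifier token
# plus the expected number of accepted draft tokens (by linearity of
# expectation, the sum of the per-position acceptance rates).
avg_length = 1 + sum(rates)
print(round(avg_length, 2))  # -> 4.91, matching the table
```

The same relationship holds for the other rows, e.g. math_reasoning (1 + 4.17 ≈ 5.17).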