---
license: apache-2.0
tags:
- dflash
- speculators
- gemma4
- redhat
name: RedHatAI/gemma-4-31B-it-speculator.dflash
---
# RedHatAI/gemma-4-31B-it-speculator.dflash
This is a preliminary (and subject to change) [DFlash](https://arxiv.org/abs/2602.06036) speculator model for [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).
It was trained using the [Speculators](https://github.com/vllm-project/speculators) library on a combination of the [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) dataset and the `train_sft` split of the [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset, with **responses regenerated by the gemma-4-31B-it model (no reasoning)**.
This model should be used with the [google/gemma-4-31b-it](https://huggingface.co/google/gemma-4-31B-it) chat template, specifically through the `/chat/completions` endpoint.
## Note
This model has been validated on NVIDIA H100 GPUs; validation on other hardware is pending.
We are continuing to train this model and will update with more evaluations and new weights in the future.
## Deployment
Deploy with vLLM (main/nightly) using the speculator as a draft model.
<details><summary>First install the vLLM nightly build</summary>

```bash
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
```
</details>
Then run:
```bash
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
```
It can also be deployed with a quantized verifier for even better speedups:
```bash
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
"model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
"num_speculative_tokens": 8,
"method": "dflash"
}'
```
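Once either server is up, the speculator is transparent to clients: requests simply target the verifier's chat-completions endpoint. As a minimal sketch (the host, port, and message content below are placeholders, assuming the default local vLLM address):

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust host/port to match your deployment.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    # The served model name from the `vllm serve` command above.
    "model": "RedHatAI/gemma-4-31B-it-FP8-block",
    "messages": [
        {"role": "user", "content": "Write a haiku about speculative decoding."}
    ],
    "temperature": 0,
    "max_tokens": 128,
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once a server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

Note that the speculator model is never named in the request; speculative decoding is a server-side configuration.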
## Preliminary Evaluations
Evaluation command:
```bash
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
--dataset-name hf --tokenizer google/gemma-4-31B-it \
--dataset-path "philschmid/mt-bench" --num-prompts 80 \
--max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
--hf-output-len 2048 \
--temperature 0 --save-result --save-detailed
```
### Per-Position Acceptance Rate
| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|:-------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:----------:|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13% | 9.4% | 6.5% | 3.33 |
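As a sanity check on the table above, the **Avg. Length** column is consistent with one guaranteed (bonus) token per verifier step plus the expected number of accepted draft tokens, i.e. the sum of the per-position acceptance rates. A small illustrative computation (the rates below are the HumanEval and qa rows):

```python
# Per-position acceptance rates from the table above, as fractions.
humaneval_rates = [0.858, 0.721, 0.603, 0.504, 0.418, 0.343, 0.269, 0.196]
qa_rates = [0.675, 0.41, 0.238, 0.138, 0.081, 0.045, 0.026, 0.013]

def avg_accepted_length(rates):
    """Expected tokens generated per verifier step: one bonus token
    plus the expected count of accepted draft tokens (sum of rates)."""
    return 1.0 + sum(rates)

print(round(avg_accepted_length(humaneval_rates), 2))  # → 4.91
print(round(avg_accepted_length(qa_rates), 2))         # → 2.63
```

The spread between code/math workloads (~5 accepted tokens per step) and qa/summarization (~2.5) is typical: more structured, predictable outputs are easier for the draft model to anticipate.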