---
license: apache-2.0
tags:
- dflash
- speculators
- gemma4
- redhat
name: RedHatAI/gemma-4-31B-it-speculator.dflash
---

# RedHatAI/gemma-4-31B-it-speculator.dflash

This is a preliminary (and subject to change) [DFlash](https://arxiv.org/abs/2602.06036) speculator model for [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).

It was trained using the [Speculators](https://github.com/vllm-project/speculators) library on a combination of the [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) dataset and the `train_sft` split of the [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset, i.e. **Magpie + UltraChat prompts with responses regenerated by the gemma-4-31B-it model (no reasoning)**.

This model should be used with the [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) chat template, specifically through the `/chat/completions` endpoint.
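As a sketch of what a request to that endpoint looks like (the host and port are assumptions — `http://localhost:8000` is merely vLLM's default):

```python
import json

# Hypothetical request body for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint; the chat template is applied server-side.
payload = {
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "temperature": 0,
}
print(json.dumps(payload, indent=2))
```

Send it with `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json`, or use any OpenAI-compatible client library.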

## Notes

This model was validated on NVIDIA H100 GPUs; validation on other hardware is pending.

We are continuing to train this model and will publish updated evaluations and new weights in the future.

## Deployment

Deploy with vLLM (main/nightly) using the speculator as the draft model.

<details><summary>First install vLLM nightly</summary>

```bash
uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly
```

</details>

Then run:

```bash
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
```

It can also be deployed with a quantized verifier for even better speedups:

```bash
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
}'
```
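The `--speculative-config` value above is plain JSON and can be inspected as such; `num_speculative_tokens` caps how many draft tokens the speculator proposes per verifier step:

```python
import json

# The --speculative-config value from the command above.
spec_config = json.loads('''{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
}''')

# Up to 8 draft tokens are proposed per step, which is why the
# evaluation table reports acceptance rates for positions 0-7.
print(spec_config["num_speculative_tokens"])
```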

## Preliminary Evaluations

Evaluation command:

```bash
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf --tokenizer google/gemma-4-31B-it \
  --dataset-path "philschmid/mt-bench" --num-prompts 80 \
  --max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
  --hf-output-len 2048 \
  --temperature 0 --save-result --save-detailed
```

### Per-Position Acceptance Rate

| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|:-------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:----------:|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41.0% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13.0% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7.0% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12.0% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15.0% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13.0% | 9.4% | 6.5% | 3.33 |
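As a sanity check on these numbers, the reported average length is consistent with one bonus token from the verifier per step plus the sum of the per-position acceptance rates (shown here for the HumanEval row):

```python
# Avg. Length ≈ 1 (the verifier's own token each step) + the expected
# number of accepted draft tokens, i.e. the sum of per-position rates.
humaneval_rates = [85.8, 72.1, 60.3, 50.4, 41.8, 34.3, 26.9, 19.6]  # percent

avg_length = 1 + sum(rate / 100 for rate in humaneval_rates)
print(round(avg_length, 2))  # 4.91, matching the table
```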