---
license: apache-2.0
tags:
- dflash
- speculators
- gemma4
- redhat
name: RedHatAI/gemma-4-31B-it-speculator.dflash
---

# RedHatAI/gemma-4-31B-it-speculator.dflash

This is a preliminary (and subject to change) [DFlash](https://arxiv.org/abs/2602.06036) speculator model for [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it).

It was trained using the [Speculators](https://github.com/vllm-project/speculators) library on a combination of the [Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) dataset and the `train_sft` split of the [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset, i.e. **Magpie + UltraChat prompts with responses regenerated by the gemma-4-31B-it model (no reasoning)**.

This model should be used with the [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) chat template, specifically through the `/chat/completions` endpoint.
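As a sketch of what a request to that endpoint looks like (the host and port are assumptions — `http://localhost:8000` is merely vLLM's default):

```python
import json

# Hypothetical request body for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint; the chat template is applied server-side.
payload = {
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "temperature": 0,
}
print(json.dumps(payload, indent=2))
```

Send it with `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json`, or use any OpenAI-compatible client library.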

## Notes

This model was validated on NVIDIA H100 GPUs; validation on other hardware is pending.

We are continuing to train this model and will publish updated evaluations and new weights in the future.

## Deployment

Deploy with vLLM (main/nightly) using the speculator as the draft model.

<details><summary>First install vLLM nightly</summary>

```bash
uv pip install -U vllm \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly
```

</details>

Then run:

```bash
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
```

It can also be deployed with a quantized verifier for even better speedups:

```bash
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
}'
```
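The `--speculative-config` value above is plain JSON and can be inspected as such; `num_speculative_tokens` caps how many draft tokens the speculator proposes per verifier step:

```python
import json

# The --speculative-config value from the command above.
spec_config = json.loads('''{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
}''')

# Up to 8 draft tokens are proposed per step, which is why the
# evaluation table reports acceptance rates for positions 0-7.
print(spec_config["num_speculative_tokens"])
```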

## Preliminary Evaluations

Evaluation command:

```bash
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf --tokenizer google/gemma-4-31B-it \
  --dataset-path "philschmid/mt-bench" --num-prompts 80 \
  --max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
  --hf-output-len 2048 \
  --temperature 0 --save-result --save-detailed
```

### Per-Position Acceptance Rate

| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|:-------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:----------:|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41.0% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13.0% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7.0% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12.0% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15.0% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13.0% | 9.4% | 6.5% | 3.33 |
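As a sanity check on these numbers, the reported average length is consistent with one bonus token from the verifier per step plus the sum of the per-position acceptance rates (shown here for the HumanEval row):

```python
# Avg. Length ≈ 1 (the verifier's own token each step) + the expected
# number of accepted draft tokens, i.e. the sum of per-position rates.
humaneval_rates = [85.8, 72.1, 60.3, 50.4, 41.8, 34.3, 26.9, 19.6]  # percent

avg_length = 1 + sum(rate / 100 for rate in humaneval_rates)
print(round(avg_length, 2))  # 4.91, matching the table
```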