---
license: apache-2.0
tags:
  - dflash
  - speculators
  - gemma4
  - redhat
name: RedHatAI/gemma-4-31B-it-speculator.dflash
---

# RedHatAI/gemma-4-31B-it-speculator.dflash

This is a preliminary (and subject to change) DFlash speculator model for google/gemma-4-31B-it.

It was trained with the Speculators library on a combination of the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the train_sft split of the HuggingFaceH4/ultrachat_200k dataset, with responses regenerated by gemma-4-31B-it (no reasoning).

This model should be used with the google/gemma-4-31b-it chat template, specifically through the /chat/completions endpoint.
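Because the verifier applies the chat template server-side, clients interact with the deployment like any chat model and speculative decoding is transparent to them. A minimal sketch of constructing such a request body (the helper name is illustrative; the payload follows vLLM's OpenAI-compatible `/v1/chat/completions` schema):

```python
import json

MODEL = "RedHatAI/gemma-4-31B-it-speculator.dflash"

def build_chat_request(prompt: str, model: str = MODEL) -> str:
    """Serialize a /v1/chat/completions request body.

    The server applies the gemma chat template, so the client only
    supplies role/content messages.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding, matching the evaluation setup
    }
    return json.dumps(body)

# POST this body to the /v1/chat/completions endpoint of the vLLM server.
print(build_chat_request("Write a haiku about speculative decoding."))
```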

**Note:**

- Validated on NVIDIA H100; validation on other hardware is pending.
- We are continuing to train this model and will publish updated weights and additional evaluations in the future.

## Deployment

Deploy with vLLM (main/nightly) using the speculator as a draft model.

First, install the vLLM nightly build:

```shell
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```

Then run:

```shell
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
```

It can also be deployed with a quantized verifier for even better speedups:

```shell
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --speculative-config '{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
  }'
```

## Preliminary Evaluations

Evaluation command:

```shell
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf --tokenizer google/gemma-4-31B-it \
  --dataset-path "philschmid/mt-bench" --num-prompts 80 \
  --max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
  --hf-output-len 2048 \
  --temperature 0 --save-result --save-detailed
```

### Per-Position Acceptance Rate

| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|---|---|---|---|---|---|---|---|---|---|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41.0% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13.0% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7.0% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12.0% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15.0% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13.0% | 9.4% | 6.5% | 3.33 |
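The Avg. Length column is consistent with summing the per-position acceptance rates and adding one for the verifier's bonus token, a standard accounting for speculative decoding throughput. A sketch of that calculation (the helper is our own illustration, not part of the evaluation tooling):

```python
def avg_accepted_length(per_position_rates: list[float]) -> float:
    """Expected tokens emitted per verifier step: one bonus token from
    the verifier plus the expected number of accepted draft tokens,
    i.e. 1 + sum of per-position acceptance rates."""
    return 1.0 + sum(per_position_rates)

# HumanEval row from the table above, with rates as fractions.
humaneval = [0.858, 0.721, 0.603, 0.504, 0.418, 0.343, 0.269, 0.196]
print(round(avg_accepted_length(humaneval), 2))  # 4.91, matching the table
```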