---
license: apache-2.0
tags:
  - dflash
  - speculators
  - gemma4
  - redhat
name: RedHatAI/gemma-4-31B-it-speculator.dflash
---

# RedHatAI/gemma-4-31B-it-speculator.dflash

This is a preliminary (and subject to change) DFlash speculator model for google/gemma-4-31B-it.

It was trained with the Speculators library on a combination of the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the train_sft split of the HuggingFaceH4/ultrachat_200k dataset, with responses regenerated by gemma-4-31B-it (no reasoning).

This model should be used with the google/gemma-4-31b-it chat template, specifically through the /chat/completions endpoint.
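Because the verifier applies the chat template server-side, clients interact with the deployment like any chat model and speculative decoding is transparent to them. A minimal sketch of constructing such a request body (the helper name is illustrative; the payload follows vLLM's OpenAI-compatible `/v1/chat/completions` schema):

```python
import json

MODEL = "RedHatAI/gemma-4-31B-it-speculator.dflash"

def build_chat_request(prompt: str, model: str = MODEL) -> str:
    """Serialize a /v1/chat/completions request body.

    The server applies the gemma chat template, so the client only
    supplies role/content messages.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding, matching the evaluation setup
    }
    return json.dumps(body)

# POST this body to the /v1/chat/completions endpoint of the vLLM server.
print(build_chat_request("Write a haiku about speculative decoding."))
```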

**Note:**

- Validated on NVIDIA H100; validation on other hardware is pending.
- We are continuing to train this model and will publish updated weights and additional evaluations in the future.

## Deployment

Deploy with vLLM (main/nightly) using the speculator as a draft model.

First, install the vLLM nightly build:

```shell
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```

Then run:

```shell
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
```

It can also be deployed with a quantized verifier for even better speedups:

```shell
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --speculative-config '{
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash"
  }'
```

## Preliminary Evaluations

Evaluation command:

```shell
vllm bench serve --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf --tokenizer google/gemma-4-31B-it \
  --dataset-path "philschmid/mt-bench" --num-prompts 80 \
  --max-concurrency 1 --model RedHatAI/gemma-4-31B-it-speculator.dflash \
  --hf-output-len 2048 \
  --temperature 0 --save-result --save-detailed
```

### Per-Position Acceptance Rate

| Dataset | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Avg. Length |
|---|---|---|---|---|---|---|---|---|---|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| math_reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
| qa | 67.5% | 41.0% | 23.8% | 13.8% | 8.1% | 4.5% | 2.6% | 1.3% | 2.63 |
| question | 75.1% | 51.1% | 34.7% | 24.5% | 17.9% | 13.0% | 9.4% | 6.5% | 3.32 |
| rag | 76.1% | 54.8% | 39.8% | 28.7% | 19.9% | 12.9% | 7.0% | 3.8% | 3.43 |
| summarization | 67.3% | 39.9% | 22.3% | 12.0% | 6.4% | 3.1% | 1.5% | 0.7% | 2.53 |
| tool_call | 65.7% | 45.7% | 31.6% | 21.7% | 15.0% | 9.6% | 6.2% | 3.6% | 2.99 |
| translation | 73.4% | 51.4% | 35.3% | 23.6% | 15.6% | 9.3% | 5.4% | 2.6% | 3.17 |
| writing | 75.3% | 51.6% | 35.1% | 24.5% | 17.8% | 13.0% | 9.4% | 6.5% | 3.33 |
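The Avg. Length column is consistent with summing the per-position acceptance rates and adding one for the verifier's bonus token, a standard accounting for speculative decoding throughput. A sketch of that calculation (the helper is our own illustration, not part of the evaluation tooling):

```python
def avg_accepted_length(per_position_rates: list[float]) -> float:
    """Expected tokens emitted per verifier step: one bonus token from
    the verifier plus the expected number of accepted draft tokens,
    i.e. 1 + sum of per-position acceptance rates."""
    return 1.0 + sum(per_position_rates)

# HumanEval row from the table above, with rates as fractions.
humaneval = [0.858, 0.721, 0.603, 0.504, 0.418, 0.343, 0.269, 0.196]
print(round(avg_accepted_length(humaneval), 2))  # 4.91, matching the table
```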