fynnsu committed · verified
Commit 6c8aa29 · 1 Parent(s): f8b1c87

Update README.md

Files changed (1): README.md (+15 −3)
README.md CHANGED
@@ -24,16 +24,28 @@ We are continuing to train this model and will update with more evaluations and
 
 ## Deployment
 
-Deploy with vLLM (main) using the speculator as a draft model:
+Deploy with vLLM (main/nightly) using the speculator as a draft model.
 
+<details><summary>First install vllm nightly</summary>
+
 ```bash
-vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash --attention-backend FLASH_ATTN
+uv pip install -U vllm \
+--torch-backend=auto \
+--extra-index-url https://wheels.vllm.ai/nightly
+```
+
+</details>
+
+Then run:
+
+```bash
+vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
 ```
 
 It can also be deployed with a quantized verifier for even better speedups:
 
 ```bash
-vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --attention-backend FLASH_ATTN --speculative-config '{
+vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
   "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
   "num_speculative_tokens": 8,
   "method": "dflash"
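The `--speculative-config` value in the serve command above is a single JSON string. A minimal sketch (stdlib Python only, no vLLM required; the field values are copied from the diff) of building and sanity-checking that string before passing it on the command line:

```python
import json

# Speculative-decoding settings mirrored from the vllm serve command above.
spec_config = {
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash",
}

# Serialize to the one-line JSON string that --speculative-config expects.
spec_json = json.dumps(spec_config)

# Round-trip to confirm it is valid JSON before shelling out to vllm serve.
assert json.loads(spec_json)["method"] == "dflash"
print(spec_json)
```

On the shell, wrapping the resulting string in single quotes (as the README does) avoids having to escape the embedded double quotes.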