Update README.md
Browse files
README.md
CHANGED
|
@@ -24,16 +24,28 @@ We are continuing to train this model and will update with more evaluations and
|
|
| 24 |
|
| 25 |
## Deployment
|
| 26 |
|
| 27 |
-
Deploy with vLLM (main) using the speculator as a draft model
|
| 28 |
|
|
|
|
|
|
|
| 29 |
```bash
|
| 30 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
```
|
| 32 |
|
| 33 |
It can also be deployed with a quantized verifier for even better speedups:
|
| 34 |
|
| 35 |
```bash
|
| 36 |
-
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2
|
| 37 |
"model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
|
| 38 |
"num_speculative_tokens": 8,
|
| 39 |
"method": "dflash"
|
|
|
|
| 24 |
|
| 25 |
## Deployment
|
| 26 |
|
| 27 |
+
Deploy with vLLM (main/nightly) using the speculator as a draft model.
|
| 28 |
|
| 29 |
+
<details><summary>First install vllm nightly</summary>
|
| 30 |
+
|
| 31 |
```bash
|
| 32 |
+
uv pip install -U vllm \
|
| 33 |
+
--torch-backend=auto \
|
| 34 |
+
--extra-index-url https://wheels.vllm.ai/nightly
|
| 35 |
+
```
|
| 36 |
+
|
| 37 |
+
</details>
|
| 38 |
+
|
| 39 |
+
Then run:
|
| 40 |
+
|
| 41 |
+
```bash
|
| 42 |
+
vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
|
| 43 |
```
|
| 44 |
|
| 45 |
It can also be deployed with a quantized verifier for even better speedups:
|
| 46 |
|
| 47 |
```bash
|
| 48 |
+
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
|
| 49 |
"model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
|
| 50 |
"num_speculative_tokens": 8,
|
| 51 |
"method": "dflash"
|