fynnsu committed · verified
Commit 6c8aa29 · 1 Parent(s): f8b1c87

Update README.md

Files changed (1): README.md (+15 −3)
README.md CHANGED
@@ -24,16 +24,28 @@ We are continuing to train this model and will update with more evaluations and
 
 ## Deployment
 
-Deploy with vLLM (main) using the speculator as a draft model:
+Deploy with vLLM (main/nightly) using the speculator as a draft model.
 
+<details><summary>First install vllm nightly</summary>
+
 ```bash
-vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash --attention-backend FLASH_ATTN
+uv pip install -U vllm \
+--torch-backend=auto \
+--extra-index-url https://wheels.vllm.ai/nightly
+```
+
+</details>
+
+Then run:
+
+```bash
+vllm serve -tp 2 RedHatAI/gemma-4-31B-it-speculator.dflash
 ```
 
 It can also be deployed with a quantized verifier for even better speedups:
 
 ```bash
-vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --attention-backend FLASH_ATTN --speculative-config '{
+vllm serve RedHatAI/gemma-4-31B-it-FP8-block --tensor-parallel-size 2 --speculative-config '{
   "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
   "num_speculative_tokens": 8,
   "method": "dflash"
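The `--speculative-config` value in the serve command above is a single JSON string. A minimal sketch (stdlib Python only, no vLLM required; the field values are copied from the diff) of building and sanity-checking that string before passing it on the command line:

```python
import json

# Speculative-decoding settings mirrored from the vllm serve command above.
spec_config = {
    "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",
    "num_speculative_tokens": 8,
    "method": "dflash",
}

# Serialize to the one-line JSON string that --speculative-config expects.
spec_json = json.dumps(spec_config)

# Round-trip to confirm it is valid JSON before shelling out to vllm serve.
assert json.loads(spec_json)["method"] == "dflash"
print(spec_json)
```

On the shell, wrapping the resulting string in single quotes (as the README does) avoids having to escape the embedded double quotes.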