Update non-Hopper FP8-KV serve command and link to vLLM recipes page
Brings the non-Hopper FP8-KV workaround command in line with the new `vllm serve` flags from the main model card and adds `--moe_backend marlin`. Also links to https://recipes.vllm.ai/poolside/Laguna-XS.2.
README.md (CHANGED)

```diff
@@ -94,17 +94,20 @@ Laguna XS.2 is supported in vLLM and Transformers, and TRT-LLM thanks to the sup
 
 #### vLLM
 
-The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-FP8` substituted for the model ID. No extra flags required.
+The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2) and on the [vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-FP8` substituted for the model ID. No extra flags required.
 
 > [!NOTE]
 > Please note that, during testing, we discovered that models with FP8-quantised KV caches can produce scrambled output when deployed on non-Hopper GPUs. We are actively investigating this issue with the vLLM team, but in the meantime, you can circumvent this issue by explicitly disabling FP8 KV cache (Laguna XS.2 has 40 layers, so list every layer in `--kv-cache-dtype-skip-layers`):
 >
 > ```shell
-> vllm serve
-> --
-> --
+> VLLM_USE_DEEP_GEMM=0 vllm serve \
+> --model poolside/Laguna-XS.2-FP8 \
+> --tool-call-parser poolside_v1 \
 > --reasoning-parser poolside_v1 \
-> --
+> --enable-auto-tool-choice \
+> --served-model-name laguna \
+> --kv-cache-dtype-skip-layers $(seq 0 39) \
+> --moe_backend marlin
 > ```
 >
 > The [BF16 checkpoint](https://huggingface.co/poolside/Laguna-XS.2) is unaffected as it does not declare an FP8 KV cache.
```
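The updated workaround line relies on shell command substitution to enumerate all 40 layer indices for `--kv-cache-dtype-skip-layers`. As a quick sanity check of what `$(seq 0 39)` actually hands to `vllm serve` (this only inspects the shell expansion; `vllm` itself is not needed):

```shell
# seq 0 39 emits the integers 0 through 39, one per line.
# Unquoted command substitution word-splits that output, so the
# flag receives 40 separate arguments.
layers=$(seq 0 39)

set -- $layers             # load the expansion into the positional parameters
echo "argument count: $#"  # expect 40
echo "first: $1  last: ${40}"
```

If the flag instead expects a single comma-separated value, the same idiom works with `$(seq -s, 0 39)`.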
|