joerowell committed (verified)
Commit 2c0d22b · 1 Parent(s): f82b43d

Update non-Hopper FP8-KV serve command and link to vLLM recipes page


Brings the non-Hopper FP8-KV workaround command in line with the new vllm serve flags from the main model card and adds --moe_backend marlin. Also links to https://recipes.vllm.ai/poolside/Laguna-XS.2.

Files changed (1): README.md (+8 −5)
README.md

````diff
@@ -94,17 +94,20 @@ Laguna XS.2-INT4 is supported in vLLM and Transformers. Use Laguna-XS.2 with Oll
 
 #### vLLM
 
-The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-INT4` substituted for the model ID. No extra flags required.
+The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface.co/poolside/Laguna-XS.2) and on the [vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2). Quantization is detected automatically from `quantization_config` in this checkpoint, so the same command works with `poolside/Laguna-XS.2-INT4` substituted for the model ID. No extra flags required.
 
 > [!NOTE]
 > Please note that, during testing, we discovered that models with FP8-quantised KV caches can produce scrambled output when deployed on non-Hopper GPUs. We are actively investigating this issue with the vLLM team, but in the meantime, you can circumvent this issue by explicitly disabling FP8 KV cache (Laguna XS.2 has 40 layers, so list every layer in `--kv-cache-dtype-skip-layers`):
 >
 > ```shell
-> vllm serve poolside/Laguna-XS.2-INT4 \
-> --kv-cache-dtype-skip-layers $(seq 0 39) \
-> --max-model-len 131072 \
+> VLLM_USE_DEEP_GEMM=0 vllm serve \
+> --model poolside/Laguna-XS.2-INT4 \
+> --tool-call-parser poolside_v1 \
 > --reasoning-parser poolside_v1 \
-> --tool-call-parser poolside_v1
+> --enable-auto-tool-choice \
+> --served-model-name laguna \
+> --kv-cache-dtype-skip-layers $(seq 0 39) \
+> --moe_backend marlin
 > ```
 >
 > The [BF16 checkpoint](https://huggingface.co/poolside/Laguna-XS.2) is unaffected as it does not declare an FP8 KV cache.
````
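The workaround command leans on `seq 0 39` expanding to the 40 layer indices 0 through 39, which the shell then word-splits into separate arguments for `--kv-cache-dtype-skip-layers`. A quick sanity check of that expansion (the layer count comes from the note above; `seq` is standard coreutils):

```shell
# Sanity-check the layer-index expansion used in the serve command.
# Laguna XS.2 has 40 layers, indexed 0 through 39.
layers=$(seq 0 39)

echo "$layers" | wc -l        # number of indices emitted: 40
echo "$layers" | head -n 1    # first index: 0
echo "$layers" | tail -n 1    # last index: 39
```

Note that `$(seq 0 39)` is left unquoted in the serve command on purpose, so each index becomes its own argument rather than one newline-joined string.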