Add vLLM and Transformers usage snippets
README.md CHANGED

@@ -132,11 +132,56 @@ Thanks to support from Ollama and the mlx-lm team...
#### vLLM

Serve Laguna XS.2 locally with vLLM and query it from any OpenAI-compatible client (see [Controlling reasoning](#controlling-reasoning) for tool calls, streaming, and reasoning extraction):

```shell
pip install "vllm>=<PENDING_VERSION>"

vllm serve poolside/Laguna-XS.2 \
  --max-model-len 131072 \
  --default-chat-template-kwargs '{"enable_thinking": true}'
```
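
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch with the official `openai` Python package, assuming vLLM's default address (`http://localhost:8000/v1`) and a placeholder API key since the server above is started without authentication:

```python
# Minimal sketch: query the vLLM server above through its OpenAI-compatible API.
# Base URL assumes vLLM's default port; "EMPTY" is a placeholder API key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="poolside/Laguna-XS.2",
    messages=[
        {"role": "user", "content": "Write a Python retry wrapper with exponential backoff."},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
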
#### Transformers

> Requires `transformers >= <PENDING_VERSION>` (Laguna support lands in the upcoming release; until then, install from source).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "poolside/Laguna-XS.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a Python retry wrapper with exponential backoff."},
]

# Reasoning is on by default; pass enable_thinking=False to skip the <think> block.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_k=20,
)

response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```
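
With thinking enabled, the decoded `response` starts with the model's reasoning in a `<think>...</think>` block (per the comment in the snippet above). A minimal sketch for separating the reasoning from the final answer, assuming the tags survive decoding exactly as written:

```python
# Minimal sketch, assuming the reasoning arrives in a single leading
# <think>...</think> block; adjust if the tags are stripped during decoding.
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block is found."""
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len("<think>"):end].strip()
    answer = text[end + len("</think>"):].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(response)
print(answer)
```
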
#### [Other frameworks]