---
license: apache-2.0
library_name: vllm
inference: false
extra_gated_description: >-
  To learn more about how we process your personal data, please read our <a
  href="https://poolside.ai/privacy">Privacy Policy</a>.
tags:
- laguna-xs.2
---

# Laguna XS.2
Laguna XS.2 is a 33B-total-parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding and long-horizon work on a local machine. It uses Sliding Window Attention with per-head gating in 30 of its 40 layers for fast inference and low KV cache requirements.

This is the instruct model, with native reasoning support and interleaved thinking. For the base model, see [Laguna XS.2-base](https://huggingface.co/poolside/Laguna-XS.2-base).

For more details on how we trained this model, including data automixing and async off-policy agent RL, check out our [release blog post]().

## Key features
- **Mixed SWA and global attention layout**: Laguna XS.2 uses sigmoid gating with per-layer rotary scales, enabling mixed SWA (Sliding Window Attention) and global attention layers in a 3:1 ratio across its 40 layers.
- **KV cache in FP8**: All quantization formats use a KV cache quantized to FP8, reducing memory per token.
- **Native reasoning support**: Interleaved thinking is enabled by default.
- **Local-ready**: At 33B total parameters and 3B activated, Laguna XS.2 is compact enough to run on a Mac with 36 GB of RAM. [Available on Ollama](https://ollama.com/library/laguna-xs.2)
- **Apache 2.0 license**: Use and modify freely for commercial and non-commercial purposes.

## Model overview

- Training: pre-training, post-training, and reinforcement learning stages (instruct)
- Number of parameters: 33B total, with 3B activated per token
- Optimizer: Muon
- Layers: 40 (10 with global attention, 30 with sliding window attention)
- Experts: 256 experts with 1 shared expert
- Sliding window: 512 tokens
- Modality: text-to-text
- Context window: 131,072 tokens
- Reasoning support: thinking enabled by default; interleaved thinking with preserved thinking supported

## Benchmark results

We evaluate Laguna XS.2 with thinking enabled in our agent harness, pool (see the Usage section below to download and run it locally), across all benchmarks. For other models, we use the best publicly reported score where available; otherwise, we compute baselines using OpenHands (SWE-bench family) or Terminus 2 (Terminal-Bench 2.0) with the settings below.

| Model | Size (total params.) | SWE-bench Pro | SWE-bench Verified | SWE-bench Multilingual | Terminal-Bench 2.0 |
|---------------------------|----------------------|---------------|--------------------|------------------------|--------------------|
| **Laguna XS.2** | 33B | xx.x% | xx.x% | xx.x% | xx.x% |
| Nemotron 3 Nano | 30B | xx.x% | xx.x% | xx.x% | xx.x% |
| Devstral Small 2 | 24B dense | - | 68.0% | 55.7% | 22.5% |
| Gemma 4 26B A4B IT | 26B | xx.x% | xx.x% | xx.x% | xx.x% |
| Gemma 4 31B IT | 31B dense | xx.x% | xx.x% | xx.x% | xx.x% |
| Qwen3.6-35B-A3B | 35B | 49.5% | 73.4% | 67.2% | 51.5% |
| Qwen3.6-27B | 27B dense | 53.2% | 77.2% | 71.3% | 59.3% |
| GPT-5.4 Nano | - | 52.4% | - | - | 46.3% |

\* SWE-bench series: [our configuration; any fixes applied, etc., avg. of k]. Nemotron 3 Nano and Gemma 4 models evaluated in OpenHands with [configuration]. Terminal-Bench 2.0: [our configuration; any fixes applied, etc.]. Nemotron 3 Nano and Gemma 4 models evaluated in Terminus 2 with [configuration].

## Usage

Laguna XS.2 has launch-day support in vLLM and Transformers, as well as in TRT-LLM and SGLang thanks to the support of the team at NVIDIA.

The fastest way to get started is with our API, directly or through OpenRouter, free for a limited time.

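For a quick first request, here is a minimal sketch against the OpenAI-compatible endpoint used in the examples below; the `POOLSIDE_API_KEY` variable name is illustrative:

```
curl https://inference.poolside.ai/v1/chat/completions \
  -H "Authorization: Bearer $POOLSIDE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "poolside/laguna-xs.2", "messages": [{"role": "user", "content": "Hello"}]}'
```
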
## pool

**pool** is a lightweight terminal-based coding agent and a dual [Agent Client Protocol](https://agentclientprotocol.com/get-started) client and server.

Download and install for macOS and Linux:

```
curl -fsSL https://downloads.poolside.ai/pool/install.sh | sh
```

Launch and *Log in with Poolside* to get a free API key:

```
pool
```

[Placeholder for screenshot]

Use pool in any [ACP client](https://agentclientprotocol.com/get-started/clients). Configure Zed and JetBrains automatically:

```
pool acp setup --editor zed|jetbrains
```

Use pool with Ollama with a one-command setup (requires Ollama 0.20.8 or later):

```
ollama pull laguna-xs.2

ollama launch pool --model laguna-xs.2
```

### Feedback and issues

Submit feedback with `/feedback` and read the [full documentation on GitHub](https://github.com/poolsideai/pool).

*By downloading pool, you agree to Poolside's [End User License Agreement (EULA)](https://poolside.ai/legal/eula).*

### Local deployment
Laguna XS.2 is supported on vLLM, Transformers v5, TRT-LLM, Ollama, and mlx-lm. We would like to thank the teams at NVIDIA, Ollama, and Newya Labs.

[Device frameworks: Ollama, mlx-lm, ...]

#### vLLM

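A minimal sketch of serving the model with vLLM's OpenAI-compatible server; the `--default-chat-template-kwargs` flag mirrors the reasoning examples below, and exact flags may vary by vLLM version:

```
# Serve an OpenAI-compatible endpoint with thinking enabled by default.
vllm serve poolside/Laguna-XS.2 \
  --default-chat-template-kwargs '{"enable_thinking": true}'
```

The examples in the Controlling reasoning section below then work against `http://localhost:8000/v1`.
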
#### Transformers

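A minimal sketch with Transformers v5; the `enable_thinking` template kwarg mirrors the `chat_template_kwargs` used in the API examples below and is an assumption about this model's chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "poolside/Laguna-XS.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a function that merges two sorted lists."}]
# enable_thinking is forwarded to the chat template (assumed; see the API examples below).
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
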

#### [Other frameworks]

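As one example on Apple silicon, a minimal sketch with mlx-lm; the MLX checkpoint name here is hypothetical:

```python
# Requires: pip install mlx-lm (Apple silicon only).
from mlx_lm import load, generate

# Hypothetical MLX-converted checkpoint; check the poolside org on Hugging Face.
model, tokenizer = load("poolside/Laguna-XS.2-MLX-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a binary search in Python."}],
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```
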
## Controlling reasoning

Laguna XS.2 has native reasoning support and is designed to work best with *preserved thinking*, where the `reasoning` content from prior assistant messages is kept in the message history. The model will generally reason before calling tools and between tool calls.

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.poolside.ai/v1",
    api_key="...",
)

model = "poolside/laguna-xs.2"

tools = [{"type": "function", "function": {
    "name": "shell",
    "description": "Execute a bash command and return the output.",
    "parameters": {"type": "object", "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]},
}}]

messages = [
    {"role": "system", "content": "You are a coding agent with access to a shell tool."},
    {"role": "user", "content": "Run uname -a"},
]

# Thinking is enabled by default when the server sets --default-chat-template-kwargs '{"enable_thinking": true}'.
# When using the Poolside API (https://inference.poolside.ai/v1), this flag is set by default.
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

# Accumulate the streamed reasoning, content, and tool-call deltas.
reasoning, content, tool_calls = "", "", []
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning") and delta.reasoning:  # some servers name this field `reasoning_content`
        reasoning += delta.reasoning
    if hasattr(delta, "content") and delta.content:
        content += delta.content
    if hasattr(delta, "tool_calls") and delta.tool_calls:
        for tc in delta.tool_calls:
            if tc.index >= len(tool_calls):
                tool_calls.append({"id": tc.id, "function": {"name": "", "arguments": ""}})
            if tc.function.name:
                tool_calls[tc.index]["function"]["name"] = tc.function.name
            if tc.function.arguments:
                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments

print(f"Reasoning: {reasoning}\nContent: {content}\nTool calls: {tool_calls}\n")

# Return reasoning in the next request for best performance.
messages.append({
    "role": "assistant",
    "content": content,
    "reasoning": reasoning,
    "tool_calls": [{"id": tc["id"], "type": "function", "function": tc["function"]} for tc in tool_calls]
})

messages.append({
    "role": "tool",
    "tool_call_id": tool_calls[0]["id"],
    "content": json.dumps({"stdout": "Darwin arm64", "exit_code": "0"})
})

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    stream=True,
)

reasoning, content = "", ""
for chunk in response:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning") and delta.reasoning:
        reasoning += delta.reasoning
    if hasattr(delta, "content") and delta.content:
        content += delta.content

print(f"Reasoning: {reasoning}\nContent: {content}")
```

### Disabling reasoning

You can disable thinking by setting `enable_thinking` to `False` in a request, or by not passing `--default-chat-template-kwargs {"enable_thinking": true}` (or the equivalent) when starting the server.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.poolside.ai/v1",
    api_key="...",
)

completion = client.chat.completions.create(
    model="poolside/laguna-xs.2",
    messages=[
        {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
    ],
    # Disable thinking for this request only.
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
    },
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta)
```

For agentic coding use cases, we recommend enabling thinking and preserving reasoning in the message history, as outlined in the [Controlling reasoning](#controlling-reasoning) section above.

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt).

[TBC: You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.]