joerowell committed
Commit 94107a2 · verified · 1 Parent(s): 7a9028a

Sync bundled HF code with upstream Laguna PR (v5 schema)

Files changed (1)
  1. configuration_laguna.py +172 -146
configuration_laguna.py CHANGED
@@ -1,5 +1,4 @@
-# ruff: noqa
-# Copyright 2025 Poolside and the HuggingFace Inc. team. All rights reserved.
+# Copyright 2026 Poolside and the HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,79 +11,44 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from typing import Any, Literal
+
+from huggingface_hub.dataclasses import strict
+
 from transformers.configuration_utils import PreTrainedConfig
 from transformers.modeling_rope_utils import RopeParameters
+from transformers.utils import auto_docstring


+@auto_docstring(checkpoint="poolside/laguna-XS.2")
+@strict
 class LagunaConfig(PreTrainedConfig):
     r"""
-    Configuration class for Laguna model.
-
-    Laguna is Poolside's MoE architecture with:
-    - Attention output gating (softplus gate)
-    - Sigmoid routing instead of softmax
-    - No QKV bias
-    - Explicit head_dim parameter
-
-    Args:
-        head_dim (`int`, *optional*, defaults to 128):
-            Dimension of attention heads. Laguna uses explicit head_dim rather than
-            computing it from hidden_size // num_attention_heads.
-        qkv_bias (`bool`, *optional*, defaults to `False`):
-            Whether to add bias to QKV projections. Laguna uses no QKV bias.
-        attention_bias (`bool`, *optional*, defaults to `False`):
-            Whether to add bias to attention output projection. Laguna uses no attention bias.
-        gating (`bool`, *optional*, defaults to `True`):
-            Whether to use softplus output gating on attention. When True, a g_proj linear
-            layer is added and attn_output = attn_output * softplus(g_proj(x)).
-        sliding_window (`int`, *optional*):
-            Sliding window attention size. Used by layers whose type in ``layer_types``
-            is ``"sliding_attention"``. When ``None``, all layers use full attention.
-        layer_types (`list[str]`, *optional*):
-            Per-layer attention type. Each element should be ``"sliding_attention"`` or
-            ``"global_attention"``. Length must equal ``num_hidden_layers``. When ``None``,
-            all layers default to global attention.
-        swa_attention_sink_enabled (`bool`, *optional*, defaults to `False`):
-            Whether to enable learnable attention sinks on sliding-window attention layers.
-            When enabled, a per-head bias parameter is added that allows the model to attend
-            to position 0 even when it falls outside the sliding window.
-        swa_rope_parameters (`RopeParameters`, *optional*):
-            Separate RoPE configuration for sliding-window attention layers. When ``None``,
-            SWA layers use the same RoPE as global attention layers.
-        vocab_size (`int`, *optional*, defaults to 100352):
-            Vocabulary size of the Laguna model.
-        hidden_size (`int`, *optional*, defaults to 2048):
-            Dimension of the hidden representations.
-        intermediate_size (`int`, *optional*, defaults to 8192):
-            Dimension of the MLP representations for dense layers.
-        num_hidden_layers (`int`, *optional*, defaults to 48):
-            Number of hidden layers in the Transformer.
-        num_attention_heads (`int`, *optional*, defaults to 32):
-            Number of attention heads.
-        num_key_value_heads (`int`, *optional*, defaults to 8):
-            Number of key-value heads for GQA.
-        max_position_embeddings (`int`, *optional*, defaults to 4096):
-            Maximum sequence length.
-        rms_norm_eps (`float`, *optional*, defaults to 1e-6):
-            Epsilon for RMSNorm layers.
-        num_experts (`int`, *optional*, defaults to 256):
-            Number of routed experts.
-        num_experts_per_tok (`int`, *optional*, defaults to 16):
-            Number of experts selected per token (top-k).
-        moe_intermediate_size (`int`, *optional*, defaults to 1024):
-            Intermediate size of routed experts.
-        shared_expert_intermediate_size (`int`, *optional*, defaults to 1024):
-            Intermediate size of the shared expert.
-        norm_topk_prob (`bool`, *optional*, defaults to `True`):
-            Whether to normalize top-k routing probabilities.
-        decoder_sparse_step (`int`, *optional*, defaults to 1):
-            Frequency of MoE layers (1 = every layer is MoE after mlp_only_layers).
-        mlp_only_layers (`list[int]`, *optional*, defaults to `[0]`):
-            Layer indices that use dense MLP instead of MoE.
-        router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
-            Auxiliary loss coefficient for load balancing.
-        rope_parameters (`RopeParameters`, *optional*):
-            RoPE configuration. Defaults to rope_theta=500000.0.
+    partial_rotary_factor (`float`, *optional*):
+        Fraction of ``head_dim`` to rotate. Folded into each ``rope_parameters[layer_type]``
+        entry by ``__post_init__``.
+    num_attention_heads_per_layer (`list[int]`, *optional*):
+        Per-layer override for ``num_attention_heads``. Length must equal ``num_hidden_layers``.
+    mlp_layer_types (`list[str]`, *optional*):
+        Per-layer MLP type — ``"dense"`` or ``"sparse"``. Length must equal
+        ``num_hidden_layers``. Defaults to first layer dense, rest sparse.
+    moe_routed_scaling_factor (`float`, *optional*, defaults to 1.0):
+        Scalar applied to routed-expert output before combining with the shared-expert output.
+    moe_apply_router_weight_on_input (`bool`, *optional*, defaults to `False`):
+        Whether to apply router weights to the MoE input rather than the output. Not supported
+        in transformers yet; ``True`` will raise a ``NotImplementedError`` for now.
+    moe_router_logit_softcapping (`float`, *optional*, defaults to 0.0):
+        Scaling factor used when applying tanh softcapping to the MoE router logits.
+
+    Example:
+
+    ```python
+    >>> from transformers import LagunaModel, LagunaConfig
+
+    >>> configuration = LagunaConfig()
+    >>> model = LagunaModel(configuration)
+    >>> configuration = model.config
+    ```
     """

     model_type = "laguna"
@@ -93,11 +57,19 @@ class LagunaConfig(PreTrainedConfig):
         "layers.*.self_attn.q_proj": "colwise",
         "layers.*.self_attn.k_proj": "colwise",
         "layers.*.self_attn.v_proj": "colwise",
-        "layers.*.self_attn.g_proj": "colwise",  # Laguna-specific gating projection
+        "layers.*.self_attn.g_proj": "colwise",
         "layers.*.self_attn.o_proj": "rowwise",
+        "layers.*.self_attn.q_norm": "replicated_with_grad_allreduce",
+        "layers.*.self_attn.k_norm": "replicated_with_grad_allreduce",
         "layers.*.mlp.gate_proj": "colwise",
         "layers.*.mlp.up_proj": "colwise",
         "layers.*.mlp.down_proj": "rowwise",
+        "layers.*.mlp.experts.gate_up_proj": "packed_colwise",
+        "layers.*.mlp.experts.down_proj": "rowwise",
+        "layers.*.mlp.experts": "moe_tp_experts",
+        "layers.*.mlp.shared_experts.gate_proj": "colwise",
+        "layers.*.mlp.shared_experts.up_proj": "colwise",
+        "layers.*.mlp.shared_experts.down_proj": "rowwise",
     }
     base_model_pp_plan = {
         "embed_tokens": (["input_ids"], ["inputs_embeds"]),
@@ -105,83 +77,137 @@ class LagunaConfig(PreTrainedConfig):
         "norm": (["hidden_states"], ["hidden_states"]),
     }

-    def __init__(
-        self,
-        vocab_size: int = 100352,
-        hidden_size: int = 2048,
-        intermediate_size: int = 8192,
-        num_hidden_layers: int = 48,
-        num_attention_heads: int = 32,
-        num_key_value_heads: int = 8,
-        head_dim: int = 128,
-        qkv_bias: bool = False,
-        attention_bias: bool = False,
-        gating: bool = True,
-        hidden_act: str = "silu",
-        max_position_embeddings: int = 4096,
-        initializer_range: float = 0.02,
-        rms_norm_eps: float = 1e-6,
-        use_cache: bool = True,
-        tie_word_embeddings: bool = False,
-        rope_parameters: RopeParameters | dict[str, RopeParameters] | None = None,
-        attention_dropout: float = 0.0,
-        sliding_window: int | None = None,
-        layer_types: list[str] | None = None,
-        swa_attention_sink_enabled: bool = False,
-        swa_rope_parameters: RopeParameters | None = None,
-        num_experts: int = 256,
-        num_experts_per_tok: int = 16,
-        moe_intermediate_size: int = 1024,
-        shared_expert_intermediate_size: int = 1024,
-        norm_topk_prob: bool = True,
-        decoder_sparse_step: int = 1,
-        mlp_only_layers: list[int] | None = None,
-        router_aux_loss_coef: float = 0.001,
-        output_router_logits: bool = False,
-        **kwargs,
-    ):
-        # Default mlp_only_layers: first layer is dense (moe_first_k_dense_replace=1)
-        if mlp_only_layers is None:
-            mlp_only_layers = [0]
-
-        # Default rope_parameters with Laguna's theta
-        if rope_parameters is None:
-            rope_parameters = {"rope_type": "default", "rope_theta": 500000.0}
-
-        self.vocab_size = vocab_size
-        self.hidden_size = hidden_size
-        self.intermediate_size = intermediate_size
-        self.num_hidden_layers = num_hidden_layers
-        self.num_attention_heads = num_attention_heads
-        self.num_key_value_heads = num_key_value_heads
-        self.head_dim = head_dim
-        self.qkv_bias = qkv_bias
-        self.attention_bias = attention_bias
-        self.gating = gating
-        self.hidden_act = hidden_act
-        self.max_position_embeddings = max_position_embeddings
-        self.initializer_range = initializer_range
-        self.rms_norm_eps = rms_norm_eps
-        self.use_cache = use_cache
+    # Qwen2Moe-inherited defaults we want to override for Laguna's typical shape.
+    vocab_size: int = 100352
+    hidden_size: int = 2048
+    intermediate_size: int = 8192
+    num_hidden_layers: int = 40
+    num_attention_heads: int = 48
+    num_key_value_heads: int = 8
+    hidden_act: str = "silu"
+    max_position_embeddings: int = 131072
+    initializer_range: float = 0.02
+    rms_norm_eps: float = 1e-6
+    use_cache: bool = True
+    tie_word_embeddings: bool = False
+    rope_parameters: RopeParameters | dict | None = None
+    sliding_window: int | None = None
+    attention_dropout: float | int = 0.0
+    moe_intermediate_size: int = 512
+    shared_expert_intermediate_size: int = 512
+    num_experts_per_tok: int = 8
+    num_experts: int = 256
+    output_router_logits: bool = False
+    router_aux_loss_coef: float = 0.001
+    layer_types: list[str] | None = None
+    pad_token_id: int | None = None
+    bos_token_id: int | None = None
+    eos_token_id: int | list[int] | None = None
+
+    # Laguna-specific attention
+    head_dim: int = 128
+    attention_bias: bool = False
+    partial_rotary_factor: float | None = None
+    num_attention_heads_per_layer: list[int] | None = None
+    # Laguna-specific MoE
+    mlp_layer_types: list[str] | None = None
+    moe_routed_scaling_factor: float = 1.0
+    moe_apply_router_weight_on_input: bool = False
+    moe_router_logit_softcapping: float = 0.0
+
+    def __post_init__(self, **kwargs):
+        if self.layer_types is None:
+            self.layer_types = ["full_attention"] * self.num_hidden_layers
+        if self.mlp_layer_types is None:
+            self.mlp_layer_types = ["dense"] + ["sparse"] * (self.num_hidden_layers - 1)
+        if self.num_attention_heads_per_layer is None:
+            self.num_attention_heads_per_layer = [self.num_attention_heads] * self.num_hidden_layers
+
+        default_rope_params: dict[Literal["full_attention", "sliding_attention"], dict[str, Any]] = {
+            "full_attention": {"rope_type": "default", "rope_theta": 500000.0},
+            "sliding_attention": {"rope_type": "default", "rope_theta": 10000.0},
+        }
+        if self.rope_parameters is None:
+            self.rope_parameters = default_rope_params
+
+        self._normalize_rope_parameters()
+        # Skip ``Qwen2MoeConfig.__post_init__``: it references ``mlp_only_layers`` /
+        # ``use_sliding_window`` / ``max_window_layers``, which Laguna drops above.
+        super().__post_init__(**kwargs)
+
+    def _normalize_rope_parameters(self):
+        """Coerce ``rope_parameters`` to the nested ``{layer_type: {...}}`` shape.
+
+        Accepts an already-nested dict as-is, or a flat dict that gets broadcast to every
+        layer type. A top-level ``partial_rotary_factor`` is folded into each sub-dict as
+        a default.
+        """
+        layer_types = set(self.layer_types)
+        rope_params = self.rope_parameters or {}
+        is_nested = isinstance(rope_params, dict) and any(k in layer_types for k in rope_params)
+        if is_nested:
+            nested = {lt: dict(rope_params.get(lt, {})) for lt in layer_types}
+        else:
+            nested = {lt: dict(rope_params) for lt in layer_types}
+
+        if self.partial_rotary_factor is not None:
+            for params in nested.values():
+                params.setdefault("partial_rotary_factor", self.partial_rotary_factor)
+
+        for params in nested.values():
+            params.setdefault("rope_type", "default")
+
+        self.rope_parameters = nested
+        # Null the top-level field now that its value lives in each sub-dict — otherwise
+        # ``standardize_rope_params`` would overwrite per-type values with the global one.
+        self.partial_rotary_factor = None
+
+    def convert_rope_params_to_dict(self, **kwargs):
+        # No need to handle BC for new models, because they have no old-format `rope_scaling`
+        return kwargs
+
+    def _validate_yarn_rope_parameters(self, rope_parameters: dict, ignore_keys=None):
+        """Override: parent reads ``self.rope_parameters["original_max_position_embeddings"]``
+        for its post-hoc factor sanity-check, which works for flat rope configs but raises
+        ``KeyError`` when ``self.rope_parameters`` is the Laguna/Gemma3-style per-layer-type
+        map (its keys are layer types like ``"full_attention"``). Fix locally by reading
+        from the per-call ``rope_parameters`` dict that ``validate_rope`` already passes in.
+        """
+        # Delegate to parent for the shared checks by temporarily swapping in a flat
+        # ``self.rope_parameters`` that has the key the parent expects. Cheapest way to
+        # share the parent's logic without reimplementing it here.
+        flat = getattr(self, "rope_parameters", None)
         self.rope_parameters = rope_parameters
-        self.attention_dropout = attention_dropout
-        # Sliding window attention arguments
-        self.sliding_window = sliding_window
-        self.layer_types = layer_types
-        self.swa_attention_sink_enabled = swa_attention_sink_enabled
-        self.swa_rope_parameters = swa_rope_parameters
-        # MoE arguments
-        self.num_experts = num_experts
-        self.num_experts_per_tok = num_experts_per_tok
-        self.moe_intermediate_size = moe_intermediate_size
-        self.shared_expert_intermediate_size = shared_expert_intermediate_size
-        self.norm_topk_prob = norm_topk_prob
-        self.decoder_sparse_step = decoder_sparse_step
-        self.mlp_only_layers = mlp_only_layers
-        self.router_aux_loss_coef = router_aux_loss_coef
-        self.output_router_logits = output_router_logits
-
-        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
+        try:
+            super()._validate_yarn_rope_parameters(rope_parameters, ignore_keys=ignore_keys)
+        finally:
+            self.rope_parameters = flat
+
+    def validate_architecture(self):
+        """Part of ``@strict``-powered validation."""
+        if self.moe_apply_router_weight_on_input:
+            raise NotImplementedError(
+                "moe_apply_router_weight_on_input=True is not yet supported in the "
+                "transformers implementation of Laguna."
+            )
+        if (
+            self.num_attention_heads_per_layer is not None
+            and len(self.num_attention_heads_per_layer) != self.num_hidden_layers
+        ):
+            raise ValueError(
                f"num_attention_heads_per_layer length ({len(self.num_attention_heads_per_layer)}) "
                f"must equal num_hidden_layers ({self.num_hidden_layers})."
            )
+        if len(self.layer_types) != self.num_hidden_layers:
+            raise ValueError(
                f"layer_types length ({len(self.layer_types)}) "
                f"must equal num_hidden_layers ({self.num_hidden_layers})."
            )
+        if len(self.mlp_layer_types) != self.num_hidden_layers:
+            raise ValueError(
                f"mlp_layer_types length ({len(self.mlp_layer_types)}) "
                f"must equal num_hidden_layers ({self.num_hidden_layers})."
            )


 __all__ = ["LagunaConfig"]
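
For readers skimming the new `__post_init__` / `_normalize_rope_parameters` path above: a minimal standalone sketch of the flat-to-nested broadcast it performs. This re-implements the logic outside the class purely for illustration, and the example values are made up.

```python
from typing import Any


def normalize_rope(rope_params: dict[str, Any] | None,
                   layer_types: list[str],
                   partial_rotary_factor: float | None = None) -> dict[str, dict[str, Any]]:
    """Broadcast a flat rope dict to {layer_type: {...}}, mirroring _normalize_rope_parameters."""
    type_set = set(layer_types)
    rope_params = rope_params or {}
    # Already nested (keys are layer types) -> keep the per-type entries as-is.
    if any(key in type_set for key in rope_params):
        nested = {lt: dict(rope_params.get(lt, {})) for lt in type_set}
    else:
        # Flat (e.g. {"rope_theta": 500000.0}) -> same settings for every layer type.
        nested = {lt: dict(rope_params) for lt in type_set}
    for params in nested.values():
        if partial_rotary_factor is not None:
            params.setdefault("partial_rotary_factor", partial_rotary_factor)
        params.setdefault("rope_type", "default")
    return nested


# A flat dict is broadcast to both attention layer types; a nested one passes through.
print(normalize_rope({"rope_theta": 500000.0}, ["full_attention", "sliding_attention"], 0.5))
```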
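The removed docstring is the only place the softplus output gating is spelled out (`attn_output = attn_output * softplus(g_proj(x))`), while `g_proj` itself stays in the new `base_model_tp_plan`. A rough PyTorch sketch of that formula, with illustrative shapes taken from the config defaults; this is not the bundled modeling code.

```python
import torch
import torch.nn.functional as F

hidden_size, num_heads, head_dim = 2048, 48, 128   # defaults from the new config
g_proj = torch.nn.Linear(hidden_size, num_heads * head_dim, bias=False)

x = torch.randn(1, 16, hidden_size)                      # residual-stream input to attention
attn_output = torch.randn(1, 16, num_heads * head_dim)   # stand-in for the attention output
gated = attn_output * F.softplus(g_proj(x))              # softplus gate from the removed docstring
print(gated.shape)                                       # torch.Size([1, 16, 6144])
```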
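The docstrings only describe the router hyperparameters (sigmoid routing, top-k selection, tanh softcapping, routed scaling); the modeling code is not part of this diff. A hedged sketch of how such a router is commonly wired, using the config defaults where they exist and a made-up non-zero softcap.

```python
import torch

hidden_size, num_experts, top_k = 2048, 256, 8     # defaults from the new config
softcap = 30.0                                     # hypothetical; the default 0.0 disables capping
routed_scaling_factor = 1.0                        # moe_routed_scaling_factor default

tokens = torch.randn(4, hidden_size)               # 4 example tokens
router = torch.nn.Linear(hidden_size, num_experts, bias=False)

logits = router(tokens)
if softcap > 0:                                    # moe_router_logit_softcapping
    logits = softcap * torch.tanh(logits / softcap)
scores = torch.sigmoid(logits)                     # sigmoid routing, per the removed docstring
weights, expert_ids = scores.topk(top_k, dim=-1)   # num_experts_per_tok
weights = weights / weights.sum(-1, keepdim=True)  # renormalize the kept routing weights

# Selected experts' outputs would be weighted by `weights`, summed, multiplied by
# routed_scaling_factor, and added to the shared expert's output.
print(expert_ids.shape, weights.shape)
```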