---
license: apache-2.0
base_model: poolside/Laguna-XS.2
tags:
- gguf
- llama.cpp
- moe
- code
- quantized
---

# Laguna-XS.2 GGUF (BF16 + Q4_K_M)

GGUF conversions of [poolside/Laguna-XS.2](https://huggingface.co/poolside/Laguna-XS.2) for use with [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Files

| File | Quant | Size | BPW | Notes |
|------|-------|------|-----|-------|
| `laguna-xs2-bf16.gguf` | BF16 | 62.3 GiB | 16.01 | reference, identical math to HF transformers fp/bf16 |
| `laguna-xs2-Q4_K_M.gguf` | Q4_K_M | 18.88 GiB | 4.85 | imatrix-calibrated, fits a single 24 GB GPU |
| `laguna-xs2.imatrix` | imatrix | 180 MB | — | Bartowski calibration_datav3 (134 chunks, 68608 tokens) |

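If you only want a single file rather than the whole repo, `huggingface-cli` (from `huggingface_hub`) can fetch files individually. A minimal sketch, with `REPO_ID` standing in for this repository's id on the Hub:

```bash
# Hypothetical download sketch: replace REPO_ID with this repository's id.
huggingface-cli download REPO_ID laguna-xs2-Q4_K_M.gguf --local-dir .

# The imatrix and the BF16 reference can be fetched the same way if needed.
huggingface-cli download REPO_ID laguna-xs2.imatrix --local-dir .
```
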
## Quality

| Metric | BF16 | Q4_K_M | Δ |
|--------|------|--------|---|
| Perplexity (Bartowski v3, 20×512) | 10.7594 ± 0.522 | 11.2854 ± 0.553 | +4.9% |

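These numbers should be reproducible with llama.cpp's `llama-perplexity` tool. A sketch, assuming the corpus is the same `calibration_datav3.txt` and that "20×512" maps to 20 chunks at a 512-token context:

```bash
# Perplexity over 20 chunks of 512 tokens (the "20x512" setting in the table).
./llama-perplexity -m laguna-xs2-Q4_K_M.gguf \
    -f calibration_datav3.txt -c 512 --chunks 20 -ngl 99
```
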
Imatrix calibration uses Bartowski `calibration_datav3.txt` (multilingual + code mix), the same corpus Unsloth-distributed quants use.

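A typical way to regenerate the imatrix and the Q4_K_M file from the BF16 conversion looks like the following (a sketch of the standard llama.cpp imatrix workflow; the exact flags used for this repo may differ, and a LAGUNA-patched build is required, see the next section):

```bash
# Collect importance-matrix statistics from the calibration corpus.
./llama-imatrix -m laguna-xs2-bf16.gguf -f calibration_datav3.txt \
    -o laguna-xs2.imatrix -ngl 99

# Quantize BF16 -> Q4_K_M using the importance matrix.
./llama-quantize --imatrix laguna-xs2.imatrix \
    laguna-xs2-bf16.gguf laguna-xs2-Q4_K_M.gguf Q4_K_M
```
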
## Required llama.cpp patch

Laguna-XS.2 is a new architecture (`LLM_ARCH_LAGUNA`) that is not present in upstream llama.cpp. Loading these GGUFs requires a llama.cpp build with the LAGUNA arch added. Patches are available at https://github.com/your-org/lucebox-hub (see `dflash/deps/llama.cpp/src/models/laguna.cpp` and related changes).

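Before debugging load errors, it can help to confirm a file actually carries the new architecture tag. A sketch using the `gguf-dump` script from llama.cpp's `gguf-py` package (the exact string stored in `general.architecture` is whatever the conversion wrote, presumably something like `laguna`):

```bash
# Print GGUF header metadata and look for the architecture key; requires `pip install gguf`.
gguf-dump laguna-xs2-Q4_K_M.gguf | grep -m1 "general.architecture"
```
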
Architecture summary:
- 40 layers, hybrid iSWA: pattern (full, sw, sw, sw) × 10, sliding window 512
- Per-layer head count: 48 (full) / 64 (sliding); 8 KV heads always
- Q-norm + K-norm RMSNorm at head_dim level
- Per-head softplus attention gate
- RoPE per layer type: full layers use YaRN (theta=500K, factor=32, partial_rotary=0.5); sliding layers use the default (theta=10K, partial_rotary=1.0)
- 256 experts, top-8, sigmoid router with score-correction bias, sum-normalize, scale=2.5
- Always-on shared expert (intermediate=512)
- Dense layer 0, sparse MoE layers 1–39

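Assuming you have a llama.cpp checkout with the LAGUNA patches from the repo above already applied, a standard CUDA build is sufficient (a generic sketch, not specific to this model):

```bash
# Build the patched llama.cpp tree with CUDA offload support.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Binaries such as llama-simple and llama-perplexity end up under build/bin/.
```
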
## Quick test

```bash
./llama-simple -m laguna-xs2-Q4_K_M.gguf -ngl 99 -n 128 "def fibonacci(n):"
```

Tested on an RTX 3090 (24 GB) and an A100 (80 GB). Q4_K_M inference runs at about 154 tok/s on the A100 SXM.
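
For longer sessions, the same patched build's `llama-server` exposes an OpenAI-compatible HTTP endpoint. A sketch (context size and port are arbitrary choices here):

```bash
# Serve the Q4_K_M quant with full GPU offload on port 8080.
./llama-server -m laguna-xs2-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080
```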