DJLougen commited on
Commit
136e3db
·
verified ·
1 Parent(s): 0bf912f

Fix README: use ornstein3.6-27b.png image, move Ko-fi up, add YAML metadata

Browse files
Files changed (1) hide show
  1. README.md +37 -29
README.md CHANGED
@@ -16,6 +16,8 @@ tags:
16
  - sovereign-ai
17
  ---
18
 
 
 
19
  # Ornstein-3.6-27B-RYS-GGUF
20
 
21
  GGUF quantizations of [GestaltLabs/Ornstein-3.6-27B-RYS](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS) — the RYS-enhanced dense Ornstein model.
@@ -24,29 +26,49 @@ GGUF quantizations of [GestaltLabs/Ornstein-3.6-27B-RYS](https://huggingface.co/
24
 
25
  We are a proudly Canadian research collective working to advance **sovereign Canadian AI** — open-weight models that Canadians (and everyone else) can run locally, study, and build on without dependence on closed foreign APIs. All training, fine-tuning, and quantization is done on local and self-funded compute. By supporting this work, you help keep frontier model development accessible, transparent, and under Canadian stewardship.
26
 
27
- ## ⚠️ Requires Patched llama.cpp
 
 
28
 
29
- RYS duplicates layer 33, which breaks the hardcoded attention-interval logic in stock llama.cpp. **Stock llama.cpp, Ollama, LM Studio, and similar tools will fail to load these GGUFs.**
30
 
31
- **Use this patched fork:** https://github.com/DJLougen/llama.cpp/tree/rys-qwen35
 
 
32
 
33
- The fork is fully backward-compatible with non-RYS Qwen3.5 models. It now also includes an **SSM tensor probing fallback**, so even legacy GGUFs load correctly without per-layer metadata.
 
 
 
 
34
 
35
  ## Available Quantizations
36
 
37
  | File | Quant | Size | Notes |
38
  |------|-------|------|-------|
39
- | `ornstein-3.6-27b-rys-q8_0.gguf` | Q8_0 | ~27 GB | Near-lossless |
40
  | `ornstein-3.6-27b-rys-q6_k.gguf` | Q6_K | ~21 GB | Very high quality |
41
- | `ornstein-3.6-27b-rys-q5_k_m.gguf` | Q5_K_M | ~18 GB | Strong balance |
42
  | `ornstein-3.6-27b-rys-q4_k_m.gguf` | Q4_K_M | ~16 GB | Recommended default |
43
  | `ornstein-3.6-27b-rys-q3_k_m.gguf` | Q3_K_M | ~12 GB | Low-memory option |
44
 
45
- Sizes are approximate (exact depends on tokenizer overhead).
 
 
 
 
 
 
 
 
 
 
 
 
46
 
47
- ## Quick Start
48
 
49
- ### 1. Build the patched llama.cpp
50
 
51
  ```bash
52
  git clone https://github.com/DJLougen/llama.cpp.git
@@ -56,15 +78,15 @@ cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
56
  cmake --build build -j
57
  ```
58
 
59
- Drop `-DGGML_CUDA=ON` for CPU-only.
60
 
61
- ### 2. Download a GGUF
62
-
63
- Grab your preferred quant from the **Files** tab.
64
-
65
- ### 3. Launch llama-server
66
 
67
  ```bash
 
 
 
 
68
  ./build/bin/llama-server \
69
  -m ornstein-3.6-27b-rys-q4_k_m.gguf \
70
  --host 0.0.0.0 --port 8080 \
@@ -73,20 +95,6 @@ Grab your preferred quant from the **Files** tab.
73
  -ctk q4_0 -ctv q4_0
74
  ```
75
 
76
- * `--jinja` enables the Qwen3 thinking chat template.
77
- * `--flash-attn on` is recommended for long contexts.
78
- * `-ctk q4_0 -ctv q4_0` quantizes KV cache to 4-bit.
79
-
80
- ## Note: Thinking Model
81
-
82
- This is a Qwen3-Thinking derivative. If you see raw `<think>...</think>` blocks appearing inline in responses, ensure `--jinja` is enabled. Recent llama.cpp builds default `--reasoning-format deepseek`, which splits reasoning into a separate `reasoning_content` field.
83
-
84
- ## Support This Work
85
-
86
- Our training compute is entirely self-funded. If this model is useful to you, consider supporting the lab:
87
-
88
- **[Support on Ko-fi](https://ko-fi.com/djlougen)**
89
-
90
  ## License
91
 
92
  Apache 2.0
 
16
  - sovereign-ai
17
  ---
18
 
19
+ [![Ornstein-3.6-27B-RYS](/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF/resolve/main/ornstein3.6-27b.png)](/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF/blob/main/ornstein3.6-27b.png)
20
+
21
  # Ornstein-3.6-27B-RYS-GGUF
22
 
23
  GGUF quantizations of [GestaltLabs/Ornstein-3.6-27B-RYS](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS) — the RYS-enhanced dense Ornstein model.
 
26
 
27
  We are a proudly Canadian research collective working to advance **sovereign Canadian AI** — open-weight models that Canadians (and everyone else) can run locally, study, and build on without dependence on closed foreign APIs. All training, fine-tuning, and quantization is done on local and self-funded compute. By supporting this work, you help keep frontier model development accessible, transparent, and under Canadian stewardship.
28
 
29
+ ## Important: requires a patched llama.cpp
30
+
31
+ RYS duplicates one of the middle layers, which breaks the hardcoded `full_attention_interval = 4` assumption in stock llama.cpp's Qwen3.5 loader. These GGUFs are re-converted with **per-layer `head_count_kv` baked in**, and you need a llama.cpp that reads that per-layer metadata instead of falling back to the interval formula.
32
 
33
+ **Patched fork:** [https://github.com/DJLougen/llama.cpp](https://github.com/DJLougen/llama.cpp) (default branch `rys-qwen35`, one commit on top of `ggml-org/llama.cpp@d00685831`, fully backward-compatible).
34
 
35
+ Stock llama.cpp, Ollama, LM Studio, and any other inference runtime built on stock llama.cpp will currently fail to load these files with a `check_tensor_dims` error on `blk.33` — this is expected until/unless the patch is upstreamed.
36
+
37
+ ## Support This Work
38
 
39
+ Our training compute is entirely self-funded. If this model is useful to you, consider supporting the lab:
40
+
41
+ **[Support on Ko-fi](https://ko-fi.com/djlougen)**
42
+
43
+ * * *
44
 
45
  ## Available Quantizations
46
 
47
  | File | Quant | Size | Notes |
48
  |------|-------|------|-------|
49
+ | `ornstein-3.6-27b-rys-q8_0.gguf` | Q8_0 | ~27 GB | Near-lossless, largest |
50
  | `ornstein-3.6-27b-rys-q6_k.gguf` | Q6_K | ~21 GB | Very high quality |
51
+ | `ornstein-3.6-27b-rys-q5_k_m.gguf` | Q5_K_M | ~18 GB | Strong quality/size balance |
52
  | `ornstein-3.6-27b-rys-q4_k_m.gguf` | Q4_K_M | ~16 GB | Recommended default |
53
  | `ornstein-3.6-27b-rys-q3_k_m.gguf` | Q3_K_M | ~12 GB | Low-memory option |
54
 
55
+ ## Model Lineage
56
+
57
+ ```
58
+ Qwen 3.6 27B → Ornstein3.6 (DDM fine-tune) → RYS (layer 33 dup, +49%)
59
+ ```
60
+
61
+ ## Model Details
62
+
63
+ * **Architecture:** Qwen3.5 dense
64
+ * **Parameters:** ~27B active
65
+ * **Layers:** 65 (64 original + 1 RYS-duplicated layer 33)
66
+ * **Context:** 131,072 tokens
67
+ * **GGUF metadata:** per-layer `head_count_kv` array encoding the RYS-shifted attention pattern
68
 
69
+ ## Usage
70
 
71
+ ### Build the patched llama.cpp
72
 
73
  ```bash
74
  git clone https://github.com/DJLougen/llama.cpp.git
 
78
  cmake --build build -j
79
  ```
80
 
81
+ Drop `-DGGML_CUDA=ON` for a CPU-only build. The patch touches the GGUF loader and three model forward files; backend selection is independent.
82
 
83
+ ### Download + run
 
 
 
 
84
 
85
  ```bash
86
+ hf download GestaltLabs/Ornstein-3.6-27B-RYS-GGUF \
87
+ ornstein-3.6-27b-rys-q4_k_m.gguf \
88
+ --local-dir .
89
+
90
  ./build/bin/llama-server \
91
  -m ornstein-3.6-27b-rys-q4_k_m.gguf \
92
  --host 0.0.0.0 --port 8080 \
 
95
  -ctk q4_0 -ctv q4_0
96
  ```
97
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
98
  ## License
99
 
100
  Apache 2.0