DJLougen commited on
Commit
ef339e8
·
verified ·
1 Parent(s): c352f80

Update model card: add Canadian lab mission, Ko-fi, patched llama.cpp fork, quant table

Browse files
Files changed (1) hide show
  1. README.md +42 -100
README.md CHANGED
@@ -1,132 +1,74 @@
1
- ---
2
- base_model: GestaltLabs/Ornstein-3.6-27B-RYS
3
- base_model_relation: quantized
4
- tags:
5
- - gguf
6
- - llama.cpp
7
- - qwen3_5
8
- - qwen3.6
9
- - multimodal
10
- - rys
11
- - quantized
12
- - 8-bit
13
- - 6-bit
14
- - 4-bit
15
- - 3-bit
16
- - 2-bit
17
- license: apache-2.0
18
- language:
19
- - en
20
- ---
21
-
22
- ![Ornstein-3.6-27B-RYS](ChatGPT%20Image%20Apr%2024,%202026,%2006_52_24%20PM(1).png)
23
-
24
  # Ornstein-3.6-27B-RYS-GGUF
25
 
26
- GGUF quantizations for **Ornstein-3.6-27B-RYS** — the RYS-modified variant of Ornstein-3.6-27B with 66 layers (layers 22 & 23 duplicated, zero weight change).
27
-
28
- See the base model repo for full details: **[GestaltLabs/Ornstein-3.6-27B-RYS](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS)**
29
-
30
- ---
31
 
32
- ## What is RYS?
33
 
34
- **Repeat-Your-Self (RYS)** — Ng, David Noel (2026) is a zero-training architecture modification for deep transformers. By duplicating a contiguous slice of layers, the model revisits an earlier representation mid-pass, effectively deepening the network without changing any weights.
35
 
36
- For Ornstein-3.6-27B, the optimal config is **i=22, j=24**:
37
- ```
38
- new_layer_order = [0..23, 22, 23, 24..63] → 66 layers total
39
- ```
40
 
41
- **Sweep results**: 2,080 configs evaluated, combined delta = **+0.3223** (math +0.010, IFO +0.312).
42
 
43
- ---
44
 
45
- ## Available Quants
46
 
47
- | Quant | Size | BPW | VRAM Required | Best For |
48
- |-------|------|-----|---------------|----------|
49
- | **Q8_0** | ~27.4 GB | 8.0 | ~32 GB | Maximum quality, near-lossless |
50
- | **Q6_K** | ~21.1 GB | 6.0 | ~26 GB | Strong quality, large context |
51
- | **Q4_K_M** | ~15.8 GB | 4.0 | ~20 GB | **Balanced default**, 24 GB cards |
52
- | **Q3_K_M** | ~13.0 GB | 3.0 | ~17 GB | Budget 24 GB VRAM |
53
- | **Q2_K** | ~10.7 GB | 2.0 | ~15 GB | Extreme budget, CPU offload |
54
 
55
- > **Recommendation**: Q4_K_M is the sweet spot for most users. Q8_0 if you have the VRAM and want minimal quantization error.
 
 
 
 
 
 
56
 
57
- ---
58
 
59
- ## Usage (llama.cpp)
60
-
61
- ```bash
62
- # Download a quant
63
- wget https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF/resolve/main/Ornstein-3.6-27B-RYS-Q4_K_M.gguf
64
-
65
- # Run inference
66
- ./llama-cli -m Ornstein-3.6-27B-RYS-Q4_K_M.gguf \
67
- -p "Explain RYS layer duplication in one sentence." \
68
- -n 512 --temp 0.6
69
- ```
70
 
71
- ### With llama-server (OpenAI-compatible API)
72
 
73
  ```bash
74
- ./llama-server -m Ornstein-3.6-27B-RYS-Q4_K_M.gguf \
75
- --host 0.0.0.0 --port 8080 \
76
- -c 32768 --flash-attn
 
 
77
  ```
78
 
79
- ### Recommended settings per quant
80
-
81
- | Quant | Context | Flash Attention | Offload layers |
82
- |-------|---------|-----------------|----------------|
83
- | Q8_0 | 4096+ | Yes | Full GPU |
84
- | Q6_K | 8192+ | Yes | Full GPU |
85
- | Q4_K_M | 16384+ | Yes | Full GPU |
86
- | Q3_K_M | 8192 | Yes | CPU offload if needed |
87
- | Q2_K | 4096 | Recommended | CPU + GPU mix |
88
-
89
- ---
90
 
91
- ## Architecture Details
92
 
93
- - **Model**: Qwen 3.6 27B dense (hybrid attention)
94
- - **Layers**: 66 (64 original + 2 duplicated)
95
- - **Hidden size**: 5120
96
- - **Attention**: Gated Delta Net (linear) + full SDPA, full every 4 layers
97
- - **Heads**: 24 full / 4 KV / head_dim 256
98
- - **Context**: 262,144 tokens
99
- - **Vocab**: 248,320 tokens
100
- - **License**: Apache 2.0
101
 
102
- ---
103
 
104
- ## Citation
105
-
106
- ```bibtex
107
- @online{ng2026rys,
108
- author = {Ng, David Noel},
109
- title = {Retained-You-Seek (RYS)},
110
- year = {2026},
111
- url = {https://dnhkng.github.io/posts/rys/}
112
- }
113
  ```
114
 
115
- Base model: [GestaltLabs/Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B)
116
- Base architecture: [unsloth/Qwen3.6-27B](https://huggingface.co/unsloth/Qwen3.6-27B)
 
 
 
117
 
118
- ---
119
 
120
  ## Support This Work
121
 
122
- I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
123
 
124
  **[Support on Ko-fi](https://ko-fi.com/djlougen)**
125
 
126
- ---
127
-
128
  ## License
129
 
130
- Apache 2.0 — inherited from Qwen 3.6 and Ornstein-3.6-27B.
131
-
132
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  # Ornstein-3.6-27B-RYS-GGUF
2
 
3
+ GGUF quantizations of [GestaltLabs/Ornstein-3.6-27B-RYS](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS) the RYS-enhanced dense Ornstein model.
 
 
 
 
4
 
5
+ ## About GestaltLabs
6
 
7
+ We are a proudly Canadian research collective working to advance **sovereign Canadian AI** — open-weight models that Canadians (and everyone else) can run locally, study, and build on without dependence on closed foreign APIs. All training, fine-tuning, and quantization is done on local and self-funded compute. By supporting this work, you help keep frontier model development accessible, transparent, and under Canadian stewardship.
8
 
9
+ ## ⚠️ Requires Patched llama.cpp
 
 
 
10
 
11
+ RYS duplicates layer 33, which breaks the hardcoded attention-interval logic in stock llama.cpp. **Stock llama.cpp, Ollama, LM Studio, and similar tools will fail to load these GGUFs.**
12
 
13
+ **Use this patched fork:** https://github.com/DJLougen/llama.cpp/tree/rys-qwen35
14
 
15
+ The fork is fully backward-compatible with non-RYS Qwen3.5 models. It now also includes an **SSM tensor probing fallback**, so even legacy GGUFs load correctly without per-layer metadata.
16
 
17
+ ## Available Quantizations
 
 
 
 
 
 
18
 
19
+ | File | Quant | Size | Notes |
20
+ |------|-------|------|-------|
21
+ | `ornstein-3.6-27b-rys-q8_0.gguf` | Q8_0 | ~27 GB | Near-lossless |
22
+ | `ornstein-3.6-27b-rys-q6_k.gguf` | Q6_K | ~21 GB | Very high quality |
23
+ | `ornstein-3.6-27b-rys-q5_k_m.gguf` | Q5_K_M | ~18 GB | Strong balance |
24
+ | `ornstein-3.6-27b-rys-q4_k_m.gguf` | Q4_K_M | ~16 GB | Recommended default |
25
+ | `ornstein-3.6-27b-rys-q3_k_m.gguf` | Q3_K_M | ~12 GB | Low-memory option |
26
 
27
+ Sizes are approximate (exact depends on tokenizer overhead).
28
 
29
+ ## Quick Start
 
 
 
 
 
 
 
 
 
 
30
 
31
+ ### 1. Build the patched llama.cpp
32
 
33
  ```bash
34
+ git clone https://github.com/DJLougen/llama.cpp.git
35
+ cd llama.cpp
36
+ git checkout rys-qwen35
37
+ cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
38
+ cmake --build build -j
39
  ```
40
 
41
+ Drop `-DGGML_CUDA=ON` for CPU-only.
 
 
 
 
 
 
 
 
 
 
42
 
43
+ ### 2. Download a GGUF
44
 
45
+ Grab your preferred quant from the **Files** tab.
 
 
 
 
 
 
 
46
 
47
+ ### 3. Launch llama-server
48
 
49
+ ```bash
50
+ ./build/bin/llama-server \
51
+ -m ornstein-3.6-27b-rys-q4_k_m.gguf \
52
+ --host 0.0.0.0 --port 8080 \
53
+ --n-gpu-layers 99 --ctx-size 131072 \
54
+ --flash-attn on --jinja \
55
+ -ctk q4_0 -ctv q4_0
 
 
56
  ```
57
 
58
+ * `--jinja` enables the Qwen3 thinking chat template.
59
+ * `--flash-attn on` is recommended for long contexts.
60
+ * `-ctk q4_0 -ctv q4_0` quantizes KV cache to 4-bit.
61
+
62
+ ## Note: Thinking Model
63
 
64
+ This is a Qwen3-Thinking derivative. If you see raw `<think>...</think>` blocks appearing inline in responses, ensure `--jinja` is enabled. Recent llama.cpp builds default `--reasoning-format deepseek`, which splits reasoning into a separate `reasoning_content` field.
65
 
66
  ## Support This Work
67
 
68
+ Our training compute is entirely self-funded. If this model is useful to you, consider supporting the lab:
69
 
70
  **[Support on Ko-fi](https://ko-fi.com/djlougen)**
71
 
 
 
72
  ## License
73
 
74
+ Apache 2.0