DJLougen committed
Commit 48d2f79 · verified · 1 parent: cbd57f2

Update model card: add Canadian lab mission, Ko-fi, patched llama.cpp fork

Files changed (1)
  1. README.md +15 -130
README.md CHANGED
@@ -1,146 +1,31 @@
1
- ---
2
- base_model: GestaltLabs/Ornstein-3.6-27B
3
- base_model_relation: finetune
4
- datasets: []
5
- library_name: transformers
6
- license: apache-2.0
7
- pipeline_tag: image-text-to-text
8
- tags:
9
- - transformers
10
- - safetensors
11
- - qwen3_5
12
- - qwen3.6
13
- - multimodal
14
- - image-text-to-text
15
- - rys
16
- - layer-duplication
17
- - unsloth
18
- language:
19
- - en
20
- ---
21
-
22
- ![Ornstein-3.6-27B-RYS](Ornstein3.6-27B-RYS.png)
23
-
24
  # Ornstein-3.6-27B-RYS
25
 
26
- **Permanent RYS layer-duplication** of [Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B), the dense multimodal member of the Qwen 3.6 family with hybrid linear + full attention (Gated Delta Net).
27
-
28
- This model applies the optimal **Retained-You-Seek (RYS)** configuration discovered by an exhaustive sweep over all 2,080 valid duplication configs: **layers 22 and 23 are duplicated**, expanding the network from 64 to **66 layers** with zero weight modification.
29
-
30
- > **GGUF quantizations** (Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K) are available at **[GestaltLabs/Ornstein-3.6-27B-RYS-GGUF](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF)**.
31
-
32
- ---
33
-
34
- ## What is RYS?
35
-
36
- **Repeat-Your-Self (RYS)** — Ng, David Noel (2026) — is a zero-training architecture modification for deep transformers. By duplicating a contiguous slice of layers, the model revisits an earlier representation mid-pass, effectively deepening the network without changing any weights.
37
-
38
- The canonical form is:
39
- ```
40
- new_layer_order = [0, 1, ..., j-1, i, i+1, ..., j-1, j, j+1, ..., N-1]
41
- ```
42
- where `0 <= i < j <= N`.
43
-
44
- For Ornstein-3.6-27B, the optimal config is **i=22, j=24**:
45
- ```
46
- [0..23, 22, 23, 24..63] → 66 layers total
47
- ```
48
-
49
- ### Why it works
50
-
51
- The sweep evaluates each config on two fast benchmarks:
52
- - **Math** (GSM8k-style): measures reasoning stability
53
- - **IFO** (IFO-Scan): measures instruction-following fidelity
54
-
55
- The **combined delta** (math + IFO) is maximized. The winning config (i=22, j=24) scored `combined_delta = +0.3223`, with both math and IFO improving.
56
-
57
- ---
58
 
59
- ## Architecture
60
 
61
- | Property | Value |
62
- |----------|-------|
63
- | **Base model** | [GestaltLabs/Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B) |
64
- | **Base architecture** | `Qwen3_5ForConditionalGeneration` |
65
- | **Hidden size** | 5120 |
66
- | **Original layers** | 64 |
67
- | **RYS layers** | **66** (layers 22 & 23 duplicated) |
68
- | **Attention heads** | 24 full / 4 KV / head_dim 256 |
69
- | **Attention pattern** | Gated Delta Net (linear) + full SDPA, full every 4 layers |
70
- | **Context length** | 262,144 tokens |
71
- | **Parameters** | ~27.2B (minimal increase from 2 extra layer copies) |
72
- | **License** | Apache 2.0 |
73
 
74
- ### Layer type distribution (66 layers)
75
 
76
- The duplicated layers preserve their original types:
77
- - **Layers 22-23** (duplicated slice) are `linear_attention` layers
78
- - All other layers retain their original `linear_attention` / `full_attention` pattern
79
 
80
- ---
81
 
82
- ## Usage
83
 
84
- ### Transformers
85
 
86
- ```python
87
- from transformers import AutoModelForCausalLM, AutoTokenizer
88
 
89
- model_id = "GestaltLabs/Ornstein-3.6-27B-RYS"
90
- model = AutoModelForCausalLM.from_pretrained(
91
- model_id,
92
- torch_dtype="auto",
93
- device_map="auto",
94
- trust_remote_code=True,
95
- )
96
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
97
-
98
- prompt = "Solve step by step: A train leaves station A at 60 mph..."
99
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
100
- out = model.generate(**inputs, max_new_tokens=512)
101
- print(tokenizer.decode(out[0], skip_special_tokens=True))
102
- ```
103
-
104
- ### llama.cpp (GGUF)
105
-
106
- Grab a quant from the [GGUF repo](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF):
107
-
108
- | Quant | Size | Use case |
109
- |-------|------|----------|
110
- | **Q8_0** | ~29 GB | Maximum quality, 48 GB VRAM |
111
- | **Q6_K** | ~22 GB | Strong quality, 32-40 GB VRAM |
112
- | **Q4_K_M** | ~16 GB | Balanced, 24 GB VRAM |
113
- | **Q3_K_M** | ~9 GB | Budget 24 GB VRAM |
114
- | **Q2_K** | ~7 GB | Extreme budget, CPU offload |
115
-
116
- ```bash
117
- # Example with llama.cpp
118
- ./llama-cli -m Ornstein-3.6-27B-RYS-Q4_K_M.gguf -p "Explain RYS in one sentence."
119
- ```
120
-
121
- ---
122
-
123
- ## RYS Sweep Details
124
-
125
- - **Sweep space**: 2,080 configs (i < j, 0..63)
126
- - **Optimal config**: i=22, j=24
127
- - **Combined delta**: +0.3223
128
- - **Math delta**: +0.010
129
- - **IFO delta**: +0.312
130
- - **Citation**: Ng, David Noel (2026). *Retained-You-Seek*. https://dnhkng.github.io/posts/rys/
131
-
132
- ---
133
 
134
  ## Support This Work
135
 
136
- I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
137
 
138
  **[Support on Ko-fi](https://ko-fi.com/djlougen)**
139
-
140
- ---
141
-
142
- ## License
143
-
144
- Apache 2.0 — inherited from Qwen 3.6 and Ornstein-3.6-27B.
145
-
146
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
1
  # Ornstein-3.6-27B-RYS
2
 
3
+ RYS-enhanced variant of the Ornstein-3.6-27B dense model. Layer 33 is duplicated using the **Repeat Your Self (RYS)** method, improving reasoning and instruction-following performance without increasing active parameter count at inference time.
4
 
5
+ > **GGUF quantizations:** [GestaltLabs/Ornstein-3.6-27B-RYS-GGUF](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF)
6
 
7
+ ## About GestaltLabs
8
 
9
+ We are a proudly Canadian research collective working to advance **sovereign Canadian AI** — open-weight models that Canadians (and everyone else) can run locally, study, and build on without dependence on closed foreign APIs. All training, fine-tuning, and quantization is done on local and self-funded compute. By supporting this work, you help keep frontier model development accessible, transparent, and under Canadian stewardship.
10
 
11
+ ## Running Locally
12
 
13
+ This model requires a **patched llama.cpp** to load correctly. RYS breaks the hardcoded `full_attention_interval = 4` assumption in stock llama.cpp.
14
 
15
+ **Use this patched fork:** https://github.com/DJLougen/llama.cpp/tree/rys-qwen35
16
 
17
+ The fork now includes both per-layer `layer_types` support and an **SSM tensor probing fallback**, so even legacy GGUFs load correctly. It is fully backward-compatible with non-RYS Qwen3.5 models.
18
 
19
+ ## Model Details
 
20
 
21
+ * **Architecture:** Qwen3.5 dense
22
+ * **Parameters:** ~27B active
23
+ * **Layers:** 65 (64 original + 1 RYS-duplicated layer 33)
24
+ * **Context length:** 131,072 tokens
25
+ * **License:** Apache-2.0
26
 
27
  ## Support This Work
28
 
29
+ Our training compute is entirely self-funded. If this model is useful to you, consider supporting the lab:
30
 
31
  **[Support on Ko-fi](https://ko-fi.com/djlougen)**
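
Reviewer note: the layer duplication described in this diff can be sketched in a few lines of Python. This is a minimal illustration, not the author's tooling. The helper names (`rys_layer_order`, `layer_types`, `is_periodic`) are hypothetical, and the rule that original layer `k` is full attention when `(k + 1) % 4 == 0` is an assumption about the base model's layout; only the "full every 4 layers" interval comes from the README. The sketch follows the canonical form quoted in the removed lines and shows why one duplicated layer breaks a fixed `full_attention_interval = 4` rule.

```python
from typing import List


def rys_layer_order(num_layers: int, i: int, j: int) -> List[int]:
    """Return the execution order with the slice [i, j) run twice."""
    if not 0 <= i < j <= num_layers:
        raise ValueError("expected 0 <= i < j <= num_layers")
    return list(range(j)) + list(range(i, j)) + list(range(j, num_layers))


def layer_types(order: List[int], interval: int = 4) -> List[str]:
    """Map each original layer index to its attention type.

    Assumption: original layer k is full attention when (k + 1) % interval == 0.
    """
    return [
        "full_attention" if (k + 1) % interval == 0 else "linear_attention"
        for k in order
    ]


def is_periodic(types: List[str], interval: int = 4) -> bool:
    """Check whether a fixed-interval rule still predicts every layer type."""
    return all(
        (t == "full_attention") == ((pos + 1) % interval == 0)
        for pos, t in enumerate(types)
    )


N = 64  # original layer count stated in the README

old_order = rys_layer_order(N, 22, 24)  # previous card: layers 22 and 23 duplicated
new_order = rys_layer_order(N, 33, 34)  # updated card: layer 33 duplicated

# Number of (i, j) pairs with 0 <= i < j <= N
num_configs = sum(1 for i in range(N) for j in range(i + 1, N + 1))

print(len(old_order), old_order[22:28])  # 66 [22, 23, 22, 23, 24, 25]
print(len(new_order), new_order[32:36])  # 65 [32, 33, 33, 34]
print(num_configs)                       # 2080
print(is_periodic(layer_types(list(range(N)))))  # True
print(is_periodic(layer_types(new_order)))       # False
```

Under these assumptions, the 2,080 figure from the removed sweep section matches the constraint `0 <= i < j <= N` with `N = 64`. The final check suggests why a loader that derives attention types from a fixed interval would mislabel layers after the duplicated one, and why per-layer `layer_types` metadata avoids the problem.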