Commit 84a900a (verified), committed by DJLougen · 1 parent: e0088c5

Upload README.md with huggingface_hub

  1. README.md +144 -0
README.md ADDED
@@ -0,0 +1,144 @@
+ ---
+ base_model: GestaltLabs/Ornstein-3.6-27B
+ base_model_relation: finetune
+ datasets: []
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ tags:
+ - transformers
+ - safetensors
+ - qwen3_5
+ - qwen3.6
+ - multimodal
+ - image-text-to-text
+ - rys
+ - layer-duplication
+ - unsloth
+ language:
+ - en
+ ---
+
+ # Ornstein-3.6-27B-RYS
+
+ **Permanent RYS layer-duplication** of [Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B), the dense multimodal member of the Qwen 3.6 family with hybrid linear + full attention (Gated Delta Net).
+
+ This model applies the optimal **Retained-You-Seek (RYS)** configuration discovered by an exhaustive sweep over all 2,080 valid duplication configs: **layers 22 and 23 are duplicated**, expanding the network from 64 to **66 layers** with zero weight modification.
+
+ > **GGUF quantizations** (Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K) are available at **[GestaltLabs/Ornstein-3.6-27B-RYS-GGUF](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF)**.
+
+ ---
+
+ ## What is RYS?
+
+ **Retained-You-Seek (RYS)** (Ng, David Noel, 2026) is a zero-training architecture modification for deep transformers. Duplicating a contiguous slice of layers makes the model run that slice twice in a single forward pass, which deepens the network without changing any weights.
+
+ The canonical form is:
+ ```
+ new_layer_order = [0, 1, ..., j-1, i, i+1, ..., j-1, j, j+1, ..., N-1]
+ ```
+ where `N` is the original layer count and `0 <= i < j <= N`. Layers `i` through `j-1` run twice, so the network gains `j - i` layers.
+
+ For Ornstein-3.6-27B, the optimal config is **i=22, j=24**:
+ ```
+ [0..23, 22, 23, 24..63] → 66 layers total
+ ```
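The layer order above can be reproduced with a few lines of Python. This is a minimal sketch of the index arithmetic only. The helper name `rys_layer_order` is hypothetical and is not taken from any RYS tooling.

```python
def rys_layer_order(i: int, j: int, num_layers: int) -> list[int]:
    """Return the layer execution order after duplicating layers i..j-1."""
    if not 0 <= i < j <= num_layers:
        raise ValueError(f"need 0 <= i < j <= {num_layers}, got i={i}, j={j}")
    # Run layers 0..j-1, repeat the slice i..j-1 once,
    # then continue with the remaining layers j..num_layers-1.
    return list(range(j)) + list(range(i, j)) + list(range(j, num_layers))


order = rys_layer_order(22, 24, 64)
print(len(order))    # 66
print(order[20:28])  # [20, 21, 22, 23, 22, 23, 24, 25]
```

The printed window shows layers 22 and 23 appearing twice in a row before the pass continues at layer 24.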
+
+ ### Why it works
+
+ The sweep evaluates each config on two fast benchmarks:
+ - **Math** (GSM8k-style): measures reasoning stability
+ - **IFO** (IFO-Scan): measures instruction-following fidelity
+
+ The sweep selects the config that maximizes the **combined delta** (math delta + IFO delta). The winning config (i=22, j=24) scored `combined_delta = +0.3223`, with both math and IFO improving.
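The selection rule can be sketched as follows. It assumes each delta is measured against the unmodified base model. Only the `(22, 24)` entry uses the rounded deltas reported in this README. The other two entries are made-up placeholders, and the `deltas` dictionary layout is hypothetical rather than the sweep's actual output format.

```python
# Per-config score deltas, keyed by (i, j).
# Only (22, 24) reflects values reported in this README; the rest are placeholders.
deltas = {
    (22, 24): {"math": 0.010, "ifo": 0.312},
    (10, 12): {"math": 0.005, "ifo": -0.020},   # placeholder
    (40, 48): {"math": -0.030, "ifo": 0.100},   # placeholder
}


def combined_delta(d: dict) -> float:
    """Sum of the math and IFO deltas for one config."""
    return d["math"] + d["ifo"]


# Pick the config with the largest combined delta.
best = max(deltas, key=lambda cfg: combined_delta(deltas[cfg]))
print(best, round(combined_delta(deltas[best]), 3))  # (22, 24) 0.322
```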
+
+ ---
+
+ ## Architecture
+
+ | Property | Value |
+ |----------|-------|
+ | **Base model** | [GestaltLabs/Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B) |
+ | **Base architecture** | `Qwen3_5ForConditionalGeneration` |
+ | **Hidden size** | 5120 |
+ | **Original layers** | 64 |
+ | **RYS layers** | **66** (layers 22 & 23 duplicated) |
+ | **Attention heads** | 24 full / 4 KV / head_dim 256 |
+ | **Attention pattern** | Gated Delta Net (linear) + full SDPA, full every 4 layers |
+ | **Context length** | 262,144 tokens |
+ | **Parameters** | ~27.2B (minimal increase from 2 extra layer copies) |
+ | **License** | Apache 2.0 |
+
+ ### Layer type distribution (66 layers)
+
+ The duplicated layers preserve their original types:
+ - **Layers 22-23** (duplicated slice) are `linear_attention` layers
+ - All other layers retain their original `linear_attention` / `full_attention` pattern
+
+ ---
+
+ ## Usage
+
+ ### Transformers
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "GestaltLabs/Ornstein-3.6-27B-RYS"
+
+ # Load the 66-layer checkpoint; dtype and device placement are chosen automatically.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+
+ # Text-only generation from a plain prompt.
+ prompt = "Solve step by step: A train leaves station A at 60 mph..."
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=512)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```
+
+ ### llama.cpp (GGUF)
+
+ Grab a quant from the [GGUF repo](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF):
+
+ | Quant | Size | Use case |
+ |-------|------|----------|
+ | **Q8_0** | ~29 GB | Maximum quality, 48 GB VRAM |
+ | **Q6_K** | ~22 GB | Strong quality, 32-40 GB VRAM |
+ | **Q4_K_M** | ~16 GB | Balanced, 24 GB VRAM |
+ | **Q3_K_M** | ~9 GB | Budget 24 GB VRAM |
+ | **Q2_K** | ~7 GB | Extreme budget, CPU offload |
+
+ ```bash
+ # Example with llama.cpp
+ ./llama-cli -m Ornstein-3.6-27B-RYS-Q4_K_M.gguf -p "Explain RYS in one sentence."
+ ```
+
+ ---
+
+ ## RYS Sweep Details
+
+ - **Sweep space**: 2,080 configs (all pairs with 0 <= i < j <= 64)
+ - **Optimal config**: i=22, j=24
+ - **Combined delta**: +0.3223
+ - **Math delta**: +0.010
+ - **IFO delta**: +0.312
+ - **Citation**: Ng, David Noel (2026). *Retained-You-Seek*. https://dnhkng.github.io/posts/rys/
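The size of the sweep space follows from the constraint `0 <= i < j <= N` with `N = 64`. Each config is a choice of two distinct cut points out of 65 layer boundaries. The short check below is an illustrative sketch, not the sweep script itself.

```python
from itertools import combinations
from math import comb

NUM_LAYERS = 64

# Every valid config is a pair (i, j) with 0 <= i < j <= NUM_LAYERS,
# i.e. two distinct cut points chosen from NUM_LAYERS + 1 boundaries.
configs = list(combinations(range(NUM_LAYERS + 1), 2))

print(len(configs))             # 2080
print(comb(NUM_LAYERS + 1, 2))  # 2080
print((22, 24) in configs)      # True
```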
+
+ ---
+
+ ## Support This Work
+
+ This is self-funded research by a PhD student in visual neuroscience at the University of Toronto. GPU time for sweeps, surgery, and quantization comes out of pocket.
+
+ **[Support on Ko-fi](https://ko-fi.com/djlougen)**
+
+ ---
+
+ ## License
+
+ Apache 2.0, inherited from Qwen 3.6 and Ornstein-3.6-27B.
+
+ [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)