Instructions to use GestaltLabs/Ornstein-3.6-27B-RYS with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use GestaltLabs/Ornstein-3.6-27B-RYS with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="GestaltLabs/Ornstein-3.6-27B-RYS")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

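tokenizer = AutoTokenizer.from_pretrained("GestaltLabs/Ornstein-3.6-27B-RYS")
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" places the
# weights on the available devices (requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained(
    "GestaltLabs/Ornstein-3.6-27B-RYS",
    torch_dtype="auto",
    device_map="auto",
)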
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use GestaltLabs/Ornstein-3.6-27B-RYS with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "GestaltLabs/Ornstein-3.6-27B-RYS"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GestaltLabs/Ornstein-3.6-27B-RYS",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
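
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not part of the original instructions: `build_chat_request` and `chat` are hypothetical helper names, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: the standard OpenAI chat-completions layout.
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server, so it is left commented out):
# print(chat("http://localhost:8000",
#            "GestaltLabs/Ornstein-3.6-27B-RYS",
#            "What is the capital of France?"))
```

Because the SGLang server exposes the same OpenAI-compatible route, passing `http://localhost:30000` as `base_url` should work there as well.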

Use Docker

docker model run hf.co/GestaltLabs/Ornstein-3.6-27B-RYS

SGLang

How to use GestaltLabs/Ornstein-3.6-27B-RYS with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "GestaltLabs/Ornstein-3.6-27B-RYS" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GestaltLabs/Ornstein-3.6-27B-RYS",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "GestaltLabs/Ornstein-3.6-27B-RYS" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use GestaltLabs/Ornstein-3.6-27B-RYS with Docker Model Runner:
```
docker model run hf.co/GestaltLabs/Ornstein-3.6-27B-RYS
```

DJLougen committed 20 days ago

Commit 48d2f79 (verified) · 1 parent: cbd57f2

Update model card: add Canadian lab mission, Ko-fi, patched llama.cpp fork

Files changed (1)

README.md +15 -130

@@ -1,146 +1,31 @@
----
-base_model: GestaltLabs/Ornstein-3.6-27B
-base_model_relation: finetune
-datasets: []
-library_name: transformers
-license: apache-2.0
-pipeline_tag: image-text-to-text
-tags:
-- transformers
-- safetensors
-- qwen3_5
-- qwen3.6
-- multimodal
-- image-text-to-text
-- rys
-- layer-duplication
-- unsloth
-language:
-- en
----
-![Ornstein-3.6-27B-RYS](Ornstein3.6-27B-RYS.png)
 # Ornstein-3.6-27B-RYS
-**Permanent RYS layer-duplication** of [Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B), the dense multimodal member of the Qwen 3.6 family with hybrid linear + full attention (Gated Delta Net).
-This model applies the optimal **Retained-You-Seek (RYS)** configuration discovered by an exhaustive sweep over all 2,080 valid duplication configs: **layers 22 and 23 are duplicated**, expanding the network from 64 to **66 layers** with zero weight modification.
-> **GGUF quantizations** (Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K) are available at **[GestaltLabs/Ornstein-3.6-27B-RYS-GGUF](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF)**.
----
-## What is RYS?
-**Repeat-Your-Self (RYS)** — Ng, David Noel (2026) — is a zero-training architecture modification for deep transformers. By duplicating a contiguous slice of layers, the model revisits an earlier representation mid-pass, effectively deepening the network without changing any weights.
-The canonical form is:
-```
-new_layer_order = [0, 1, ..., j-1, i, i+1, ..., j-1, j, j+1, ..., N-1]
-```
-where `0 <= i < j <= N`.
-For Ornstein-3.6-27B, the optimal config is **i=22, j=24**:
-```
-[0..23, 22, 23, 24..63]   →   66 layers total
-```
-### Why it works
-The sweep evaluates each config on two fast benchmarks:
-- **Math** (GSM8k-style): measures reasoning stability
-- **IFO** (IFO-Scan): measures instruction-following fidelity
-The **combined delta** (math + IFO) is maximized. The winning config (i=22, j=24) scored `combined_delta = +0.3223`, with both math and IFO improving.
----
-## Architecture
-| Property | Value |
-|----------|-------|
-| **Base model** | [GestaltLabs/Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B) |
-| **Base architecture** | `Qwen3_5ForConditionalGeneration` |
-| **Hidden size** | 5120 |
-| **Original layers** | 64 |
-| **RYS layers** | **66** (layers 22 & 23 duplicated) |
-| **Attention heads** | 24 full / 4 KV / head_dim 256 |
-| **Attention pattern** | Gated Delta Net (linear) + full SDPA, full every 4 layers |
-| **Context length** | 262,144 tokens |
-| **Parameters** | ~27.2B (minimal increase from 2 extra layer copies) |
-| **License** | Apache 2.0 |
-### Layer type distribution (66 layers)
-The duplicated layers preserve their original types:
-- **Layers 22-23** (duplicated slice) are `linear_attention` layers
-- All other layers retain their original `linear_attention` / `full_attention` pattern
----
-## Usage
-### Transformers
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = "GestaltLabs/Ornstein-3.6-27B-RYS"
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype="auto",
-    device_map="auto",
-    trust_remote_code=True,
-)
-tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-prompt = "Solve step by step: A train leaves station A at 60 mph..."
-inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-out = model.generate(**inputs, max_new_tokens=512)
-print(tokenizer.decode(out[0], skip_special_tokens=True))
-```
-### llama.cpp (GGUF)
-Grab a quant from the [GGUF repo](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF):
-| Quant | Size | Use case |
-|-------|------|----------|
-| **Q8_0** | ~29 GB | Maximum quality, 48 GB VRAM |
-| **Q6_K** | ~22 GB | Strong quality, 32-40 GB VRAM |
-| **Q4_K_M** | ~16 GB | Balanced, 24 GB VRAM |
-| **Q3_K_M** | ~9 GB | Budget 24 GB VRAM |
-| **Q2_K** | ~7 GB | Extreme budget, CPU offload |
-```bash
-# Example with llama.cpp
-./llama-cli -m Ornstein-3.6-27B-RYS-Q4_K_M.gguf -p "Explain RYS in one sentence."
-```
----
-## RYS Sweep Details
-- **Sweep space**: 2,080 configs (i < j, 0..63)
-- **Optimal config**: i=22, j=24
-- **Combined delta**: +0.3223
-- **Math delta**: +0.010
-- **IFO delta**: +0.312
-- **Citation**: Ng, David Noel (2026). *Retained-You-Seek*. https://dnhkng.github.io/posts/rys/
----
 ## Support This Work
-I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
 **[Support on Ko-fi](https://ko-fi.com/djlougen)**
----
-## License
-Apache 2.0 — inherited from Qwen 3.6 and Ornstein-3.6-27B.
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

 # Ornstein-3.6-27B-RYS
+RYS-enhanced variant of the Ornstein-3.6-27B dense model. Layer 33 is duplicated using the **Repeat Your Self (RYS)** method, improving reasoning and instruction-following performance without increasing active parameter count at inference time.
+> **GGUF quantizations:** [GestaltLabs/Ornstein-3.6-27B-RYS-GGUF](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF)
+## About GestaltLabs
+We are a proudly Canadian research collective working to advance **sovereign Canadian AI** — open-weight models that Canadians (and everyone else) can run locally, study, and build on without dependence on closed foreign APIs. All training, fine-tuning, and quantization is done on local and self-funded compute. By supporting this work, you help keep frontier model development accessible, transparent, and under Canadian stewardship.
+## Running Locally
+This model requires a **patched llama.cpp** to load correctly. RYS breaks the hardcoded `full_attention_interval = 4` assumption in stock llama.cpp.
+**Use this patched fork:** https://github.com/DJLougen/llama.cpp/tree/rys-qwen35
+The fork now includes both per-layer `layer_types` support and an **SSM tensor probing fallback**, so even legacy GGUFs load correctly. It is fully backward-compatible with non-RYS Qwen3.5 models.
+## Model Details
+* **Architecture:** Qwen3.5 dense
+* **Parameters:** ~27B active
+* **Layers:** 65 (64 original + 1 RYS-duplicated layer 33)
+* **Context length:** 131,072 tokens
+* **License:** Apache-2.0
 ## Support This Work
+Our training compute is entirely self-funded. If this model is useful to you, consider supporting the lab:
 **[Support on Ko-fi](https://ko-fi.com/djlougen)**
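
The two README versions in this diff describe different duplications: the removed text repeats layers 22 and 23 (66 layers), while the added text repeats layer 33 (65 layers). The sketch below illustrates the layer ordering given in the removed "What is RYS?" section, the size of the sweep space it quotes, and why a loader that assumes a fixed `full_attention_interval` stops matching the real layout. It is an illustration under stated assumptions, not the author's code or llama.cpp's: all function names are hypothetical, and the position of the full-attention layers within each group of four is assumed.

```python
def rys_layer_order(num_layers, i, j):
    """Execution order after repeating the contiguous slice [i, j) once."""
    if not 0 <= i < j <= num_layers:
        raise ValueError("expected 0 <= i < j <= num_layers")
    return list(range(j)) + list(range(i, j)) + list(range(j, num_layers))


def count_configs(num_layers):
    """Number of valid (i, j) pairs with 0 <= i < j <= num_layers."""
    return sum(1 for i in range(num_layers) for j in range(i + 1, num_layers + 1))


def fixed_interval_types(num_layers, interval=4):
    # ASSUMPTION: full attention sits on every `interval`-th layer
    # (indices 3, 7, 11, ...); the real phase may differ.
    return [
        "full_attention" if (k + 1) % interval == 0 else "linear_attention"
        for k in range(num_layers)
    ]


def rys_layer_types(num_layers, i, j, interval=4):
    """Per-layer types after duplication: each copy keeps its source layer's type."""
    base = fixed_interval_types(num_layers, interval)
    return [base[k] for k in rys_layer_order(num_layers, i, j)]


old_order = rys_layer_order(64, 22, 24)  # removed README: layers 22 and 23 repeated
new_order = rys_layer_order(64, 33, 34)  # added README: layer 33 repeated
print(len(old_order), old_order[20:28])
print(len(new_order), new_order[32:37])
print(count_configs(64))

# Compare the real per-layer types with what a fixed-interval rule would predict.
print(rys_layer_types(64, 33, 34) == fixed_interval_types(65))
```

Inserting one or two layers shifts every later layer by a number of positions that is not a multiple of four, so the periodic rule mislabels layers after the duplicated slice. That is consistent with the added README's note that per-layer `layer_types` support is needed.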