Instructions to use GestaltLabs/Ornstein-3.6-27B-RYS with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use GestaltLabs/Ornstein-3.6-27B-RYS with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="GestaltLabs/Ornstein-3.6-27B-RYS")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GestaltLabs/Ornstein-3.6-27B-RYS")
model = AutoModelForCausalLM.from_pretrained("GestaltLabs/Ornstein-3.6-27B-RYS")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use GestaltLabs/Ornstein-3.6-27B-RYS with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "GestaltLabs/Ornstein-3.6-27B-RYS"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GestaltLabs/Ornstein-3.6-27B-RYS",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
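
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "GestaltLabs/Ornstein-3.6-27B-RYS"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the reply as a string. Passing `base_url="http://localhost:30000"` targets an SGLang server launched on port 30000 instead.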

Use Docker

docker model run hf.co/GestaltLabs/Ornstein-3.6-27B-RYS

SGLang

How to use GestaltLabs/Ornstein-3.6-27B-RYS with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "GestaltLabs/Ornstein-3.6-27B-RYS" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GestaltLabs/Ornstein-3.6-27B-RYS",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "GestaltLabs/Ornstein-3.6-27B-RYS" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GestaltLabs/Ornstein-3.6-27B-RYS",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use GestaltLabs/Ornstein-3.6-27B-RYS with Docker Model Runner:
```
docker model run hf.co/GestaltLabs/Ornstein-3.6-27B-RYS
```

DJLougen committed 24 days ago

Commit 84a900a (verified) · 1 Parent(s): e0088c5

Upload README.md with huggingface_hub

Files changed (1)

README.md (added) +144 -0

	@@ -0,0 +1,144 @@

+---
+base_model: GestaltLabs/Ornstein-3.6-27B
+base_model_relation: finetune
+datasets: []
+library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
+tags:
+- transformers
+- safetensors
+- qwen3_5
+- qwen3.6
+- multimodal
+- image-text-to-text
+- rys
+- layer-duplication
+- unsloth
+language:
+- en
+---
+# Ornstein-3.6-27B-RYS
+**Permanent RYS layer-duplication** of [Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B), the dense multimodal member of the Qwen 3.6 family with hybrid linear + full attention (Gated Delta Net).
+This model applies the optimal **Retained-You-Seek (RYS)** configuration discovered by an exhaustive sweep over all 2,080 valid duplication configs: **layers 22 and 23 are duplicated**, expanding the network from 64 to **66 layers** with zero weight modification.
+> **GGUF quantizations** (Q8_0, Q6_K, Q4_K_M, Q3_K_M, Q2_K) are available at **[GestaltLabs/Ornstein-3.6-27B-RYS-GGUF](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF)**.
+---
+## What is RYS?
+**Retained-You-Seek (RYS)** — Ng, David Noel (2026) — is a zero-training architecture modification for deep transformers. By duplicating a contiguous slice of layers, the model revisits an earlier representation mid-pass, effectively deepening the network without changing any weights.
+The canonical form is:
+```
+new_layer_order = [0, 1, ..., j-1, i, i+1, ..., j-1, j, j+1, ..., N-1]
+```
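+where `0 <= i < j <= N`, `N` is the original layer count, and the half-open slice `[i, j)` is the block of layers that runs twice.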
+For Ornstein-3.6-27B, the optimal config is **i=22, j=24**:
+```
+[0..23, 22, 23, 24..63]   →   66 layers total
+```
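
A minimal Python sketch of the canonical form above, assuming 0-indexed layers and a half-open duplicated slice `[i, j)`. The function name `rys_layer_order` is hypothetical. The sketch only builds the execution order and does not touch model weights.

```python
def rys_layer_order(n_layers, i, j):
    """Return the layer execution order after duplicating the slice [i, j)."""
    if not 0 <= i < j <= n_layers:
        raise ValueError("expected 0 <= i < j <= n_layers")
    first_pass = list(range(0, j))    # layers 0 .. j-1
    repeat = list(range(i, j))        # layers i .. j-1, run a second time
    rest = list(range(j, n_layers))   # layers j .. N-1
    return first_pass + repeat + rest


order = rys_layer_order(64, 22, 24)
print(len(order))      # 66
print(order[20:28])    # [20, 21, 22, 23, 22, 23, 24, 25]
```

The output matches the `[0..23, 22, 23, 24..63]` ordering stated above: 64 original layers plus the two repeated ones.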
+### Why it works
+The sweep evaluates each config on two fast benchmarks:
+- **Math** (GSM8k-style): measures reasoning stability
+- **IFO** (IFO-Scan): measures instruction-following fidelity
+The sweep selects the config that maximizes the **combined delta** (math delta + IFO delta). The winning config (i=22, j=24) scored `combined_delta = +0.3223`, with both math and IFO improving.
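
The selection rule can be sketched in a few lines of Python. The function names are hypothetical. Only the `(22, 24)` entry uses deltas reported in this README (math +0.010, IFO +0.312); the other two entries are made-up placeholders that exist purely to exercise the `max` call and are not sweep results.

```python
def combined_delta(math_delta, ifo_delta):
    """Combined score used to rank configs: math delta plus IFO delta."""
    return math_delta + ifo_delta


def best_config(results):
    """results maps (i, j) -> (math_delta, ifo_delta); return the top (i, j)."""
    return max(results, key=lambda cfg: combined_delta(*results[cfg]))


results = {
    (22, 24): (0.010, 0.312),   # deltas reported in this README
    (10, 12): (0.005, -0.020),  # hypothetical placeholder, not a measured value
    (40, 44): (-0.030, 0.100),  # hypothetical placeholder, not a measured value
}
winner = best_config(results)
print(winner, round(combined_delta(*results[winner]), 3))
```

The rounded component deltas sum to 0.322, which agrees with the reported `+0.3223` to three decimal places.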
+---
+## Architecture
+| Property | Value |
+|----------|-------|
+| **Base model** | [GestaltLabs/Ornstein-3.6-27B](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B) |
+| **Base architecture** | `Qwen3_5ForConditionalGeneration` |
+| **Hidden size** | 5120 |
+| **Original layers** | 64 |
+| **RYS layers** | **66** (layers 22 & 23 duplicated) |
+| **Attention heads** | 24 full / 4 KV / head_dim 256 |
+| **Attention pattern** | Gated Delta Net (linear) + full SDPA, full every 4 layers |
+| **Context length** | 262,144 tokens |
+| **Parameters** | ~27.2B (minimal increase from 2 extra layer copies) |
+| **License** | Apache 2.0 |
+### Layer type distribution (66 layers)
+The duplicated layers preserve their original types:
+- **Layers 22-23** (duplicated slice) are `linear_attention` layers
+- All other layers retain their original `linear_attention` / `full_attention` pattern
+---
+## Usage
+### Transformers
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_id = "GestaltLabs/Ornstein-3.6-27B-RYS"
+# Load the weights in the checkpoint's dtype and place them across available devices.
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype="auto",
+    device_map="auto",
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+# Tokenize a raw prompt (no chat template) and move it to the model's device.
+prompt = "Solve step by step: A train leaves station A at 60 mph..."
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+# Generate up to 512 new tokens; the decoded text includes the prompt.
+out = model.generate(**inputs, max_new_tokens=512)
+print(tokenizer.decode(out[0], skip_special_tokens=True))
+```
+### llama.cpp (GGUF)
+Grab a quant from the [GGUF repo](https://huggingface.co/GestaltLabs/Ornstein-3.6-27B-RYS-GGUF):
+| Quant | Size | Use case |
+|-------|------|----------|
+| **Q8_0** | ~29 GB | Maximum quality, 48 GB VRAM |
+| **Q6_K** | ~22 GB | Strong quality, 32-40 GB VRAM |
+| **Q4_K_M** | ~16 GB | Balanced, 24 GB VRAM |
+| **Q3_K_M** | ~9 GB | Budget 24 GB VRAM |
+| **Q2_K** | ~7 GB | Extreme budget, CPU offload |
+```bash
+# Example with llama.cpp
+./llama-cli -m Ornstein-3.6-27B-RYS-Q4_K_M.gguf -p "Explain RYS in one sentence."
+```
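
As a rough aid for choosing a quant, the sketch below picks the largest file from the table that fits a memory budget. The sizes are copied from the table. The function name `pick_quant` and the 4 GB default headroom (left for the KV cache and runtime overhead) are assumptions for illustration, not recommendations from the model authors.

```python
# Approximate file sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q8_0": 29,
    "Q6_K": 22,
    "Q4_K_M": 16,
    "Q3_K_M": 9,
    "Q2_K": 7,
}


def pick_quant(memory_gb, headroom_gb=4.0):
    """Return the largest quant whose file fits in memory_gb minus headroom, or None."""
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(pick_quant(24))   # Q4_K_M
print(pick_quant(48))   # Q8_0
```

File size alone ignores context length, so long prompts may need a larger headroom than the default used here.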
+---
+## RYS Sweep Details
+- **Sweep space**: 2,080 configs (all `(i, j)` with `0 <= i < j <= 64`)
+- **Optimal config**: i=22, j=24
+- **Combined delta**: +0.3223
+- **Math delta**: +0.010
+- **IFO delta**: +0.312
+- **Citation**: Ng, David Noel (2026). *Retained-You-Seek*. https://dnhkng.github.io/posts/rys/
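
The size of the sweep space can be checked by enumerating every `(i, j)` pair that satisfies `0 <= i < j <= N` for `N = 64`. This is a small sketch using the standard library; the variable names are illustrative.

```python
from itertools import combinations

N_LAYERS = 64

# Every (i, j) with 0 <= i < j <= N_LAYERS is one duplication config,
# so j may equal N_LAYERS (the duplicated slice can end at the last layer).
configs = list(combinations(range(N_LAYERS + 1), 2))

print(len(configs))          # 2080
print((22, 24) in configs)   # True
```

Restricting both indices to `0..63` would give only 2,016 pairs, so the 2,080 figure implies that `j` ranges up to 64.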
+---
+## Support This Work
+This is self-funded research by a PhD student in visual neuroscience at the University of Toronto. GPU time for sweeps, surgery, and quantization comes out of pocket.
+**[Support on Ko-fi](https://ko-fi.com/djlougen)**
+---
+## License
+Apache 2.0 — inherited from Qwen 3.6 and Ornstein-3.6-27B.
+[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)