# Ornstein-3.6-27B-GGUF
GGUF quantizations of GestaltLabs/Ornstein-3.6-27B — a Qwen 3.6 27B dense multimodal fine-tune with hybrid linear + full attention.
## Support This Work
I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee; it goes a long way toward keeping these experiments running.
## Model info

- Architecture: Qwen3_5ForConditionalGeneration (linear + full attention interleaved, Gated Delta Net; text path extracted for GGUF)
- Parameters: ~27B dense
- Context: 262,144 tokens
- Hidden size / layers: 5120 / 64
- Attention: 24 heads, 4 KV heads, head_dim 256
These GGUFs expose the text path only. For the multimodal variant use the full safetensors in the base repo.
## Quant index
Choose a quant that fits in your RAM/VRAM with room for context. For a dense 27B prefer Q4_K_M or higher on 24 GB cards; go to Q5_K_M or Q6_K if you have headroom.
| File | Bits | Notes |
|---|---|---|
| Ornstein-3.6-27B-Q8_0.gguf | 8 | Reference, near-lossless |
| Ornstein-3.6-27B-Q6_K.gguf | 6.5 | Great default for 32 GB+ systems |
| Ornstein-3.6-27B-Q5_K_M.gguf | 5.5 | Excellent quality/size balance |
| Ornstein-3.6-27B-Q5_K_S.gguf | 5.5 | Slightly smaller Q5 |
| Ornstein-3.6-27B-Q5_0.gguf | 5 | Legacy 5-bit |
| Ornstein-3.6-27B-Q4_K_M.gguf | 4.5 | Common 24 GB-card default |
| Ornstein-3.6-27B-Q4_K_S.gguf | 4.5 | Smaller Q4 |
| Ornstein-3.6-27B-Q4_0.gguf | 4 | Legacy 4-bit |
| Ornstein-3.6-27B-IQ4_NL.gguf | 4.25 | Non-linear 4-bit I-quant |
| Ornstein-3.6-27B-IQ4_XS.gguf | 4.25 | Smaller than Q4_K_S, comparable quality |
| Ornstein-3.6-27B-Q3_K_L.gguf | 3.5 | Largest Q3 |
| Ornstein-3.6-27B-Q3_K_M.gguf | 3.5 | Usable; quality below Q4 |
| Ornstein-3.6-27B-Q3_K_S.gguf | 3.5 | Smaller Q3 |
| Ornstein-3.6-27B-IQ3_M.gguf | 3.3 | Mixed I-quant, beats Q3_K_S at similar size |
| Ornstein-3.6-27B-IQ3_S.gguf | 3.1 | 3-bit I-quant |
| Ornstein-3.6-27B-IQ3_XS.gguf | 3.0 | Smaller 3-bit I-quant |
| Ornstein-3.6-27B-IQ3_XXS.gguf | 3.0 | Aggressive 3-bit |
| Ornstein-3.6-27B-Q2_K.gguf | 2.6 | Lowest K-quant; expect degraded quality |
BF16/F16 GGUF is not shipped here — if you want full precision, grab the safetensors from the base repo.
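As a rough sizing rule, the weight file comes to about parameter count × bits per weight / 8, and the KV cache grows linearly with context. Below is a minimal sketch of that arithmetic using the bit widths from the table and the layer/head counts from the model info above; it treats every layer as full attention, so the KV figure is an upper bound for this hybrid design.

```python
# Rough VRAM estimate: weights ≈ params * bpw / 8, KV cache ≈ 2 * layers * kv_heads * head_dim * bytes * tokens.
PARAMS = 27e9                              # ~27B dense
LAYERS, KV_HEADS, HEAD_DIM = 64, 4, 256    # from the model info above
KV_BYTES = 2                               # fp16 K/V entries

def weight_gib(bpw: float) -> float:
    return PARAMS * bpw / 8 / 2**30

def kv_cache_gib(n_ctx: int) -> float:
    # Upper bound: assumes every layer keeps a full-attention KV cache.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * n_ctx / 2**30

for name, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q6_K", 6.5)]:
    print(f"{name}: ~{weight_gib(bpw):.1f} GiB weights "
          f"+ ~{kv_cache_gib(8192):.1f} GiB KV @ 8k context")
```

At 4.5 bits the weights alone come to roughly 14 GiB, which is why Q4_K_M is the usual floor for 24 GB cards once context and runtime overhead are added.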
## Usage

### llama.cpp

```bash
# Interactive chat
llama-cli -m Ornstein-3.6-27B-Q4_K_M.gguf -cnv

# Single prompt
llama-cli -m Ornstein-3.6-27B-Q5_K_M.gguf -p "Write a haiku about hybrid attention."

# OpenAI-compatible server
llama-server -m Ornstein-3.6-27B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192
```
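Once llama-server is running, any OpenAI-compatible client can talk to it. Here is a minimal sketch using the requests library against the local endpoint started above; llama-server serves the single model it was launched with, so the model field is effectively just a label.

```python
# Query the OpenAI-compatible /v1/chat/completions endpoint exposed by llama-server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "Ornstein-3.6-27B-Q4_K_M",  # label only; the server answers with whatever it loaded
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```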
### Other runners
LM Studio, Ollama (via a Modelfile), koboldcpp, and text-generation-webui all load these GGUFs provided their bundled llama.cpp supports Qwen3_5ForConditionalGeneration with Gated Delta Net.
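### llama-cpp-python

The GGUFs can also be pulled straight from the Hub with llama-cpp-python (the same architecture-support caveat applies). A minimal sketch; any filename from the quant index works, Q4_K_M shown here:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Downloads the chosen quant from the Hub on first use, then loads it.
llm = Llama.from_pretrained(
    repo_id="GestaltLabs/Ornstein-3.6-27B-GGUF",
    filename="Ornstein-3.6-27B-Q4_K_M.gguf",  # any file from the quant index
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
print(out["choices"][0]["message"]["content"])
```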
## Reproducing the quants

```bash
# 1. Convert safetensors → BF16 GGUF
python llama.cpp/convert_hf_to_gguf.py <model_dir> \
  --outtype bf16 --outfile Ornstein-3.6-27B-BF16.gguf

# 2. Quantize (example)
llama-quantize Ornstein-3.6-27B-BF16.gguf \
  Ornstein-3.6-27B-Q4_K_M.gguf Q4_K_M
```
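To produce the full set, a loop over the types in the quant index works. The sketch below is an illustration, not the exact script used here, and it omits the importance-matrix step that low-bit I-quants usually benefit from.

```python
# Batch-quantize the BF16 GGUF into every type listed in the quant index.
import subprocess

QUANT_TYPES = [
    "Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q5_0",
    "Q4_K_M", "Q4_K_S", "Q4_0", "IQ4_NL", "IQ4_XS",
    "Q3_K_L", "Q3_K_M", "Q3_K_S", "IQ3_M", "IQ3_S",
    "IQ3_XS", "IQ3_XXS", "Q2_K",
]

SOURCE = "Ornstein-3.6-27B-BF16.gguf"
for q in QUANT_TYPES:
    subprocess.run(
        ["llama-quantize", SOURCE, f"Ornstein-3.6-27B-{q}.gguf", q],
        check=True,
    )
```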
## License
Apache 2.0 — inherited from the Qwen 3.6 base release.