llm.create_chat_completion(
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
)Dense 27B model with vision, thinking, and tool use — self-speculative decoding,
configurable KV cache (f16 for quality, q8_0/q4_0 for longer context), fixed Jinja template (tool calls and thinking actually work in C++ runtimes),
and a server with both OpenAI and Anthropic APIs.
One command. Both APIs. No cloud.
Warning: Vision (image input) crashes llama.cpp when used with MTP speculative decoding (PR #22673 bug, all platforms). Text-only MTP works at 2.5x speed. For vision, start the server without
--spec-type mtp— see the Vision section.
Start the server
You need llama.cpp built from PR #22673 or newer. Homebrew and stable releases do not support MTP GGUFs.
Build llama.cpp with MTP support
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server
llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
That's it. Three optimizations in one command:
| Flag | What it does | Impact |
|---|---|---|
--spec-type mtp --spec-draft-n-max 3 |
Multi-Token Prediction (built into the model) | 2.5x faster generation |
--cache-type-k q8_0 --cache-type-v q8_0 |
8-bit KV cache (instead of 16-bit) | Half the KV memory, negligible quality loss |
-c 262144 |
262K context window | Full native context on 48 GB Mac with q8_0 KV |
Adjust -m, -c, and --cache-type-k/v for your hardware — see the Which quant should I download? table below.
Which quant should I download?
Find your hardware below — each row gives the best quant, KV cache type, and max context that fits.
Apple Silicon
Qwen3.6-27B is a hybrid model — only 16 of 65 layers use KV cache (verified). The other 48 are linear attention (fixed 898 MiB recurrent state). KV memory is ~4× less than a standard dense model. Runtimes that don't handle this (e.g. vllm) allocate KV for all 65 layers and show much higher memory usage.
Numbers below are total memory used (model + KV cache + 0.9 GB recurrent state). Must leave ≥ 8 GB for macOS (16 GB Macs excepted).
| RAM | Quant | KV cache | Max context | Total used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M |
q8_0 |
42K | 12.0 GB | ✗ |
| 24 GB | IQ3_M |
46K | 16.0 GB | ✗ | |
| 24 GB | IQ3_M |
q8_0 |
91K | 16.0 GB | ✗ |
| 32 GB | Q5_K_M |
74K | 24.0 GB | ✗ | |
| 32 GB | Q5_K_M |
q8_0 |
147K | 24.0 GB | ✗ |
| 32 GB | Q4_K_M |
99K | 24.0 GB | ✓ | |
| 48 GB | Q6_K |
262K | 39.7 GB | ✓ | |
| 48 GB | Q8_0 |
173K | 40.0 GB | ✓ | |
| 48 GB | Q8_0 |
q8_0 |
262K | 37.3 GB | ✓ |
| 64 GB | Q8_0 |
262K | 45.8 GB | ✓ | |
| 96 GB | Q8_0 |
262K | 45.8 GB | ✓ |
NVIDIA GPU
Same model memory as Apple Silicon, plus ~1 GB CUDA overhead.
| VRAM | Quant | KV cache | Max context | Total VRAM used | Vision |
|---|---|---|---|---|---|
| 12 GB | IQ2_M |
q8_0 |
11K | 12.0 GB | ✗ |
| 16 GB | IQ3_M |
30K | 16.0 GB | ✗ | |
| 16 GB | IQ3_M |
q8_0 |
60K | 16.0 GB | ✗ |
| 24 GB | Q4_K_M |
83K | 24.0 GB | ✓ | |
| 24 GB | Q4_K_M |
q8_0 |
167K | 24.0 GB | ✓ |
| 24 GB | Q5_K_M |
58K | 24.0 GB | ✗ | |
| 48 GB | Q6_K |
262K | 40.7 GB | ✓ | |
| 48 GB | Q8_0 |
262K | 46.8 GB | ✓ | |
| 80 GB | Q8_0 |
262K | 46.8 GB | ✓ |
16 GB Mac:
IQ2_M/q8_0 — 42K text-only. No vision.24 GB Mac:
IQ3_M— 46K (f16 KV) or 91K (q8_0). Vision at 32–65K.32 GB Mac:
Q5_K_M— 74K text-only (f16 KV), 147K (q8_0).Q4_K_Mfor vision at 99K.48 GB Mac:
Q6_K/f16 KV — 262K with vision.Q8_0/q8_0 KV for 262K at higher model quality.64 GB+ Mac:
Q8_0/f16 KV — 262K with vision. Maximum quality at practical speed.12 GB GPU:
IQ2_M/q8_0 — 11K. Very limited, no vision.16 GB GPU:
IQ3_M— 30K (f16 KV) or 60K (q8_0). No vision.24 GB GPU:
Q4_K_M— 83K with vision (f16 KV).Q5_K_M— 58K text-only (f16 KV), 116K (q8_0).48 GB+ GPU:
Q6_K/f16 KV — 262K with vision.Q8_0for max quality.
Leave KV cache at f16 (blank column) for best quality. Use q8_0 KV only when f16 doesn't give enough context. q4_0 KV should not exceed 64K context.
Vision adds ~0.9 GB for mmproj. macOS needs ≥ 8 GB for itself (16 GB Macs excepted — use ~4 GB). You can increase available memory by raising the wired memory limit, e.g. for a 96 GB Mac: sudo sysctl iogpu.wired_limit_mb=90112 (88 GB). NVIDIA reserves ~1 GB for CUDA.
API usage
OpenAI-compatible (/v1/chat/completions)
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}]}'
Works with any OpenAI client — just point it at http://localhost:8081/v1.
Anthropic-compatible (/v1/messages)
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{"model":"qwen","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'
Works with any Anthropic client — the server natively speaks the Messages API with streaming, tool use, and vision.
Claude Code
ANTHROPIC_BASE_URL=http://127.0.0.1:8081 claude
Claude Code uses the Anthropic Messages API. With this one env var, it talks to your local Qwen3.6-27B instead of the cloud.
Tool use (both APIs)
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}],
"messages": [{"role": "user", "content": "What is the weather in Paris?"}]
}'
Vision
MTP + vision crashes on PR #22673 (all platforms, confirmed bug). For image inputs, start the server without
--spec-type mtp:llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \ --mmproj mmproj-Qwen3.6-27B-f16.gguf \ --cache-type-k q8_0 --cache-type-v q8_0 \ -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
curl http://localhost:8081/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"max_tokens": 1024,
"messages": [{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "'$(base64 < photo.jpg)'"}},
{"type": "text", "text": "Describe this image"}
]}]
}'
Direct CLI usage
# Text generation
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type mtp --spec-draft-n-max 3 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 -c 4096 -n 2048 --temp 0.7 -ngl 99 \
-p "Your prompt here"
# Vision (MTP does not work with images — omit --spec-type mtp)
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--cache-type-k q8_0 --cache-type-v q8_0 \
-c 4096 -n 2048 --temp 0.7 -ngl 99 \
--image photo.jpg \
-p "Describe this image"
KV cache options
The --cache-type-k and --cache-type-v flags control KV cache precision. Lower precision = less memory = longer context on the same hardware.
| Type | Bits/val | KV size (80K ctx) | Quality | Speed | When to use |
|---|---|---|---|---|---|
f16 |
16 | 5.3 GB | Full | Baseline | Best quality — use when RAM allows |
q8_0 |
8 | 2.8 GB | Negligible loss | Faster than f16 | When f16 KV doesn't give enough context |
q4_0 |
4 | 1.5 GB | Minor loss | Slightly slower | Max context on limited RAM (≤64K only) |
Recommendation: Leave KV at f16 for best quality. Use q8_0 when f16 doesn't give enough context. Reserve q4_0 for tight RAM — and only up to 64K context.
Effect on hardware requirements (Q5_K_M, 80K context):
| KV type | Model + recurrent + KV | Hardware |
|---|---|---|
| f16 | 24 GB | 48 GB Mac |
| q8_0 | 22 GB | 32 GB Mac |
Speculative decoding modes
MTP (recommended — 2.5x faster)
The model predicts 5 extra tokens per step using its own MTP heads, then verifies them in one pass. No extra model needed.
--spec-type mtp --spec-draft-n-max 3 -np 1
MTP currently requires -np 1 (single-sequence mode). Without it, you'll get: MTP currently supports only n_parallel=1.
Tune --spec-draft-n-max: 3 is optimal for general use (83% acceptance rate). Values of 1–2 are more conservative; 4–5 waste compute on rejected tokens.
Draft model (~2.3x faster)
Pair with a smaller Qwen 3.5/3.6 model that shares the same tokenizer.
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
-md Qwen3.5-0.8B-Q8_0.gguf \
--spec-draft-n-max 10 -ngl 99 -ngld 99 \
-c 4096 -n 2048 --temp 0.7 \
-p "Your prompt"
ngram-mod (no extra model, benefits repeat prompts)
Uses cached n-grams from previous prompts.
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64 \
--repeat-penalty 1.0
Downloads
| File | Size | Min. (4K ctx) | Recommended (80K ctx) | Max (262K ctx) |
|---|---|---|---|---|
Qwen3.6-27B-F16-mtp.gguf |
51 GB | 64 GB Mac · 80 GB GPU | 64 GB Mac · 80 GB GPU | 96 GB Mac · 80 GB GPU |
Qwen3.6-27B-Q8_0-mtp.gguf |
27 GB | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-Q6_K-mtp.gguf |
21 GB | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-Q5_K_M-mtp.gguf |
18 GB | 32 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-Q4_K_M-mtp.gguf |
16 GB | 32 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU |
Qwen3.6-27B-IQ4_XS-mtp.gguf |
14 GB | 24 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 32 GB Mac · 48 GB GPU |
Qwen3.6-27B-IQ3_M-mtp.gguf |
12 GB | 24 GB Mac · 16 GB GPU | 24 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU |
Qwen3.6-27B-IQ2_M-mtp.gguf |
10 GB | 16 GB Mac · 16 GB GPU | 24 GB Mac · 16 GB GPU | 32 GB Mac · 24 GB GPU |
mmproj-Qwen3.6-27B-f16.gguf |
885 MB | Vision encoder (optional, any tier) | — | — |
All tiers include MTP heads. F16 and Q8_0 are direct conversions; all other tiers were quantized from Q8_0 with an importance matrix. Q5_K_M is the sweet spot — use Q4_K_M if you're tight on RAM, Q8_0 if you want maximum quality. F16 is available for experimentation but is significantly slower than Q8_0. GPU means NVIDIA (RTX 3060 = 12 GB, RTX 3090/4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB).
Hardware numbers assume f16 KV for "Min." (4K) and q8_0 KV for "Recommended" (80K) and "Max" (262K). Add --cache-type-k q8_0 --cache-type-v q8_0 to reach the recommended or max context on smaller hardware.
Memory requirements
Approximate VRAM on Apple Silicon (unified memory), using Q5_K_M as reference. Includes 0.9 GB recurrent state (constant, does not scale with context). Only 16 of 65 layers use KV cache — the other 48 use linear attention.
| Context | Model | KV (f16) | KV (q8_0) | Total (f16) | Total (q8_0) | Min. Mac |
|---|---|---|---|---|---|---|
| 4K | 18 GB | 0.3 GB | 0.1 GB | 19 GB | 19 GB | 32 GB |
| 8K | 18 GB | 0.5 GB | 0.3 GB | 19 GB | 19 GB | 32 GB |
| 32K | 18 GB | 2.1 GB | 1.0 GB | 20 GB | 20 GB | 32 GB |
| 64K | 18 GB | 4.1 GB | 2.1 GB | 21 GB | 21 GB | 32 GB |
| 80K (recommended) | 18 GB | 5.2 GB | 2.6 GB | 22 GB | 22 GB | 32 GB |
| 128K | 18 GB | 8.3 GB | 4.1 GB | 25 GB | 23 GB | 32 GB |
| 262K (max native) | 18 GB | 17.0 GB | 8.5 GB | 34 GB | 27 GB | 48 GB |
"Total" = model + recurrent state + KV cache. macOS needs ≥ 8 GB (16 GB Macs excepted). With vision: add 0.9 GB for the mmproj.
Memory for all quant tiers (4K context, q8_0 KV)
| Quant | Model | KV + recurrent | Total | Min. Mac |
|---|---|---|---|---|
| Q8_0 | 27 GB | 1.0 GB | 28 GB | 48 GB |
| Q6_K | 21 GB | 1.0 GB | 22 GB | 32 GB |
| Q5_K_M | 18 GB | 1.0 GB | 19 GB | 32 GB |
| Q4_K_M | 16 GB | 1.0 GB | 17 GB | 32 GB |
| IQ4_XS | 14 GB | 1.0 GB | 15 GB | 24 GB |
| IQ3_M | 12 GB | 1.0 GB | 13 GB | 24 GB |
| IQ2_M | 10 GB | 1.0 GB | 11 GB | 16 GB |
System prompt
The first line must be:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
The model underperforms without it. Append anything after that line.
Thinking toggle
Drop <|think_on|> or <|think_off|> in any message to toggle thinking. The template strips the tag so the model never sees it.
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
Fast answer, no internal reasoning.
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
The model thinks step by step, then answers.
Sampling
From the official Qwen authors. Reserve 128K+ context for thinking mode.
| Mode | temp | top_p | top_k | repeat_penalty |
|---|---|---|---|---|
| Thinking (coding) | 0.6 | 0.95 | 20 | 1.0 |
| Thinking (general) | 1.0 | 0.95 | 20 | 1.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.0 |
Compatibility
| Runtime | Status | Why |
|---|---|---|
| llama.cpp (PR #22673+) | Works fully | This is the target runtime |
| llama.cpp (stable / homebrew) | Does not load | missing tensor — MTP heads not recognized |
| LM Studio | Does not load | Same issue — bundled llama.cpp rejects MTP GGUFs |
| Ollama | Does not load | No speculative decoding support yet |
| koboldcpp | Unknown | Depends on bundled llama.cpp version |
LM Studio users: use the MLX 8-bit or MLX 4-bit instead — full vision + tools + thinking, no MTP.
Chat template fixes
The bundled Jinja template fixes several bugs in the official Qwen 3.6 template:
- Tool calls crash on C++ engines. The official template uses Python's
|itemsfilter and|safe, which don't exist in C++ Jinja runtimes (llama.cpp, LM Studio). This template uses direct dictionary key lookups. - The
developerrole crashes. Modern APIs sendmessage.role == "developer". The official template throws an exception. This template maps it tosystem. - Empty
preserve_thinkingspam. The official template wraps every past turn in empty<think/>blocks, wasting context tokens. This template only emits thinking blocks with actual content. </thinking>hallucination handling. The model sometimes generates</thinking>instead of the expected closing tag. Both are handled gracefully.
See Qwen-Fixed-Chat-Templates for the standalone template repo.
Architecture details
| Spec | Value |
|---|---|
| Total params | 27.8B (dense, all active) |
| Layers | 64 (3x linear attention + 1x full attention, 16 repetitions) + 1 MTP layer |
| Attention | 24 Q heads, 4 KV heads (GQA), head_dim 256 |
| Linear attention | 16 QK heads, 48 V heads, head_dim 128 |
| FFN | intermediate_size 17408 |
| Context | 262K native, 1M+ with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens |
| Multi-token prediction | 1 MTP draft layer (15 tensors) |
| model_type | qwen3_5 |
Conversion details
Converted from official Qwen3.6-27B safetensors using the modified convert_hf_to_gguf.py from llama.cpp PR #22673. Standard converters skip MTP tensors — the PR includes them. Q8_0 is the direct conversion; all lower tiers were quantized from it.
Quantized using llama-quantize with unsloth's importance matrix (calibrated with chat template at 6K–12K context, 76 chunks, 496 entries). I-quant tiers keep MTP tensors at Q8_0 for stability.
The chat template was replaced with the fixed version from Qwen-Fixed-Chat-Templates before conversion.
Links
- Original model
- MLX 8-bit (LM Studio, Apple Silicon native, no MTP)
- MLX 4-bit
- Fixed chat templates
- Qwen3.6 blog post
Authorship
| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| GGUF conversion + MTP + vision + fixed chat template + quantization | froggeric |
| Importance matrix | unsloth |
License
Apache-2.0, inherited from Qwen3.6.
- Downloads last month
- 41,122
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
Model tree for froggeric/Qwen3.6-27B-MTP-GGUF
Base model
Qwen/Qwen3.6-27B
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="froggeric/Qwen3.6-27B-MTP-GGUF", filename="", )