Qwen3.6-27B with MTP

2.5x faster with MTP · 262K context on 48 GB · Fixed chat template

Dense 27B model with vision, thinking, and tool use: self-speculative decoding,
configurable KV cache (f16 for quality, q8_0/q4_0 for longer context), a fixed Jinja template (so tool calls and thinking actually work in C++ runtimes),
and a server that speaks both the OpenAI and Anthropic APIs.

One command. Both APIs. No cloud.


Warning: Vision (image input) crashes llama.cpp when used with MTP speculative decoding (PR #22673 bug, all platforms). Text-only MTP works at 2.5x speed. For vision, start the server without --spec-type mtp — see the Vision section.


Start the server

You need llama.cpp built from PR #22673 or newer. Homebrew and stable releases do not support MTP GGUFs.

Build llama.cpp with MTP support
git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/22673/head:mtp-pr && git checkout mtp-pr

cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-cli llama-server

Then start the server:

llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081

That's it. Three optimizations in one command:

| Flag | What it does | Impact |
|---|---|---|
| --spec-type mtp --spec-draft-n-max 3 | Multi-Token Prediction (built into the model) | 2.5x faster generation |
| --cache-type-k q8_0 --cache-type-v q8_0 | 8-bit KV cache (instead of 16-bit) | Half the KV memory, negligible quality loss |
| -c 262144 | 262K context window | Full native context on 48 GB Mac with q8_0 KV |

Adjust -m, -c, and --cache-type-k/v for your hardware — see the Which quant should I download? table below.


Which quant should I download?

Find your hardware below — each row gives the best quant, KV cache type, and max context that fits.

Apple Silicon

Qwen3.6-27B is a hybrid model — only 16 of 65 layers use KV cache (verified). The other 48 are linear attention (fixed 898 MiB recurrent state). KV memory is ~4× less than a standard dense model. Runtimes that don't handle this (e.g. vllm) allocate KV for all 65 layers and show much higher memory usage.

Numbers below are total memory used (model + KV cache + 0.9 GB recurrent state). Leave ≥ 8 GB free for macOS (16 GB Macs excepted).

| RAM | Quant | KV cache | Max context | Total used | Vision |
|---|---|---|---|---|---|
| 16 GB | IQ2_M | q8_0 | 42K | 12.0 GB | |
| 24 GB | IQ3_M | | 46K | 16.0 GB | |
| 24 GB | IQ3_M | q8_0 | 91K | 16.0 GB | |
| 32 GB | Q5_K_M | | 74K | 24.0 GB | |
| 32 GB | Q5_K_M | q8_0 | 147K | 24.0 GB | |
| 32 GB | Q4_K_M | | 99K | 24.0 GB | |
| 48 GB | Q6_K | | 262K | 39.7 GB | |
| 48 GB | Q8_0 | | 173K | 40.0 GB | |
| 48 GB | Q8_0 | q8_0 | 262K | 37.3 GB | |
| 64 GB | Q8_0 | | 262K | 45.8 GB | |
| 96 GB | Q8_0 | | 262K | 45.8 GB | |

NVIDIA GPU

Same model memory as Apple Silicon, plus ~1 GB CUDA overhead.

| VRAM | Quant | KV cache | Max context | Total VRAM used | Vision |
|---|---|---|---|---|---|
| 12 GB | IQ2_M | q8_0 | 11K | 12.0 GB | |
| 16 GB | IQ3_M | | 30K | 16.0 GB | |
| 16 GB | IQ3_M | q8_0 | 60K | 16.0 GB | |
| 24 GB | Q4_K_M | | 83K | 24.0 GB | |
| 24 GB | Q4_K_M | q8_0 | 167K | 24.0 GB | |
| 24 GB | Q5_K_M | | 58K | 24.0 GB | |
| 48 GB | Q6_K | | 262K | 40.7 GB | |
| 48 GB | Q8_0 | | 262K | 46.8 GB | |
| 80 GB | Q8_0 | | 262K | 46.8 GB | |

16 GB Mac: IQ2_M/q8_0 — 42K text-only. No vision.

24 GB Mac: IQ3_M — 46K (f16 KV) or 91K (q8_0). Vision at 32–65K.

32 GB Mac: Q5_K_M — 74K text-only (f16 KV), 147K (q8_0). Q4_K_M for vision at 99K.

48 GB Mac: Q6_K/f16 KV — 262K with vision. Q8_0/q8_0 KV for 262K at higher model quality.

64 GB+ Mac: Q8_0/f16 KV — 262K with vision. Maximum quality at practical speed.

12 GB GPU: IQ2_M/q8_0 — 11K. Very limited, no vision.

16 GB GPU: IQ3_M — 30K (f16 KV) or 60K (q8_0). No vision.

24 GB GPU: Q4_K_M — 83K with vision (f16 KV). Q5_K_M — 58K text-only (f16 KV), 116K (q8_0).

48 GB+ GPU: Q6_K/f16 KV — 262K with vision. Q8_0 for max quality.

Leave KV cache at f16 (blank column) for best quality. Use q8_0 KV only when f16 doesn't give enough context. q4_0 KV should not exceed 64K context.

Vision adds ~0.9 GB for mmproj. macOS needs ≥ 8 GB for itself (16 GB Macs excepted — use ~4 GB). You can increase available memory by raising the wired memory limit, e.g. for a 96 GB Mac: sudo sysctl iogpu.wired_limit_mb=90112 (88 GB). NVIDIA reserves ~1 GB for CUDA.


API usage

OpenAI-compatible (/v1/chat/completions)

curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","messages":[{"role":"user","content":"Hello"}]}'

Works with any OpenAI client — just point it at http://localhost:8081/v1.
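The same request can be made from Python with only the standard library. A minimal sketch — chat_payload and send are hypothetical helper names, and the send call assumes the server from the Start the server section is running on port 8081:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8081/v1"  # llama-server from the command above

def chat_payload(prompt, model="qwen", temperature=0.7):
    """Build an OpenAI-style chat completions request body."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload):
    """POST the payload to the local server and return the parsed JSON."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = chat_payload("Hello")
# reply = send(payload)  # requires the server running on :8081
# print(reply["choices"][0]["message"]["content"])
```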

Anthropic-compatible (/v1/messages)

curl http://localhost:8081/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","max_tokens":1024,"messages":[{"role":"user","content":"Hello"}]}'

Works with any Anthropic client — the server natively speaks the Messages API with streaming, tool use, and vision.

Claude Code

ANTHROPIC_BASE_URL=http://127.0.0.1:8081 claude

Claude Code uses the Anthropic Messages API. With this one env var, it talks to your local Qwen3.6-27B instead of the cloud.

Tool use (both APIs)

curl http://localhost:8081/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'
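When the model decides to call the tool, the response content contains a tool_use block; you run the tool yourself and send a tool_result back. A sketch of that round trip — the get_weather implementation and its return values are hypothetical stand-ins for the tool declared in the curl example:

```python
import json

# Hypothetical local implementation of the get_weather tool declared above.
def get_weather(location):
    return {"location": location, "temp_c": 18, "conditions": "cloudy"}

TOOLS = {"get_weather": get_weather}

def tool_results(response_content):
    """Run each tool_use block from an Anthropic-style response locally and
    build the follow-up user message carrying tool_result blocks."""
    results = []
    for block in response_content:
        if block["type"] != "tool_use":
            continue
        output = TOOLS[block["name"]](**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": json.dumps(output),
        })
    return {"role": "user", "content": results}

# Shape of the response content when the model stops to call the tool:
sample = [{"type": "tool_use", "id": "toolu_01", "name": "get_weather",
           "input": {"location": "Paris"}}]
print(tool_results(sample))
```

Append this message to the conversation and call /v1/messages again; the model then answers using the tool output.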

Vision

MTP + vision crashes on PR #22673 (all platforms, confirmed bug). For image inputs, start the server without --spec-type mtp:

llama-server -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 262144 --temp 0.7 --top-k 20 -ngl 99 --port 8081
curl http://localhost:8081/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": [
      {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "'$(base64 < photo.jpg)'"}},
      {"type": "text", "text": "Describe this image"}
    ]}]
  }'
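The same image request body can be built in Python, which avoids shell quoting around the base64 blob. A sketch mirroring the curl example (image_request is a hypothetical helper name):

```python
import base64

def image_request(image_bytes, prompt, media_type="image/jpeg"):
    """Build the Anthropic Messages body used in the curl example above:
    one user turn with a base64 image block followed by a text block."""
    return {
        "model": "qwen",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": media_type,
                        "data": base64.b64encode(image_bytes).decode()}},
            {"type": "text", "text": prompt},
        ]}],
    }

# body = image_request(open("photo.jpg", "rb").read(), "Describe this image")
```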

Direct CLI usage

# Text generation
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --spec-type mtp --spec-draft-n-max 3 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 -c 4096 -n 2048 --temp 0.7 -ngl 99 \
  -p "Your prompt here"

# Vision (MTP does not work with images — omit --spec-type mtp)
llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  --mmproj mmproj-Qwen3.6-27B-f16.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 4096 -n 2048 --temp 0.7 -ngl 99 \
  --image photo.jpg \
  -p "Describe this image"

KV cache options

The --cache-type-k and --cache-type-v flags control KV cache precision. Lower precision = less memory = longer context on the same hardware.

| Type | Bits/val | KV size (80K ctx) | Quality | Speed | When to use |
|---|---|---|---|---|---|
| f16 | 16 | 5.3 GB | Full | Baseline | Best quality — use when RAM allows |
| q8_0 | 8 | 2.8 GB | Negligible loss | Faster than f16 | When f16 KV doesn't give enough context |
| q4_0 | 4 | 1.5 GB | Minor loss | Slightly slower | Max context on limited RAM (≤64K only) |

Recommendation: Leave KV at f16 for best quality. Use q8_0 when f16 doesn't give enough context. Reserve q4_0 for tight RAM — and only up to 64K context.

Effect on hardware requirements (Q5_K_M, 80K context):

| KV type | Model + recurrent + KV | Hardware |
|---|---|---|
| f16 | 24 GB | 48 GB Mac |
| q8_0 | 22 GB | 32 GB Mac |
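These sizes follow directly from the attention geometry listed under Architecture details (16 full-attention layers, 4 KV heads, head_dim 256). A back-of-envelope sketch — the per-value sizes assume llama.cpp's block layouts (q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32), and it reproduces the table above to within rounding:

```python
# KV cache sizing for Qwen3.6-27B: only the 16 full-attention layers store
# KV, each with 4 KV heads (GQA) of head_dim 256.
KV_LAYERS, KV_HEADS, HEAD_DIM = 16, 4, 256
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_ctx, cache_type="f16"):
    per_token = KV_LAYERS * 2 * KV_HEADS * HEAD_DIM  # 2 = one K + one V value
    return per_token * n_ctx * BYTES_PER_VALUE[cache_type]

for t in ("f16", "q8_0", "q4_0"):
    print(f"{t:>5}: {kv_cache_bytes(81920, t) / 1e9:.2f} GB at 80K context")
```

The linear-attention layers add only the fixed 0.9 GB recurrent state, which is why KV memory here is ~4x smaller than a standard dense 64-layer model of the same size.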

Speculative decoding modes

MTP (recommended — 2.5x faster)

The model drafts extra tokens each step (up to --spec-draft-n-max) using its own MTP head, then verifies them in one pass. No extra model needed.

--spec-type mtp --spec-draft-n-max 3 -np 1

MTP currently requires -np 1 (single-sequence mode). Without it, you'll get the error "MTP currently supports only n_parallel=1".

Tune --spec-draft-n-max: 3 is optimal for general use (83% acceptance rate). Values of 1–2 are more conservative; 4–5 waste compute on rejected tokens.
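Why 3 is the sweet spot can be seen with the standard speculative-decoding estimate: with per-token acceptance rate a, a draft of n tokens yields an expected 1 + a + a² + … + aⁿ emitted tokens per verification pass, so returns diminish geometrically while draft cost grows linearly. This is an illustrative model, not a profile of llama.cpp's scheduler; the 83% figure is the acceptance rate quoted above:

```python
def expected_tokens(n_draft, accept_rate):
    """Expected tokens emitted per verification pass, assuming each drafted
    token is accepted independently with probability accept_rate: the target
    model always contributes one token, plus the unbroken run of accepted
    drafts (1 + a + a^2 + ... + a^n)."""
    return sum(accept_rate ** k for k in range(n_draft + 1))

# With the ~83% acceptance rate reported for this model:
for n in range(1, 6):
    print(f"--spec-draft-n-max {n}: ~{expected_tokens(n, 0.83):.2f} tokens/pass")
```

Each step from n=3 to n=5 buys less than half a token per pass while drafting (and mostly discarding) extra tokens, which is why 4–5 waste compute.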

Draft model (~2.3x faster)

Pair with a smaller Qwen 3.5/3.6 model that shares the same tokenizer.

llama-cli -m Qwen3.6-27B-Q5_K_M-mtp.gguf \
  -md Qwen3.5-0.8B-Q8_0.gguf \
  --spec-draft-n-max 10 -ngl 99 -ngld 99 \
  -c 4096 -n 2048 --temp 0.7 \
  -p "Your prompt"

ngram-mod (no extra model, benefits repeat prompts)

Uses cached n-grams from previous prompts.

--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64 \
--repeat-penalty 1.0

Downloads

| File | Size | Min. (4K ctx) | Recommended (80K ctx) | Max (262K ctx) |
|---|---|---|---|---|
| Qwen3.6-27B-F16-mtp.gguf | 51 GB | 64 GB Mac · 80 GB GPU | 64 GB Mac · 80 GB GPU | 96 GB Mac · 80 GB GPU |
| Qwen3.6-27B-Q8_0-mtp.gguf | 27 GB | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU |
| Qwen3.6-27B-Q6_K-mtp.gguf | 21 GB | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU | 48 GB Mac · 48 GB GPU |
| Qwen3.6-27B-Q5_K_M-mtp.gguf | 18 GB | 32 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU |
| Qwen3.6-27B-Q4_K_M-mtp.gguf | 16 GB | 32 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 48 GB Mac · 48 GB GPU |
| Qwen3.6-27B-IQ4_XS-mtp.gguf | 14 GB | 24 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU | 32 GB Mac · 48 GB GPU |
| Qwen3.6-27B-IQ3_M-mtp.gguf | 12 GB | 24 GB Mac · 16 GB GPU | 24 GB Mac · 24 GB GPU | 32 GB Mac · 24 GB GPU |
| Qwen3.6-27B-IQ2_M-mtp.gguf | 10 GB | 16 GB Mac · 16 GB GPU | 24 GB Mac · 16 GB GPU | 32 GB Mac · 24 GB GPU |
| mmproj-Qwen3.6-27B-f16.gguf | 885 MB | Vision encoder (optional, any tier) | | |

All tiers include MTP heads. F16 and Q8_0 are direct conversions; all other tiers were quantized from Q8_0 with an importance matrix. Q5_K_M is the sweet spot — use Q4_K_M if you're tight on RAM, Q8_0 if you want maximum quality. F16 is available for experimentation but is significantly slower than Q8_0. GPU means NVIDIA (RTX 3060 = 12 GB, RTX 3090/4090 = 24 GB, A6000 = 48 GB, A100 = 80 GB).

Hardware numbers assume f16 KV for "Min." (4K) and q8_0 KV for "Recommended" (80K) and "Max" (262K). Add --cache-type-k q8_0 --cache-type-v q8_0 to reach the recommended or max context on smaller hardware.


Memory requirements

Approximate VRAM on Apple Silicon (unified memory), using Q5_K_M as reference. Includes 0.9 GB recurrent state (constant, does not scale with context). Only 16 of 65 layers use KV cache — the other 48 use linear attention.

| Context | Model | KV (f16) | KV (q8_0) | Total (f16) | Total (q8_0) | Min. Mac (q8_0 KV) |
|---|---|---|---|---|---|---|
| 4K | 18 GB | 0.3 GB | 0.1 GB | 19 GB | 19 GB | 32 GB |
| 8K | 18 GB | 0.5 GB | 0.3 GB | 19 GB | 19 GB | 32 GB |
| 32K | 18 GB | 2.1 GB | 1.0 GB | 21 GB | 20 GB | 32 GB |
| 64K | 18 GB | 4.1 GB | 2.1 GB | 23 GB | 21 GB | 32 GB |
| 80K (recommended) | 18 GB | 5.2 GB | 2.6 GB | 24 GB | 22 GB | 32 GB |
| 128K | 18 GB | 8.3 GB | 4.1 GB | 27 GB | 23 GB | 32 GB |
| 262K (max native) | 18 GB | 17.0 GB | 8.5 GB | 36 GB | 27 GB | 48 GB |

"Total" = model + recurrent state + KV cache. macOS needs ≥ 8 GB (16 GB Macs excepted). With vision: add 0.9 GB for the mmproj.

Memory for all quant tiers (4K context, q8_0 KV)

| Quant | Model | KV + recurrent | Total | Min. Mac |
|---|---|---|---|---|
| Q8_0 | 27 GB | 1.0 GB | 28 GB | 48 GB |
| Q6_K | 21 GB | 1.0 GB | 22 GB | 32 GB |
| Q5_K_M | 18 GB | 1.0 GB | 19 GB | 32 GB |
| Q4_K_M | 16 GB | 1.0 GB | 17 GB | 32 GB |
| IQ4_XS | 14 GB | 1.0 GB | 15 GB | 24 GB |
| IQ3_M | 12 GB | 1.0 GB | 13 GB | 24 GB |
| IQ2_M | 10 GB | 1.0 GB | 11 GB | 16 GB |

System prompt

The first line must be:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

The model underperforms without it. Append anything after that line.


Thinking toggle

Drop <|think_on|> or <|think_off|> in any message to toggle thinking. The template strips the tag so the model never sees it.

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Fast answer, no internal reasoning.

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

The model thinks step by step, then answers.


Sampling

Settings recommended by the official Qwen authors. Budget 128K+ context for thinking mode.

| Mode | temp | top_p | top_k | repeat_penalty |
|---|---|---|---|---|
| Thinking (coding) | 0.6 | 0.95 | 20 | 1.0 |
| Thinking (general) | 1.0 | 0.95 | 20 | 1.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.0 |
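These presets can also be applied per request rather than server-wide. A sketch: temperature and top_p follow the OpenAI chat completions schema; top_k and repeat_penalty are llama.cpp server extensions, so verify your build accepts them in the request body:

```python
# Sampling presets from the table above, applied per request.
PRESETS = {
    "thinking_coding":  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.0},
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "repeat_penalty": 1.0},
    "non_thinking":     {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "repeat_penalty": 1.0},
}

def request_body(prompt, mode="non_thinking", model="qwen"):
    """Build a chat request body carrying the chosen sampling preset."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update(PRESETS[mode])
    return body

print(request_body("Implement a red-black tree in Rust.", mode="thinking_coding"))
```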

Compatibility

| Runtime | Status | Why |
|---|---|---|
| llama.cpp (PR #22673+) | Works fully | This is the target runtime |
| llama.cpp (stable / homebrew) | Does not load | Missing tensor — MTP heads not recognized |
| LM Studio | Does not load | Same issue — bundled llama.cpp rejects MTP GGUFs |
| Ollama | Does not load | No speculative decoding support yet |
| koboldcpp | Unknown | Depends on bundled llama.cpp version |

LM Studio users: use the MLX 8-bit or MLX 4-bit builds instead — full vision + tools + thinking, no MTP.


Chat template fixes

The bundled Jinja template fixes several bugs in the official Qwen 3.6 template:

  • Tool calls crash on C++ engines. The official template uses Python's |items filter and |safe, which don't exist in C++ Jinja runtimes (llama.cpp, LM Studio). This template uses direct dictionary key lookups.
  • The developer role crashes. Modern APIs send message.role == "developer". The official template throws an exception. This template maps it to system.
  • Empty preserve_thinking spam. The official template wraps every past turn in empty <think/> blocks, wasting context tokens. This template only emits thinking blocks with actual content.
  • </thinking> hallucination handling. The model sometimes generates </thinking> instead of the expected closing tag. Both are handled gracefully.

See Qwen-Fixed-Chat-Templates for the standalone template repo.


Architecture details
| Spec | Value |
|---|---|
| Total params | 27.8B (dense, all active) |
| Layers | 64 (3x linear attention + 1x full attention, 16 repetitions) + 1 MTP layer |
| Attention | 24 Q heads, 4 KV heads (GQA), head_dim 256 |
| Linear attention | 16 QK heads, 48 V heads, head_dim 128 |
| FFN intermediate_size | 17408 |
| Context | 262K native, 1M+ with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens |
| Multi-token prediction | 1 MTP draft layer (15 tensors) |
| model_type | qwen3_5 |

Conversion details

Converted from official Qwen3.6-27B safetensors using the modified convert_hf_to_gguf.py from llama.cpp PR #22673. Standard converters skip MTP tensors — the PR includes them. Q8_0 is the direct conversion; all lower tiers were quantized from it.

Quantized using llama-quantize with unsloth's importance matrix (calibrated with chat template at 6K–12K context, 76 chunks, 496 entries). I-quant tiers keep MTP tensors at Q8_0 for stability.

The chat template was replaced with the fixed version from Qwen-Fixed-Chat-Templates before conversion.



Authorship

| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| GGUF conversion + MTP + vision + fixed chat template + quantization | froggeric |
| Importance matrix | unsloth |

License

Apache-2.0, inherited from Qwen3.6.
