Qwen3.6-27B NVFP4 GGUF β MTP variant
NVFP4 GGUF quantizations of Qwen/Qwen3.6-27B with Multi-Token Prediction (MTP) support for speculative decoding in llama.cpp.
Same NVFP4 trunk layout as our Qwen3.6-27B-NVFP4-GGUF repo, plus the MTP draft head extracted from the source for use with --spec-type draft-mtp. The MTP head lets llama.cpp speculatively draft the next 1β4 tokens at near-zero cost and verify them in a single trunk forward pass β yielding ~+22% token-generation throughput for single-stream serving on an RTX 5090.
About LibertAI
LibertAI is a decentralized AI platform β private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers.
If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw β Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.
Files
| File | Size | FFN | Other tensors | When to pick |
|---|---|---|---|---|
Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf |
15 GB | NVFP4 | Q4_K_M | Recommended trunk. Pair with the MTP draft for the best perf/VRAM ratio |
Qwen3.6-27B-NVFP4-Q8_0-mtp.gguf |
19 GB | NVFP4 | Q8_0 | Higher-precision attention/embeddings if you have the VRAM |
Qwen3.6-27B-NVFP4-BF16-mtp.gguf |
28 GB | NVFP4 | BF16 | Max quality (source-precision non-FFN tensors); slower in practice |
mtp-Qwen3.6-27B-NVFP4.gguf |
5.6 GB | BF16 | BF16/F32 | MTP draft head β required for --spec-type draft-mtp |
mmproj-Qwen3.6-27B-F16.gguf |
889 MB | β | F16 vision tower | Required for image/video input β reusable across all Qwen3.6-27B GGUFs |
The trunk files are built with convert_hf_to_gguf.py --no-mtp, so the MTP weights are split out into the separate mtp-*.gguf file. This lets llama.cpp load trunk + draft as a paired model and lets future llama.cpp versions handle the draft on a separate device/stream.
Performance
Measured on an NVIDIA RTX 5090 (32 GB, Blackwell, sm_120, CUDA 13.0), llama.cpp build dbe9c0c8c.
Single-stream serving, 28-token prompt β 512-token completion (n=3, prod settings: -c 262144 -fa on -ctk q4_0 -ctv q4_0 -ub 256 -b 1024 --parallel 1):
| Config | TG (tok/s) | Draft accept rate | Ξ vs no-MTP |
|---|---|---|---|
| NVFP4-Q4_K_M, no MTP | 74.4 | β | baseline |
| NVFP4-Q4_K_M + MTP (n-max=4, p-min=0.5) | 90.8 | 69.5% | +22% |
MTP gives a clean win for this dense model β the draft head is small enough that its forward pass is near-free relative to the trunk, and the ~70% accept rate means most drafts land.
Real-world workloads
The 28-token synthetic prompt above is adversarial for MTP β it forces the model into free-form generation where draft accept rates are lowest. On natural workloads (chat, code completion, structured output, anything with predictable phrasing) we observe ~115 tok/s on the same hardware/config, corresponding to ~82β85% draft accept. Treat the +22% bench number as a floor; real gains are typically larger.
Usage
Server (single-stream, recommended)
llama-server \
-m Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf \
--model-draft mtp-Qwen3.6-27B-NVFP4.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 4 \
--spec-draft-p-min 0.5 \
-ngl 999 -ngld 999 \
-fa on -c 32768 \
--host 0.0.0.0 --port 8080
The -ngl and -ngld flags push the trunk and the draft to GPU respectively. --spec-draft-n-max 4 drafts up to 4 tokens per cycle; --spec-draft-p-min 0.5 only commits drafts whose probability is above 0.5.
Multimodal (vision + text)
llama-server \
-m Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf \
--model-draft mtp-Qwen3.6-27B-NVFP4.gguf \
--spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.5 \
--mmproj mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 -ngld 999 -c 32768 \
--host 0.0.0.0 --port 8080
Recommended sampler
Qwen3.6 is a thinking model. The default chat template enables <think> blocks. For non-thinking usage, set chat_template_kwargs.enable_thinking=false in the API.
What is MTP?
Multi-Token Prediction adds a small extra block to the model (here, 1 layer at index 64) that runs alongside the main forward pass and produces a candidate next-token distribution from the same hidden state. llama.cpp uses this as a free draft for speculative decoding: it generates up to N candidate tokens, then verifies all of them in a single trunk forward pass. Accepted drafts ride along for free, so the effective per-token cost drops by the accept rate.
The MTP path landed in llama.cpp via PR #22673; NVFP4 scale-tensor support for the MTP block landed in #23563.
Architecture notes
Qwen3.6-27B is a hybrid attention + SSM dense model: every 4th layer is conventional attention; the remaining 48 of 64 layers use Mamba-style linear_attn blocks. The NVFP4 source from mmangkad keeps the SSM in_proj_* projections and standard attention projections at higher precision β only the FFN matmul (192 tensors) is NVFP4. The MTP block at layer 64 contains its own conventional attention + dense FFN + the nextn heads (eh_proj, enorm, hnorm, shared_head_norm), kept at BF16 since it's small.
Sources & credits
- Base model: Qwen/Qwen3.6-27B by Alibaba Qwen team β Apache 2.0
- NVFP4 calibration source: mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt v0.42.0)
- mmproj source: official BF16 weights from
Qwen/Qwen3.6-27B - Tooling: llama.cpp
convert_hf_to_gguf.py --no-mtp/--mtpandllama-quantize
License
Apache 2.0, inherited from the upstream model.
- Downloads last month
- -
4-bit
8-bit
16-bit
Model tree for LibertAIDAI/Qwen3.6-27B-NVFP4-MTP-GGUF
Base model
Qwen/Qwen3.6-27B