Qwen3.6-27B NVFP4 GGUF — MTP variant

NVFP4 GGUF quantizations of Qwen/Qwen3.6-27B with Multi-Token Prediction (MTP) support for speculative decoding in llama.cpp.

Same NVFP4 trunk layout as our Qwen3.6-27B-NVFP4-GGUF repo, plus the MTP draft head extracted from the source for use with --spec-type draft-mtp. The MTP head lets llama.cpp speculatively draft the next 1–4 tokens at near-zero cost and verify them in a single trunk forward pass — yielding ~+22% token-generation throughput for single-stream serving on an RTX 5090.

About LibertAI

LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers.

If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw — Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.

Files

File	Size	FFN	Other tensors	When to pick
`Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf`	15 GB	NVFP4	Q4_K_M	Recommended trunk. Pair with the MTP draft for the best perf/VRAM ratio
`Qwen3.6-27B-NVFP4-Q8_0-mtp.gguf`	19 GB	NVFP4	Q8_0	Higher-precision attention/embeddings if you have the VRAM
`Qwen3.6-27B-NVFP4-BF16-mtp.gguf`	28 GB	NVFP4	BF16	Max quality (source-precision non-FFN tensors); slower in practice
`mtp-Qwen3.6-27B-NVFP4.gguf`	5.6 GB	BF16	BF16/F32	MTP draft head — required for `--spec-type draft-mtp`
`mmproj-Qwen3.6-27B-F16.gguf`	889 MB	—	F16 vision tower	Required for image/video input — reusable across all Qwen3.6-27B GGUFs

The trunk files are built with convert_hf_to_gguf.py --no-mtp, so the MTP weights are split out into the separate mtp-*.gguf file. This lets llama.cpp load trunk + draft as a paired model and lets future llama.cpp versions handle the draft on a separate device/stream.

Performance

Measured on an NVIDIA RTX 5090 (32 GB, Blackwell, sm_120, CUDA 13.0), llama.cpp build dbe9c0c8c.

Single-stream serving, 28-token prompt → 512-token completion (n=3, prod settings: -c 262144 -fa on -ctk q4_0 -ctv q4_0 -ub 256 -b 1024 --parallel 1):

Config	TG (tok/s)	Draft accept rate	Δ vs no-MTP
NVFP4-Q4_K_M, no MTP	74.4	—	baseline
NVFP4-Q4_K_M + MTP (n-max=4, p-min=0.5)	90.8	69.5%	+22%

MTP gives a clean win for this dense model — the draft head is small enough that its forward pass is near-free relative to the trunk, and the ~70% accept rate means most drafts land.

Real-world workloads

The 28-token synthetic prompt above is adversarial for MTP — it forces the model into free-form generation where draft accept rates are lowest. On natural workloads (chat, code completion, structured output, anything with predictable phrasing) we observe ~115 tok/s on the same hardware/config, corresponding to ~82–85% draft accept. Treat the +22% bench number as a floor; real gains are typically larger.

Usage

Server (single-stream, recommended)

llama-server \
  -m Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf \
  --model-draft mtp-Qwen3.6-27B-NVFP4.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --spec-draft-p-min 0.5 \
  -ngl 999 -ngld 999 \
  -fa on -c 32768 \
  --host 0.0.0.0 --port 8080

The -ngl and -ngld flags push the trunk and the draft to GPU respectively. --spec-draft-n-max 4 drafts up to 4 tokens per cycle; --spec-draft-p-min 0.5 only commits drafts whose probability is above 0.5.

Multimodal (vision + text)

llama-server \
  -m Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf \
  --model-draft mtp-Qwen3.6-27B-NVFP4.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.5 \
  --mmproj mmproj-Qwen3.6-27B-F16.gguf \
  -ngl 999 -ngld 999 -c 32768 \
  --host 0.0.0.0 --port 8080

Recommended sampler

Qwen3.6 is a thinking model. The default chat template enables <think> blocks. For non-thinking usage, set chat_template_kwargs.enable_thinking=false in the API.

What is MTP?

Multi-Token Prediction adds a small extra block to the model (here, 1 layer at index 64) that runs alongside the main forward pass and produces a candidate next-token distribution from the same hidden state. llama.cpp uses this as a free draft for speculative decoding: it generates up to N candidate tokens, then verifies all of them in a single trunk forward pass. Accepted drafts ride along for free, so the effective per-token cost drops by the accept rate.

The MTP path landed in llama.cpp via PR #22673; NVFP4 scale-tensor support for the MTP block landed in #23563.

Architecture notes

Qwen3.6-27B is a hybrid attention + SSM dense model: every 4th layer is conventional attention; the remaining 48 of 64 layers use Mamba-style linear_attn blocks. The NVFP4 source from mmangkad keeps the SSM in_proj_* projections and standard attention projections at higher precision — only the FFN matmul (192 tensors) is NVFP4. The MTP block at layer 64 contains its own conventional attention + dense FFN + the nextn heads (eh_proj, enorm, hnorm, shared_head_norm), kept at BF16 since it's small.

Sources & credits

Base model: Qwen/Qwen3.6-27B by Alibaba Qwen team — Apache 2.0
NVFP4 calibration source: mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt v0.42.0)
mmproj source: official BF16 weights from Qwen/Qwen3.6-27B
Tooling: llama.cpp convert_hf_to_gguf.py --no-mtp / --mtp and llama-quantize

License

Apache 2.0, inherited from the upstream model.

Downloads last month: -

GGUF

Hardware compatibility

4-bit

8-bit

16-bit

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for LibertAIDAI/Qwen3.6-27B-NVFP4-MTP-GGUF

Base model

Qwen/Qwen3.6-27B

Quantized

(394)

this model