Qwen3.6-27B NVFP4 GGUF β€” MTP variant

NVFP4 GGUF quantizations of Qwen/Qwen3.6-27B with Multi-Token Prediction (MTP) support for speculative decoding in llama.cpp.

Same NVFP4 trunk layout as our Qwen3.6-27B-NVFP4-GGUF repo, plus the MTP draft head extracted from the source for use with --spec-type draft-mtp. The MTP head lets llama.cpp speculatively draft the next 1–4 tokens at near-zero cost and verify them in a single trunk forward pass β€” yielding ~+22% token-generation throughput for single-stream serving on an RTX 5090.

About LibertAI

LibertAI is a decentralized AI platform β€” private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers.

If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw β€” Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.

Files

File Size FFN Other tensors When to pick
Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf 15 GB NVFP4 Q4_K_M Recommended trunk. Pair with the MTP draft for the best perf/VRAM ratio
Qwen3.6-27B-NVFP4-Q8_0-mtp.gguf 19 GB NVFP4 Q8_0 Higher-precision attention/embeddings if you have the VRAM
Qwen3.6-27B-NVFP4-BF16-mtp.gguf 28 GB NVFP4 BF16 Max quality (source-precision non-FFN tensors); slower in practice
mtp-Qwen3.6-27B-NVFP4.gguf 5.6 GB BF16 BF16/F32 MTP draft head β€” required for --spec-type draft-mtp
mmproj-Qwen3.6-27B-F16.gguf 889 MB β€” F16 vision tower Required for image/video input β€” reusable across all Qwen3.6-27B GGUFs

The trunk files are built with convert_hf_to_gguf.py --no-mtp, so the MTP weights are split out into the separate mtp-*.gguf file. This lets llama.cpp load trunk + draft as a paired model and lets future llama.cpp versions handle the draft on a separate device/stream.

Performance

Measured on an NVIDIA RTX 5090 (32 GB, Blackwell, sm_120, CUDA 13.0), llama.cpp build dbe9c0c8c.

Single-stream serving, 28-token prompt β†’ 512-token completion (n=3, prod settings: -c 262144 -fa on -ctk q4_0 -ctv q4_0 -ub 256 -b 1024 --parallel 1):

Config TG (tok/s) Draft accept rate Ξ” vs no-MTP
NVFP4-Q4_K_M, no MTP 74.4 β€” baseline
NVFP4-Q4_K_M + MTP (n-max=4, p-min=0.5) 90.8 69.5% +22%

MTP gives a clean win for this dense model β€” the draft head is small enough that its forward pass is near-free relative to the trunk, and the ~70% accept rate means most drafts land.

Real-world workloads

The 28-token synthetic prompt above is adversarial for MTP β€” it forces the model into free-form generation where draft accept rates are lowest. On natural workloads (chat, code completion, structured output, anything with predictable phrasing) we observe ~115 tok/s on the same hardware/config, corresponding to ~82–85% draft accept. Treat the +22% bench number as a floor; real gains are typically larger.

Usage

Server (single-stream, recommended)

llama-server \
  -m Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf \
  --model-draft mtp-Qwen3.6-27B-NVFP4.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --spec-draft-p-min 0.5 \
  -ngl 999 -ngld 999 \
  -fa on -c 32768 \
  --host 0.0.0.0 --port 8080

The -ngl and -ngld flags push the trunk and the draft to GPU respectively. --spec-draft-n-max 4 drafts up to 4 tokens per cycle; --spec-draft-p-min 0.5 only commits drafts whose probability is above 0.5.

Multimodal (vision + text)

llama-server \
  -m Qwen3.6-27B-NVFP4-Q4_K_M-mtp.gguf \
  --model-draft mtp-Qwen3.6-27B-NVFP4.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 --spec-draft-p-min 0.5 \
  --mmproj mmproj-Qwen3.6-27B-F16.gguf \
  -ngl 999 -ngld 999 -c 32768 \
  --host 0.0.0.0 --port 8080

Recommended sampler

Qwen3.6 is a thinking model. The default chat template enables <think> blocks. For non-thinking usage, set chat_template_kwargs.enable_thinking=false in the API.

What is MTP?

Multi-Token Prediction adds a small extra block to the model (here, 1 layer at index 64) that runs alongside the main forward pass and produces a candidate next-token distribution from the same hidden state. llama.cpp uses this as a free draft for speculative decoding: it generates up to N candidate tokens, then verifies all of them in a single trunk forward pass. Accepted drafts ride along for free, so the effective per-token cost drops by the accept rate.

The MTP path landed in llama.cpp via PR #22673; NVFP4 scale-tensor support for the MTP block landed in #23563.

Architecture notes

Qwen3.6-27B is a hybrid attention + SSM dense model: every 4th layer is conventional attention; the remaining 48 of 64 layers use Mamba-style linear_attn blocks. The NVFP4 source from mmangkad keeps the SSM in_proj_* projections and standard attention projections at higher precision β€” only the FFN matmul (192 tensors) is NVFP4. The MTP block at layer 64 contains its own conventional attention + dense FFN + the nextn heads (eh_proj, enorm, hnorm, shared_head_norm), kept at BF16 since it's small.

Sources & credits

  • Base model: Qwen/Qwen3.6-27B by Alibaba Qwen team β€” Apache 2.0
  • NVFP4 calibration source: mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt v0.42.0)
  • mmproj source: official BF16 weights from Qwen/Qwen3.6-27B
  • Tooling: llama.cpp convert_hf_to_gguf.py --no-mtp / --mtp and llama-quantize

License

Apache 2.0, inherited from the upstream model.

Downloads last month
-
GGUF
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for LibertAIDAI/Qwen3.6-27B-NVFP4-MTP-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(394)
this model