---
language:
  - en
license: apache-2.0
library_name: mlx
base_model: Qwen/Qwen3.6-27B
tags:
  - mlx
  - apple-silicon
  - speculative-decoding
  - mtp
  - multi-token-prediction
  - qwen3
  - qwen
  - mtplx
pipeline_tag: text-generation
---

# Qwen3.6-27B MTPLX Optimized

## Run this with MTPLX

**MTPLX** is an MLX-native runtime for Multi-Token-Prediction (MTP) speculative decoding on Apple Silicon. It reaches up to **2.24× faster decode** at realistic coding sampling settings (`temp=0.6 / top_p=0.95 / top_k=20`) using the model's own built-in MTP heads, with no external drafter and no fallback to greedy-only decoding.

```bash
pip install mtplx
mtplx start
```

**Project:** [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)

**Other MTPLX checkpoints:**

- [Qwen3.6-27B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed) — 4-bit flagship speed (63 TPS on M5 Max)
- [Qwen3.5-4B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed) — small 4-bit speed-test
- [Qwen3.5-4B-Optimized-MTPLX](https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX) — small 8-bit

---

This artifact pairs the Qwen3.6-27B trunk, MLX-quantized with MTPLX's `gdn8-speed4` policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms), with a **calibrated INT4 Multi-Token-Prediction sidecar**. The MTP head is what enables *native* speculative decoding: the model drafts its own tokens, with no external draft model required.

MTPLX accepts those draft tokens with **mathematically exact** probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (`temperature=0.6`, `top_p=0.95`, `top_k=20`) — not just greedy.
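For readers unfamiliar with probability-ratio acceptance, the following is a minimal sketch of the standard speculative-sampling rule for a single drafted token. It is not MTPLX's code: the function names `speculative_accept` and `sample_one` are hypothetical, and the distributions are assumed to be already filtered (temperature, `top_p`, `top_k`) and normalised over a shared vocabulary.

```python
import random

def speculative_accept(draft_token, p_target, q_draft, rng=random):
    """Accept or replace one drafted token (illustrative sketch, not MTPLX's API).

    p_target / q_draft: probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft with probability min(1, p/q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Only reachable with degenerate inputs; fall back to the target.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False

def sample_one(p_target, q_draft, rng=random):
    """Draw a draft from q, then apply the acceptance rule."""
    draft = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    token, _ = speculative_accept(draft, p_target, q_draft, rng)
    return token
```

Each token is emitted with probability `min(p, q)` through acceptance plus `max(0, p - q)` through the residual, which sums to `p`. That is why the output follows the target distribution even when the draft distribution differs from it.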

You can also:

- Inspect the architecture and MTP tensors with any `safetensors` reader.
- Use the trunk weights with [`mlx-lm`](https://github.com/ml-explore/mlx-lm) for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by `mlx-lm`).
- Read the calibration / quantization metadata in `mtplx_runtime.json` and `config.json` to understand the build.
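As one way to inspect the tensors without any ML framework, the sketch below reads only the JSON header of a `.safetensors` file using the Python standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `list_tensors` are illustrative, not part of any library.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def list_tensors(path):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    header = read_safetensors_header(path)
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }
```

Calling `list_tensors("mtp.safetensors")` from the checkpoint directory should list the sidecar's tensor names, dtypes, and shapes. `mtplx_runtime.json` and `config.json` are plain JSON and can be read with `json.load`.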

## What's in this checkpoint

| Component | Format |
| --- | --- |
| Trunk text + vision weights | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms |
| MTP head sidecar (`mtp.safetensors`) | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms |
| Vision encoder (`model-vision-*.safetensors`) | BF16, intact for multimodal use |
| Runtime contract (`mtplx_runtime.json`) | Pins architecture, recommended profile, and exactness baseline |
| Tokenizer + chat template | Qwen3.6 vocabulary (248k tokens) |

The MTP head is grafted from a separately calibrated INT4 sidecar (`Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3`) onto the MTPLX-specific GDN8-Speed4 trunk. This combination achieves higher draft acceptance than a BF16 MTP head at depths 2, 3, and 4 (D2/D3/D4) under MTPLX's committed-history cache contract.

## MTP draft acceptance

These numbers describe the **MTP head's draft quality** — a property of the model itself, independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at `temperature=0.6, top_p=0.95, top_k=20`:

| Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp) |
| --- | --- | --- |
| 1 | **97.62%** | 92.7% |
| 2 | **95.24%** | 77.0% |
| 3 | **88.10%** | 63.0% |
| 4 | **75.61%** | 50.9% |
| 5 | — | 43.0% |

At every depth measured here (1 through 4), acceptance is higher than that of vLLM's MTP-5 implementation on the same Qwen3.6 family. Measurements use `long_code` 192-token prompts; depth 5 was not measured for this checkpoint.
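A rough way to read the per-depth figures is to convert them into expected tokens per verification pass. The sketch below assumes each figure is the probability of accepting that draft given that all earlier drafts were accepted; the card does not state how the rates are conditioned, so treat this as an illustration. The name `expected_tokens_per_pass` is hypothetical.

```python
from itertools import accumulate
from operator import mul

# Per-depth acceptance for this checkpoint, copied from the table above.
ACCEPTANCE = [0.9762, 0.9524, 0.8810, 0.7561]

def expected_tokens_per_pass(acceptance):
    """Expected tokens emitted per verify pass under the stated assumption.

    The running product gives P(first k drafts all accepted). One extra
    token is always emitted: the corrected token after a rejection, or a
    bonus token when every draft is accepted.
    """
    survive = list(accumulate(acceptance, mul))
    return 1.0 + sum(survive)

print(round(expected_tokens_per_pass(ACCEPTANCE), 2))
```

Under that assumption the result is about 4.3 tokens per pass. This is not a speedup figure, because it ignores the cost of drafting and verification.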

## Provenance

- **Base model**: [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) (Apache 2.0).
- **Quantization policy**: `mtplx-gdn8-speed4`, an MLX-affine mixed-precision scheme with uniform 8-bit GDN linears, 4-bit MLP, and 4-bit `lm_head`. Norms and the MTP head's `fc` projection stay in BF16.
- **MTP sidecar**: `cyankiwi-calibrated-int4-prequantized`, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
- **Runtime contract**: `mtplx_runtime.json` pins the architecture (`qwen3-next-mtp`), recommended profile, and exactness baseline.
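To illustrate why the policy above spends 8 bits on some layers and 4 on others, here is a minimal pure-Python sketch of group-wise affine quantization (value ≈ `scale * code + bias`). It is not MLX's implementation: the group size of 64 is an assumption, and the helper names are hypothetical.

```python
def quantize_group(values, bits):
    """Affine-quantize one group: code = round((v - min) / scale)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1          # e.g. 15 for 4-bit, 255 for 8-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate values from integer codes."""
    return [c * scale + bias for c in codes]

def max_error(values, bits, group_size=64):
    """Worst absolute round-trip error across all groups (group_size is assumed)."""
    worst = 0.0
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        codes, scale, bias = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, bias)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst
```

The round-trip error per value is bounded by half a quantization step, so moving from 4 to 8 bits shrinks the bound by roughly a factor of 17 (255 levels versus 15) at twice the storage per weight.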

## Limitations

- **MTP drafting requires the MTPLX runtime.** Without it, you can still use the trunk weights with `mlx-lm` for ordinary autoregressive decoding, but the MTP draft path that this checkpoint was built for is unavailable.
- **Apple Silicon focus.** MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
- **Verified architecture is Qwen3-Next.** MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.

## License

This checkpoint is released under the **Apache License 2.0**, matching the Qwen3.6-27B base model.

## Citation

```bibtex
@misc{mtplx2026,
  author       = {Youssof Al},
  title        = {MTPLX: Native MTP speculative decoding on Apple Silicon},
  year         = {2026},
  howpublished = {\url{https://github.com/youssofal/MTPLX}}
}
```

## Links

- **Runtime**: [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)  ·  `pip install mtplx`
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)