---
language:
- en
license: apache-2.0
library_name: mlx
base_model: Qwen/Qwen3.6-27B
tags:
- mlx
- apple-silicon
- speculative-decoding
- mtp
- multi-token-prediction
- qwen3
- qwen
- mtplx
pipeline_tag: text-generation
---
# Qwen3.6-27B MTPLX Optimized
## Run this with MTPLX
**MTPLX** is an MLX-native runtime for Multi-Token-Prediction (MTP) speculative decoding on Apple Silicon. It delivers up to **2.24× faster decode** at realistic coding sampling settings (`temp=0.6 / top_p=0.95 / top_k=20`) using the model's own built-in MTP heads, with no external drafter and no greedy-only shortcut.
```bash
pip install mtplx
mtplx start
```
**Project:** [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)
**Other MTPLX checkpoints:**
- [Qwen3.6-27B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed): 4-bit flagship speed (63 TPS on M5 Max)
- [Qwen3.5-4B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed): small 4-bit speed-test
- [Qwen3.5-4B-Optimized-MTPLX](https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX): small 8-bit
---
This artifact pairs the Qwen3.6-27B trunk, MLX-quantized with MTPLX's `gdn8-speed4` policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms), with a **calibrated INT4 Multi-Token-Prediction sidecar** grafted onto it. The MTP head is what enables *native* speculative decoding: the model drafts its own tokens, with no external draft model required.
MTPLX accepts those draft tokens with **mathematically exact** probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (`temperature=0.6`, `top_p=0.95`, `top_k=20`), not only under greedy decoding.
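The acceptance rule described above can be illustrated with the standard speculative-sampling step from the literature. This is a minimal sketch, not MTPLX's code: the function name `accept_or_resample` and its arguments are hypothetical, and `p_target` is assumed to be the verifier's distribution *after* temperature, `top_p`, and `top_k` have been applied.

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng=random):
    """One speculative-sampling decision for a single drafted token.

    p_target: verifier probabilities over the vocabulary (list of floats).
    q_draft:  drafter probabilities over the same vocabulary.
    Returns (token, accepted). Hypothetical helper for illustration only.
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]

    # Probability-ratio acceptance: keep the draft with probability min(1, p/q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True

    # Residual correction: on rejection, sample from normalize(max(0, p - q)).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (p == q everywhere): fall back to the target itself.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

Drafting from `q_draft` and then applying this step yields samples distributed exactly as `p_target`, which is the sense in which the speculative path is distribution-preserving.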
You can also:
- Inspect the architecture and MTP tensors with any `safetensors` reader.
- Use the trunk weights with [`mlx-lm`](https://github.com/ml-explore/mlx-lm) for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by `mlx-lm`).
- Read the calibration / quantization metadata in `mtplx_runtime.json` and `config.json` to understand the build.
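As one way to inspect the tensors without installing any runtime, the sketch below reads a `.safetensors` header using only the Python standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are hypothetical, and the tensor names inside this checkpoint's files are not assumed here.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def list_tensors(path):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    header = read_safetensors_header(path)
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }


# Example usage (path taken from the component table in this card):
# for name, (dtype, shape) in list_tensors("mtp.safetensors").items():
#     print(name, dtype, shape)
```

Because only the header is read, this works on large shards without loading any weights into memory.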
## What's in this checkpoint
| Component | Format |
| --- | --- |
| Trunk text + vision weights | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms |
| MTP head sidecar (`mtp.safetensors`) | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms |
| Vision encoder (`model-vision-*.safetensors`) | BF16, intact for multimodal use |
| Runtime contract (`mtplx_runtime.json`) | Pins architecture, recommended profile, and exactness baseline |
| Tokenizer + chat template | Qwen3.6 vocabulary (248k tokens) |
The MTP head is grafted from a separately calibrated INT4 sidecar (`Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3`) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms a BF16 MTP head on acceptance at draft depths 2, 3, and 4 (D2/D3/D4) under MTPLX's committed-history cache contract.
## MTP draft acceptance
These numbers describe the **MTP head's draft quality**, a property of the model itself that is independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at `temperature=0.6, top_p=0.95, top_k=20`:
| Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp) |
| --- | --- | --- |
| 1 | **97.62%** | 92.7% |
| 2 | **95.24%** | 77.0% |
| 3 | **88.10%** | 63.0% |
| 4 | **75.61%** | 50.9% |
| 5 | n/a | 43.0% |
Acceptance is higher at every measured depth (1 through 4) than vLLM's MTP-5 implementation on the same Qwen3.6 family, measured on `long_code` 192-token prompts.
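To relate per-position acceptance to tokens emitted per verify step, the sketch below does the arithmetic under two readings of the table, since this card does not say which applies. The "cumulative" reading assumes each figure is the fraction of steps in which that draft position was accepted; the "conditional" reading assumes each figure is conditioned on all earlier positions having been accepted. Both add one token for the verifier's own bonus or corrected token, which is standard in speculative decoding but is an assumption about MTPLX. The function names are hypothetical.

```python
def expected_tokens_cumulative(rates):
    """Expected tokens per verify step if rates are unconditional per position."""
    return 1.0 + sum(rates)


def expected_tokens_conditional(rates):
    """Expected tokens per verify step if each rate is conditional on the
    previous positions having been accepted."""
    total, survive = 1.0, 1.0
    for r in rates:
        survive *= r          # probability the chain is still alive at this depth
        total += survive
    return total


# Figures copied from the table above (depths 1 to 4 only).
this_checkpoint = [0.9762, 0.9524, 0.8810, 0.7561]
vllm_first_four = [0.927, 0.770, 0.630, 0.509]

for label, rates in [("this checkpoint", this_checkpoint),
                     ("vLLM MTP-5, depths 1-4", vllm_first_four)]:
    print(label,
          round(expected_tokens_cumulative(rates), 2),
          round(expected_tokens_conditional(rates), 2))
```

Under the cumulative reading this works out to roughly 4.57 versus 3.84 tokens per verify step at draft depth 4. These are derived estimates, not measured throughput.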
## Provenance
- **Base model**: [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) (Apache 2.0).
- **Quantization policy**: `mtplx-gdn8-speed4`, an MLX-affine mixed-precision scheme with uniform 8-bit GDN linears, 4-bit MLP, 4-bit `lm_head`, and BF16 for the norms and the MTP head's `fc` projection.
- **MTP sidecar**: `cyankiwi-calibrated-int4-prequantized`, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
- **Runtime contract**: `mtplx_runtime.json` pins the architecture (`qwen3-next-mtp`), recommended profile, and exactness baseline.
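For readers unfamiliar with affine quantization, the sketch below shows the general idea of group-wise affine quantization (`w ≈ scale * code + bias`) in plain Python and compares 4-bit and 8-bit round-trip error. It is illustrative only: the min/max scaling, the group size of 64, and all function names are assumptions, and this is neither MLX's kernel nor the calibration procedure used for this checkpoint.

```python
import random


def affine_quantize_group(values, bits):
    """Quantize one group to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo  # lo acts as the per-group bias


def affine_dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes."""
    return [c * scale + bias for c in codes]


def max_roundtrip_error(values, bits, group_size=64):
    """Largest absolute error after quantize -> dequantize, group by group."""
    worst = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        codes, scale, bias = affine_quantize_group(group, bits)
        restored = affine_dequantize_group(codes, scale, bias)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]
print("4-bit max error:", max_roundtrip_error(weights, bits=4))
print("8-bit max error:", max_roundtrip_error(weights, bits=8))
```

The 8-bit error is much smaller than the 4-bit error, which is the trade-off a mixed-precision policy exploits by spending more bits on the layers it treats as more sensitive.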
## Limitations
- **The MTPLX runtime is not yet released.** Without it, you can still use the trunk weights with `mlx-lm` for ordinary autoregressive (AR) decoding, but the MTP draft path that this checkpoint was built for requires MTPLX.
- **Apple Silicon focus.** MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
- **Verified architecture is Qwen3-Next.** MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.
## License
This checkpoint is released under the **Apache License 2.0**, matching the Qwen3.6-27B base model.
## Citation
```bibtex
@misc{mtplx2026,
author = {Youssof Al},
title = {MTPLX: Native MTP speculative decoding on Apple Silicon},
year = {2026},
howpublished = {\url{https://github.com/youssofal/mtplx}}
}
```
## Links
- **Runtime**: [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX) · `pip install mtplx`
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)