---
language:
- en
license: apache-2.0
library_name: mlx
base_model: Qwen/Qwen3.6-27B
tags:
- mlx
- apple-silicon
- speculative-decoding
- mtp
- multi-token-prediction
- qwen3
- qwen
- mtplx
pipeline_tag: text-generation
---

# Qwen3.6-27B MTPLX Optimized

## Run this with MTPLX

**MTPLX** is an MLX-native runtime for native Multi-Token-Prediction speculative decoding on Apple Silicon. Up to **2.24× faster decode** at real coding temperatures (`temp=0.6 / top_p=0.95 / top_k=20`) using the model's own built-in MTP heads: no external drafter, no greedy hack.

```bash
pip install mtplx
mtplx start
```

**Project:** [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)

**Other MTPLX checkpoints:**

- [Qwen3.6-27B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed): 4-bit flagship speed (63 TPS on M5 Max)
- [Qwen3.5-4B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed): small 4-bit speed-test
- [Qwen3.5-4B-Optimized-MTPLX](https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX): small 8-bit

---

This artifact pairs the Qwen3.6-27B trunk, MLX-quantized with MTPLX's `gdn8-speed4` policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms), with a **calibrated INT4 Multi-Token-Prediction sidecar** grafted onto the trunk.

The MTP head is what enables *native* speculative decoding: the model drafts its own tokens, with no external draft model required. MTPLX accepts those draft tokens with **mathematically exact** probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (`temperature=0.6`, `top_p=0.95`, `top_k=20`), not just under greedy decoding.

You can also:

- Inspect the architecture and MTP tensors with any `safetensors` reader.
- Use the trunk weights with [`mlx-lm`](https://github.com/ml-explore/mlx-lm) for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by `mlx-lm`).
- Read the calibration / quantization metadata in `mtplx_runtime.json` and `config.json` to understand the build.

## What's in this checkpoint

| Component | Format |
| --- | --- |
| Trunk text + vision weights | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms |
| MTP head sidecar (`mtp.safetensors`) | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms |
| Vision encoder (`model-vision-*.safetensors`) | BF16, intact for multimodal use |
| Runtime contract (`mtplx_runtime.json`) | Pins architecture, recommended profile, and exactness baseline |
| Tokenizer + chat template | Qwen3.6 vocabulary (248k tokens) |

The MTP head is grafted from a separately calibrated INT4 sidecar (`Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3`) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms BF16 MTP on D2/D3/D4 acceptance under MTPLX's committed-history cache contract.

## MTP draft acceptance

These numbers describe the **MTP head's draft quality**, a property of the model itself that is independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at `temperature=0.6, top_p=0.95, top_k=20`:

| Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp) |
| --- | --- | --- |
| 1 | **97.62%** | 92.7% |
| 2 | **95.24%** | 77.0% |
| 3 | **88.10%** | 63.0% |
| 4 | **75.61%** | 50.9% |
| 5 | n/a | 43.0% |

At every measured depth (1 to 4), acceptance is higher than vLLM's MTP-5 implementation on the same Qwen3.6 family, measured on `long_code` 192-token prompts.

## Provenance

- **Base model**: [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) (Apache 2.0).
- **Quantization policy**: `mtplx-gdn8-speed4` (MLX-affine mixed-precision with uniform 8-bit GDN linears, 4-bit MLP, 4-bit `lm_head`, BF16 norms and the MTP head's `fc` projection).
- **MTP sidecar**: `cyankiwi-calibrated-int4-prequantized`, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
- **Runtime contract**: `mtplx_runtime.json` pins the architecture (`qwen3-next-mtp`), recommended profile, and exactness baseline.

## Limitations

- **The MTPLX runtime is not yet released.** Without it, you can still use the trunk weights with `mlx-lm` for ordinary AR decoding, but the MTP draft path that this checkpoint was built for requires MTPLX.
- **Apple Silicon focus.** MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
- **Verified architecture is Qwen3-Next.** MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.

## License

This checkpoint is released under the **Apache License 2.0**, matching the Qwen3.6-27B base model.

## Citation

```bibtex
@misc{mtplx2026,
  author = {Youssof Al},
  title = {MTPLX: Native MTP speculative decoding on Apple Silicon},
  year = {2026},
  howpublished = {\url{https://github.com/youssofal/mtplx}}
}
```

## Links

- **Runtime**: [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX) · `pip install mtplx`
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
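
## Appendix: acceptance rule sketch

This card describes draft tokens being accepted by probability-ratio acceptance with residual correction. The sketch below illustrates that general rule as it is commonly described for speculative sampling; it is not MTPLX code and does not reflect MTPLX's API. The names `verify_draft_token` and `empirical_distribution`, the NumPy implementation, and the toy distributions are all assumptions made for illustration. It assumes both distributions have already been filtered (temperature, `top_p`, `top_k`) and normalised.

```python
import numpy as np


def verify_draft_token(p_target, q_draft, draft_token, rng):
    """Accept or replace one drafted token (hypothetical helper, not MTPLX API).

    p_target, q_draft: 1-D probability vectors over the same vocabulary,
    already filtered and normalised.
    draft_token: index that was sampled from q_draft.
    Returns (token, accepted).
    """
    p_tok = p_target[draft_token]
    q_tok = q_draft[draft_token]  # > 0 because the token was sampled from q_draft

    # Probability-ratio test: accept with probability min(1, p/q).
    if rng.random() < min(1.0, p_tok / q_tok):
        return int(draft_token), True

    # Residual correction: resample from max(p - q, 0), renormalised.
    residual = np.maximum(p_target - q_draft, 0.0)
    total = residual.sum()
    if total <= 0.0:
        # Only reachable through rounding when p and q coincide; fall back to p.
        return int(rng.choice(len(p_target), p=p_target)), False
    return int(rng.choice(len(residual), p=residual / total)), False


def empirical_distribution(p_target, q_draft, n_samples, seed=0):
    """Draft from q, verify against p, and return observed token frequencies."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(p_target))
    for _ in range(n_samples):
        draft = rng.choice(len(q_draft), p=q_draft)
        token, _accepted = verify_draft_token(p_target, q_draft, draft, rng)
        counts[token] += 1
    return counts / n_samples


if __name__ == "__main__":
    p = np.array([0.6, 0.3, 0.1])  # target distribution (toy values)
    q = np.array([0.3, 0.5, 0.2])  # draft distribution (toy values)
    print(empirical_distribution(p, q, 20_000))  # frequencies should be close to p
```

Under this rule a token `x` is emitted through acceptance with probability `min(p(x), q(x))` and through the residual with probability `max(p(x) - q(x), 0)`, which sum to `p(x)`. That is why the output follows the target distribution even when the draft distribution differs from it.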