Commit ac8e4d3 (verified) · committed by Youssofal · 1 parent: fe68f16

Update MTPLX section: runtime is now released


Replace 'MTPLX — coming soon' placeholder with concise project explainer, install command, GitHub link, and the four MTPLX model cross-links.

Files changed (1)
  1. README.md +19 -3
README.md CHANGED
@@ -18,9 +18,25 @@ pipeline_tag: text-generation
 
  # Qwen3.6-27B MTPLX Optimized
 
- > ## MTPLX — coming soon
- >
- > This checkpoint is the verified default for the upcoming **MTPLX** inference engine — an MLX-native runtime for **native Multi-Token-Prediction speculative decoding on Apple Silicon**. **The runtime is not publicly released yet.** This model card is published in advance so you can review the architecture, MTP head, and quantization decisions while the runtime is finalized. Watch [github.com/youssofal/mtplx](https://github.com/youssofal/mtplx) for the release.
+ ## MTPLX is released
+
+ This checkpoint runs on **MTPLX** — an MLX-native runtime for native Multi-Token-Prediction speculative decoding on Apple Silicon. Up to **2.24× faster decode** at real coding temperatures (`temp=0.6 / top_p=0.95 / top_k=20`), using the model's own built-in MTP heads. No external drafter, no greedy hack, no distribution drift.
+
+ ```bash
+ pip install mtplx
+ mtplx start
+ ```
+
+ **Project:** [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)
+
+ **MTPLX model fleet on Hugging Face:**
+
+ - [Qwen3.6-27B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed) — 4-bit flagship speed (63 TPS on M5 Max)
+ - [Qwen3.6-27B-MTPLX-Optimized](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized) — verified default (GDN8-Speed4 trunk + CyanKiwi INT4 MTP)
+ - [Qwen3.5-4B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed) — small 4-bit speed-test
+ - [Qwen3.5-4B-Optimized-MTPLX](https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX) — small 8-bit
+
+ ---
 
  This artifact pairs the Qwen3.6-27B trunk — MLX-quantized with MTPLX's `gdn8-speed4` policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms) — with a **calibrated INT4 Multi-Token-Prediction sidecar** grafted onto the trunk. The MTP head is what enables *native* speculative decoding: the model drafts its own tokens, with no external draft model required.
 
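Reviewer note: the added text claims "no distribution drift" under sampling, and the closing context line says the MTP head lets the model draft its own tokens. For readers unfamiliar with the idea, the following is a minimal sketch of the standard speculative-sampling accept/reject rule. It is not taken from MTPLX, and whether MTPLX uses exactly this rule is an assumption. The function names and the toy distributions in the usage note are hypothetical and purely illustrative.

```python
import random


def accept_or_resample(p_target, q_draft, drafted, rng):
    """One verification step of standard speculative sampling (illustrative).

    p_target: target-model probabilities over the vocabulary.
    q_draft:  draft-head probabilities over the same vocabulary.
    drafted:  token index that was sampled from q_draft.
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p/q).
    ratio = p_target[drafted] / q_draft[drafted]
    if rng.random() < min(1.0, ratio):
        return drafted, True
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Draft from q_draft, verify against p_target, and tally the outputs."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        drafted = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
        token, _ = accept_or_resample(p_target, q_draft, drafted, rng)
        counts[token] += 1
    return [c / n for c in counts]
```

Calling `empirical_distribution([0.5, 0.3, 0.2], [0.2, 0.2, 0.6], 100_000)` should return frequencies close to the target probabilities even though the draft distribution is quite different. That is the sense in which this scheme leaves the sampled distribution unchanged while still letting cheap drafts be accepted most of the time when draft and target agree.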