Text Generation
MLX
Safetensors
English
qwen3_5
apple-silicon
speculative-decoding
mtp
multi-token-prediction
qwen3
qwen
mtplx
conversational
4-bit precision
Instructions to use Youssofal/Qwen3.6-27B-MTPLX-Optimized with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use Youssofal/Qwen3.6-27B-MTPLX-Optimized with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Youssofal/Qwen3.6-27B-MTPLX-Optimized")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use Youssofal/Qwen3.6-27B-MTPLX-Optimized with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Youssofal/Qwen3.6-27B-MTPLX-Optimized"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Youssofal/Qwen3.6-27B-MTPLX-Optimized" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory: pi
- Hermes Agent
How to use Youssofal/Qwen3.6-27B-MTPLX-Optimized with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Youssofal/Qwen3.6-27B-MTPLX-Optimized"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Youssofal/Qwen3.6-27B-MTPLX-Optimized
Run Hermes
hermes
- MLX LM
How to use Youssofal/Qwen3.6-27B-MTPLX-Optimized with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "Youssofal/Qwen3.6-27B-MTPLX-Optimized"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "Youssofal/Qwen3.6-27B-MTPLX-Optimized"
# Calling the OpenAI-compatible server with curl
# (port 8080 matches the base URLs used in the Pi and Hermes configs above)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Youssofal/Qwen3.6-27B-MTPLX-Optimized",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
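The same chat-completions call can be made from Python with only the standard library. This is a minimal sketch, not part of the model card: the base URL assumes the server is listening on `127.0.0.1:8080` (the address the Pi and Hermes configs above point at), and `build_chat_request` and `chat` are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request

# Assumption: the local server listens here; change it if you passed --host/--port.
BASE_URL = "http://127.0.0.1:8080/v1"
MODEL_ID = "Youssofal/Qwen3.6-27B-MTPLX-Optimized"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.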
Reframe model card: MTPLX coming soon, drop runtime tok/s claims
README.md
CHANGED
|
@@ -18,58 +18,37 @@ pipeline_tag: text-generation
|
|
| 18 |
|
| 19 |
# Qwen3.6-27B MTPLX Optimized
|
| 20 |
|
| 21 |
-
|
|
|
|
|
|
|
| 22 |
|
| 23 |
-
This
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
``
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
```
|
| 33 |
-
|
| 34 |
-
`mtplx quickstart` auto-downloads this model on first run if it isn't already on disk. It also detects existing copies in your local model folders (`~/models/`, LM Studio, HuggingFace cache) and lets you pick from them.
|
| 35 |
-
|
| 36 |
-
To pin this model explicitly:
|
| 37 |
-
|
| 38 |
-
```bash
|
| 39 |
-
mtplx quickstart --model Youssofal/Qwen3.6-27B-MTPLX-Optimized
|
| 40 |
-
```
|
| 41 |
-
|
| 42 |
-
For the OpenAI-compatible API server only (no chat UI):
|
| 43 |
-
|
| 44 |
-
```bash
|
| 45 |
-
mtplx start --port 8000
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
The server then exposes:
|
| 49 |
-
|
| 50 |
-
- `/v1/chat/completions` and `/v1/completions` (OpenAI-compatible)
|
| 51 |
-
- `/v1/messages` (Anthropic-compatible, streaming SSE)
|
| 52 |
-
- `/v1/models`, `/health`, `/metrics`
|
| 53 |
-
|
| 54 |
-
Plug it into Open WebUI, Claude Code, Cline, Continue, or anything that speaks OpenAI.
|
| 55 |
|
| 56 |
## What's in this checkpoint
|
| 57 |
|
| 58 |
| Component | Format |
|
| 59 |
| --- | --- |
|
| 60 |
-
| Trunk text + vision weights | MLX-affine mixed-precision
|
| 61 |
| MTP head sidecar (`mtp.safetensors`) | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms |
|
| 62 |
| Vision encoder (`model-vision-*.safetensors`) | BF16, intact for multimodal use |
|
| 63 |
-
| Runtime contract (`mtplx_runtime.json`) | Pins
|
| 64 |
| Tokenizer + chat template | Qwen3.6 vocabulary (248k tokens) |
|
| 65 |
|
| 66 |
-
The MTP head is grafted from a separately calibrated INT4 sidecar (`Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3`) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms BF16 MTP on D2/D3/D4 acceptance under MTPLX's committed-history cache contract
|
| 67 |
|
| 68 |
-
##
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
-
| Depth |
|
| 73 |
| --- | --- | --- |
|
| 74 |
| 1 | **97.62%** | 92.7% |
|
| 75 |
| 2 | **95.24%** | 77.0% |
|
|
@@ -77,51 +56,27 @@ Per-position acceptance under MTPLX's `linear-gdn-from-conv-tape` verify path, d
|
|
| 77 |
| 4 | **75.61%** | 50.9% |
|
| 78 |
| 5 | n/a | 43.0% |
|
| 79 |
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
## Performance
|
| 83 |
-
|
| 84 |
-
Verified on Apple Silicon **M5 Max, 128 GB unified memory, macOS 26.3.1**, MLX 0.31.2:
|
| 85 |
-
|
| 86 |
-
- **60.169 tok/s** clean-preflight headline on D3 / 192-token long-code prompts at `temp=0.6`: the cold-decode benchmark MTPLX was built around.
|
| 87 |
-
- **2.54×** vs matched no-MTP autoregressive control (23.59 tok/s) on the same prompt and sampler.
|
| 88 |
-
- **`max_diff = 0.0`** on the Phase 0H paged-verifier exactness gate (2048 ctx).
|
| 89 |
-
- **D5 / 512-token long-code**: 43.871 tok/s with [97.09, 91.26, 82.35, 69.61, 58.82] per-position acceptance.
|
| 90 |
-
|
| 91 |
-
Sustained no-fan long-context throughput is **lower** than the cold number: currently ~37 tok/s on 10k-token uncapped generation under macOS power-governor throttling. Closing that gap is the v0.2 deliverable; see the MTPLX repo's roadmap for the kernel-ladder plan. Until then, MTPLX ships three honest profiles:
|
| 92 |
-
|
| 93 |
-
- `Safe`: ~37 tok/s steady, no fan changes, near-flat on long answers. Default for new users.
|
| 94 |
-
- `Fast`: ~60 tok/s on short replies, decays on long ones because fans stay on Apple's default curve.
|
| 95 |
-
- `Max`: Fast + ThermalForge fans pinned at 100%, sustained ~60 tok/s, loud. Opt-in via the wizard.
|
| 96 |
-
|
| 97 |
-
## Hardware compatibility
|
| 98 |
-
|
| 99 |
-
Tested on Apple Silicon M5 Max 128 GB. Should run on any Apple Silicon Mac with ≥ 24 GB unified memory, but only the M5 Max class is verified for the cold 60 tok/s number. M3/M4 numbers will be lower in proportion to memory bandwidth (the M5 Max has 614 GB/s).
|
| 100 |
-
|
| 101 |
-
The model file footprint on disk is ~19 GB; most users will want at least 32 GB unified memory for comfortable batched inference.
|
| 102 |
|
| 103 |
## Provenance
|
| 104 |
|
| 105 |
- **Base model**: [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) (Apache 2.0).
|
| 106 |
- **Quantization policy**: `mtplx-gdn8-speed4`, an MLX-affine mixed-precision scheme with uniform 8-bit GDN linears, 4-bit MLP, 4-bit `lm_head`, BF16 norms and the MTP head's `fc` projection.
|
| 107 |
- **MTP sidecar**: `cyankiwi-calibrated-int4-prequantized`, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
|
| 108 |
-
- **Runtime contract**: `mtplx_runtime.json` pins the architecture (`qwen3-next-mtp`), recommended profile
|
| 109 |
|
| 110 |
## Limitations
|
| 111 |
|
| 112 |
-
- **
|
| 113 |
-
- **
|
| 114 |
-
- **Verified
|
| 115 |
-
- **Not greedy-only**. The whole point is exactness at `temperature=0.6`. If you only ever decode greedily, simpler tools like `mlx-lm` or the upstream Qwen MLX path are smaller dependencies.
|
| 116 |
|
| 117 |
## License
|
| 118 |
|
| 119 |
-
This checkpoint is released under the **Apache License 2.0**, matching the Qwen3.6-27B base model.
|
| 120 |
|
| 121 |
## Citation
|
| 122 |
|
| 123 |
-
If MTPLX or this checkpoint helped your work, please cite:
|
| 124 |
-
|
| 125 |
```bibtex
|
| 126 |
@misc{mtplx2026,
|
| 127 |
author = {Youssof Al},
|
|
@@ -133,5 +88,5 @@ If MTPLX or this checkpoint helped your work, please cite:
|
|
| 133 |
|
| 134 |
## Links
|
| 135 |
|
| 136 |
-
- **Runtime**: [github.com/youssofal/mtplx](https://github.com/youssofal/mtplx)
|
| 137 |
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
|
|
|
|
| 18 |
|
| 19 |
# Qwen3.6-27B MTPLX Optimized
|
| 20 |
|
| 21 |
+
> ## MTPLX: coming soon
|
| 22 |
+
>
|
| 23 |
+
> This checkpoint is the verified default for the upcoming **MTPLX** inference engine, an MLX-native runtime for **native Multi-Token-Prediction speculative decoding on Apple Silicon**. **The runtime is not publicly released yet.** This model card is published in advance so you can review the architecture, MTP head, and quantization decisions while the runtime is finalized. Watch [github.com/youssofal/mtplx](https://github.com/youssofal/mtplx) for the release.
|
| 24 |
|
| 25 |
+
This artifact pairs the Qwen3.6-27B trunk, MLX-quantized with MTPLX's `gdn8-speed4` policy (8-bit Gated Delta Network linears, 4-bit MLP, BF16 norms), with a **calibrated INT4 Multi-Token-Prediction sidecar** grafted onto the trunk. The MTP head is what enables *native* speculative decoding: the model drafts its own tokens, with no external draft model required.
|
| 26 |
|
| 27 |
+
When MTPLX is released, it will accept those draft tokens with **mathematically exact** probability-ratio acceptance and residual correction, so the speculative path stays distribution-preserving at realistic coding settings (`temperature=0.6`, `top_p=0.95`, `top_k=20`), not just under greedy decoding.
|
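For readers unfamiliar with probability-ratio acceptance, the sketch below shows the textbook speculative-sampling rule for a single drafted token. It is an illustration under stated assumptions, not MTPLX's implementation (which is unreleased): `accept_or_resample` is a hypothetical name, and `target_probs` and `draft_probs` are assumed to be the verifier's and the MTP head's distributions after temperature, top-p, and top-k have already been applied.

```python
import random


def accept_or_resample(draft_token, target_probs, draft_probs, rng=random):
    """One probability-ratio acceptance step for a single drafted token.

    target_probs / draft_probs map token -> probability under the verifier (p)
    and the drafter (q). Returns (token, accepted).
    """
    p = target_probs.get(draft_token, 0.0)
    q = draft_probs[draft_token]  # the drafter sampled this token, so q > 0
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True

    # Rejected: sample from the residual max(p - q, 0), renormalized.
    # A rejection implies p < q for the drafted token, so the residual is non-empty.
    residual = {
        tok: max(prob - draft_probs.get(tok, 0.0), 0.0)
        for tok, prob in target_probs.items()
    }
    tokens = list(residual)
    weights = [residual[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False
```

Accepting with probability `min(1, p/q)` and falling back to the residual on rejection makes the emitted token follow `p` exactly, whatever `q` was. That is the sense in which speculative decoding can be distribution-preserving at non-zero temperature.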
| 28 |
|
| 29 |
+
Until then you can still:
|
| 30 |
|
| 31 |
+
- Inspect the architecture and MTP tensors with any `safetensors` reader.
|
| 32 |
+
- Use the trunk weights with [`mlx-lm`](https://github.com/ml-explore/mlx-lm) for ordinary autoregressive decoding (the MTP head is sidecar-only and ignored by `mlx-lm`).
|
| 33 |
+
- Read the calibration / quantization metadata in `mtplx_runtime.json` and `config.json` to understand the build.
|
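As one way to inspect the tensors without any runtime, the sketch below reads only the header of a `.safetensors` file using the Python standard library. It relies on the published safetensors layout (an 8-byte little-endian length followed by a JSON header). The function names are hypothetical, and the path in the usage note assumes you have downloaded `mtp.safetensors` locally.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian unsigned 64-bit
        return json.loads(f.read(header_len))


def summarize(path):
    """Yield (name, dtype, shape) for every tensor listed in the header."""
    header = read_safetensors_header(path)
    for name, info in sorted(header.items()):
        if name == "__metadata__":  # optional free-form string metadata, not a tensor
            continue
        yield name, info["dtype"], info["shape"]
```

For example, `for row in summarize("mtp.safetensors"): print(row)` lists each tensor's name, dtype, and shape, which is enough to see which parts of the MTP head are stored in packed integer form and which are kept in BF16.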
| 34 |
|
| 35 |
## What's in this checkpoint
|
| 36 |
|
| 37 |
| Component | Format |
|
| 38 |
| --- | --- |
|
| 39 |
+
| Trunk text + vision weights | MLX-affine mixed-precision: 8-bit Gated Delta Network linears, 4-bit MLP linears, BF16 norms |
|
| 40 |
| MTP head sidecar (`mtp.safetensors`) | Calibrated CyanKiwi prequantized INT4 with BF16 MTP norms |
|
| 41 |
| Vision encoder (`model-vision-*.safetensors`) | BF16, intact for multimodal use |
|
| 42 |
+
| Runtime contract (`mtplx_runtime.json`) | Pins architecture, recommended profile, and exactness baseline |
|
| 43 |
| Tokenizer + chat template | Qwen3.6 vocabulary (248k tokens) |
|
| 44 |
|
| 45 |
+
The MTP head is grafted from a separately calibrated INT4 sidecar (`Qwen3.6-27B-MTPLX-CyanKiwi-Packed-BF16-INT4-v3`) onto the MTPLX-specific GDN8-Speed4 trunk. This combination outperforms BF16 MTP on D2/D3/D4 acceptance under MTPLX's committed-history cache contract.
|
| 46 |
|
| 47 |
+
## MTP draft acceptance
|
| 48 |
|
| 49 |
+
These numbers describe the **MTP head's draft quality**, a property of the model itself, independent of any runtime's wall-clock throughput. Per-position acceptance under exact probability-ratio sampling at `temperature=0.6, top_p=0.95, top_k=20`:
|
| 50 |
|
| 51 |
+
| Depth | This checkpoint | vLLM MTP-5 oracle (3090, same temp) |
|
| 52 |
| --- | --- | --- |
|
| 53 |
| 1 | **97.62%** | 92.7% |
|
| 54 |
| 2 | **95.24%** | 77.0% |
|
|
|
|
| 56 |
| 4 | **75.61%** | 50.9% |
|
| 57 |
| 5 | n/a | 43.0% |
|
| 58 |
|
| 59 |
+
This checkpoint shows higher acceptance than vLLM's MTP-5 implementation at every depth reported for both, on the same Qwen3.6 family, measured on `long_code` 192-token prompts.
|
| 60 |
|
| 61 |
## Provenance
|
| 62 |
|
| 63 |
- **Base model**: [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) (Apache 2.0).
|
| 64 |
- **Quantization policy**: `mtplx-gdn8-speed4`, an MLX-affine mixed-precision scheme with uniform 8-bit GDN linears, 4-bit MLP, 4-bit `lm_head`, BF16 norms and the MTP head's `fc` projection.
|
| 65 |
- **MTP sidecar**: `cyankiwi-calibrated-int4-prequantized`, calibrated separately with MLX-affine quantization and grafted onto the GDN8-Speed4 trunk.
|
| 66 |
+
- **Runtime contract**: `mtplx_runtime.json` pins the architecture (`qwen3-next-mtp`), recommended profile, and exactness baseline.
|
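To make the 8-bit versus 4-bit trade-off in the quantization policy concrete, here is a minimal pure-Python sketch of group-wise affine quantization, where each group of weights is stored as integer codes plus a scale and a bias. It assumes the conventional min/max scale-and-bias scheme and is not MLX's kernel or the MTPLX build script; the real quantizer works on fixed-size groups and packs the codes, and its exact rounding may differ. Both function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit integer codes."""
    levels = (1 << bits) - 1          # 15 for 4-bit, 255 for 8-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo            # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w_hat = scale * code + bias."""
    return [scale * c + bias for c in codes]
```

The worst-case reconstruction error per weight is half a step (`scale / 2`), and the step shrinks by roughly 17x when moving from 4 to 8 bits (15 versus 255 levels). That is one plausible reading of why a policy would spend 8 bits on the layers it considers most sensitive and 4 bits elsewhere.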
| 67 |
|
| 68 |
## Limitations
|
| 69 |
|
| 70 |
+
- **The MTPLX runtime is not yet released.** Without it, you can still use the trunk weights with `mlx-lm` for ordinary autoregressive (AR) decoding, but the MTP draft path that this checkpoint was built for requires MTPLX.
|
| 71 |
+
- **Apple Silicon focus.** MTPLX targets MLX as its primary backend; CUDA / x86 are not supported.
|
| 72 |
+
- **Verified architecture is Qwen3-Next.** MTPLX recognizes other MTP architectures (DeepSeek V3 MTP, GLM4 MoE MTP, MiMo, MiniMax M2 MTP, etc.) but only Qwen3-Next-class artifacts have a verified runtime contract today.
|
|
|
|
| 73 |
|
| 74 |
## License
|
| 75 |
|
| 76 |
+
This checkpoint is released under the **Apache License 2.0**, matching the Qwen3.6-27B base model.
|
| 77 |
|
| 78 |
## Citation
|
| 79 |
|
|
|
|
|
|
|
| 80 |
```bibtex
|
| 81 |
@misc{mtplx2026,
|
| 82 |
author = {Youssof Al},
|
|
|
|
| 88 |
|
| 89 |
## Links
|
| 90 |
|
| 91 |
+
- **Runtime (coming soon)**: [github.com/youssofal/mtplx](https://github.com/youssofal/mtplx)
|
| 92 |
- **Base model**: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
|