---
license: apache-2.0
library_name: mlx
base_model:
- Qwen/Qwen3.5-4B
- mlx-community/Qwen3.5-4B-MLX-4bit
pipeline_tag: text-generation
tags:
- mlx
- apple-silicon
- speculative-decoding
- qwen
- qwen3
- qwen3_5
- mtp
- mtplx
- local-ai
- q4
---

# Qwen3.5-4B MTPLX Optimized Speed (Q4 trunk)

## Run this with MTPLX

**MTPLX** is an MLX-native runtime for Multi-Token-Prediction (MTP) speculative decoding on Apple Silicon. It reaches up to **2.24× faster decode** at realistic coding sampling settings (`temp=0.6 / top_p=0.95 / top_k=20`) using the model's own built-in MTP heads, with no external drafter and no greedy-decoding workaround.

```bash
pip install mtplx
mtplx start
```

**Project:** [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)

**Other MTPLX checkpoints:**

- [Qwen3.6-27B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed) — 4-bit flagship speed (63 TPS on M5 Max)
- [Qwen3.6-27B-MTPLX-Optimized](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized) — verified default (GDN8-Speed4 trunk + CyanKiwi INT4 MTP)
- [Qwen3.5-4B-Optimized-MTPLX](https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX) — small 8-bit

---

Small speed-test artifact for MTPLX on Apple Silicon.

This model uses the public `mlx-community/Qwen3.5-4B-MLX-4bit` trunk (MLX affine
4-bit) and grafts back the official native MTP head from `Qwen/Qwen3.5-4B`. The
MTP head is stored as `mtp.safetensors`: its layer-0 attention and MLP linear
layers are quantized to 4-bit affine with group size 64, while `mtp.fc` and the
MTP norms stay in BF16.
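To check which tensors in the sidecar are quantized and which stay BF16, the safetensors header can be read without loading any weights. The sketch below relies only on the public safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the helper names are illustrative and are not part of MTPLX.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict.

    The file starts with an unsigned 64-bit little-endian integer giving
    the header size in bytes, followed by that many bytes of JSON.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header):
    """Map each tensor name to its (dtype, shape), sorted by name."""
    return {
        name: (info["dtype"], info["shape"])
        for name, info in sorted(header.items())
    }
```

For a local copy of this repo, `summarize(read_safetensors_header("mtp.safetensors"))` should list every tensor in the MTP head. Under MLX's usual convention, a quantized linear is expected to appear as a packed integer weight plus separate scale and bias tensors, but that naming is an assumption to verify against the actual file.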

## Intended Use

A quick download, load, and speed-path test artifact for MTPLX at 4B scale. Once
the runtime is available, start it with:

```bash
mtplx start
```

Choose `Custom Hugging Face repo`, then enter:

```text
Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed
```

## Artifact Layout

- Trunk: MLX affine 4-bit, group size 64
- MTP sidecar: official Qwen3.5-4B MTP tensors
- MTP sidecar quantization: body-int4
- Runtime contract: `mtplx_runtime.json`
- MTPLX default: depth 2, target temperature 0.6, draft temperature 0.6
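For readers unfamiliar with what draft depth means at non-zero temperature, the following is a minimal sketch of the textbook speculative-sampling verification step. It is not MTPLX's implementation; the function names and the list-based probability format are assumptions made for illustration. With depth 2, the MTP head drafts two tokens and the trunk verifies them, emitting between one and three tokens per step.

```python
import random


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def verify_drafts(draft_tokens, draft_probs, target_probs, rng):
    """Textbook speculative-sampling verification (illustrative only).

    draft_tokens: k drafted token ids (k is the draft depth)
    draft_probs:  k draft distributions q_i, one per drafted position
    target_probs: k + 1 target distributions p_i; the last one is used
                  for the bonus token when every draft is accepted
    Returns the tokens emitted in this step (between 1 and k + 1).
    """
    out = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # All drafts accepted: sample one bonus token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

The accept-or-resample rule keeps the output distribution equal to the target's, which is why sampling settings such as `temperature=0.6` can be used without falling back to greedy decoding.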

## Local Smoke Result

On the local Apple Silicon MTPLX workstation, the depth-2 speed path measured
**120.06 tok/s** versus **108.41 tok/s** for plain autoregressive (AR) decoding,
about 1.11× faster, on the warm-code prompt (`max_tokens=48`,
`temperature=0.6`, `top_p=0.95`, `top_k=20`). Depth 3 is intentionally not the
default for this 4B artifact because it over-drafts the small native-MTP head.
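The two throughput figures above can be turned into a speedup ratio and a per-token latency with a couple of lines. This is only arithmetic on the reported numbers; the helper names are illustrative.

```python
def speedup(spec_tps, ar_tps):
    """Ratio of speculative throughput to autoregressive throughput."""
    return spec_tps / ar_tps


def ms_per_token(tps):
    """Average decode latency per token in milliseconds."""
    return 1000.0 / tps


# Figures reported in the smoke result for this artifact.
depth2_tps = 120.06
ar_tps = 108.41

print(f"speedup: {speedup(depth2_tps, ar_tps):.2f}x")
print(f"depth-2: {ms_per_token(depth2_tps):.2f} ms/token")
print(f"AR:      {ms_per_token(ar_tps):.2f} ms/token")
```

A single 48-token run is a smoke test rather than a benchmark, so the ratio is best read as a sanity check that the speed path is active.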

## Build Stats

```json
{
  "bits": 4,
  "group_size": 64,
  "mode": "affine",
  "output_size_bytes": 86701040,
  "output_tensor_count": 29,
  "policy": "cyankiwi",
  "quantization": "body-int4",
  "quantized_linears": {
    "mtp.layers.0.mlp.down_proj":   {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.mlp.gate_proj":   {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.mlp.up_proj":     {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.k_proj":{"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.o_proj":{"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.q_proj":{"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.v_proj":{"bits": 4, "group_size": 64, "mode": "affine"}
  },
  "source_tensor_count": 15
}
```
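The tensor counts in the build stats can be cross-checked with simple arithmetic. The sketch below assumes each quantized linear is stored as three tensors (packed weight, scales, biases) and that scales and biases are 16-bit; both are assumptions about the storage convention, not facts stated in this card.

```python
def quantized_tensor_count(source_tensors, quantized_linears,
                           tensors_per_quantized=3):
    """Expected output tensor count after quantizing some linears.

    Each quantized linear is assumed to expand from one source tensor
    into `tensors_per_quantized` stored tensors.
    """
    unquantized = source_tensors - quantized_linears
    return unquantized + quantized_linears * tensors_per_quantized


def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Storage cost per weight for group-wise affine quantization.

    Every group of `group_size` weights carries one scale and one bias
    (assumed 16-bit each) on top of the `bits`-wide packed values.
    """
    return bits + (scale_bits + bias_bits) / group_size


# Values taken from the build stats: 15 source tensors, 7 quantized linears.
print(quantized_tensor_count(15, 7))     # 29
print(effective_bits_per_weight(4, 64))  # 4.5
```

Under these assumptions the 15 source tensors become 8 unquantized tensors plus 7 × 3 quantized ones, which matches the reported `output_tensor_count` of 29.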

## Links

- **MTPLX**: [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)  ·  `pip install mtplx`
- **Base model**: [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)
- **Trunk source**: [mlx-community/Qwen3.5-4B-MLX-4bit](https://huggingface.co/mlx-community/Qwen3.5-4B-MLX-4bit)