Instructions to use Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed

Run Hermes

hermes

MLX LM

How to use Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

---
license: apache-2.0
library_name: mlx
base_model:
- Qwen/Qwen3.5-4B
- mlx-community/Qwen3.5-4B-MLX-4bit
pipeline_tag: text-generation
tags:
- mlx
- apple-silicon
- speculative-decoding
- qwen
- qwen3
- qwen3_5
- mtp
- mtplx
- local-ai
- q4
---

# Qwen3.5-4B MTPLX Optimized Speed (Q4 trunk)

## Run this with MTPLX

**MTPLX** is an MLX-native runtime for Multi-Token-Prediction (MTP) speculative decoding on Apple Silicon. It decodes up to **2.24× faster** at typical coding sampling settings (`temp=0.6 / top_p=0.95 / top_k=20`) by drafting with the model's own built-in MTP heads, with no external draft model and without restricting sampling to greedy decoding.

```bash
pip install mtplx
mtplx start
```

**Project:** [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)

**Other MTPLX checkpoints:**

- [Qwen3.6-27B-MTPLX-Optimized-Speed](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed) — 4-bit flagship speed (63 TPS on M5 Max)
- [Qwen3.6-27B-MTPLX-Optimized](https://huggingface.co/Youssofal/Qwen3.6-27B-MTPLX-Optimized) — verified default (GDN8-Speed4 trunk + CyanKiwi INT4 MTP)
- [Qwen3.5-4B-Optimized-MTPLX](https://huggingface.co/Youssofal/Qwen3.5-4B-Optimized-MTPLX) — small 8-bit

---

Small speed-test artifact for MTPLX on Apple Silicon.

This model uses the public `mlx-community/Qwen3.5-4B-MLX-4bit` MLX affine 4-bit
trunk and grafts the official native MTP head from `Qwen/Qwen3.5-4B` back onto
it. The MTP head is stored as `mtp.safetensors`: the attention and MLP linears
of MTP layer 0 (`mtp.layers.0`) are quantized to 4-bit affine with group size
64, while `mtp.fc` and the MTP norms stay in BF16.
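
The 4-bit, group-64 affine scheme described above costs slightly more than 4 bits per weight once per-group metadata is counted. A rough sketch of that arithmetic, assuming each group of 64 weights carries one 16-bit scale and one 16-bit bias (the usual layout for MLX affine quantization; the function names here are illustrative and not part of MTPLX):

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Average storage per weight for group-wise affine quantization.

    Assumption: every group stores one scale and one bias alongside
    its packed weights.
    """
    return bits + (scale_bits + bias_bits) / group_size


def quantized_bytes(num_weights: int, bits: int, group_size: int) -> float:
    """Approximate size in bytes of a quantized weight matrix."""
    return num_weights * effective_bits_per_weight(bits, group_size) / 8


# 4-bit weights, group size 64, as used by the trunk and the MTP body
bpw = effective_bits_per_weight(bits=4, group_size=64)
print(f"effective bits per weight: {bpw}")
```

Under these assumptions the overhead is 32 bits per 64 weights, so a smaller group size would raise the effective bit width and a larger one would lower it.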

## Intended Use

A quick download, load, and speed-path test artifact for MTPLX at 4B scale.
With the MTPLX runtime installed, start it:

```bash
mtplx start
```

Choose `Custom Hugging Face repo`, then enter:

```text
Youssofal/Qwen3.5-4B-MTPLX-Optimized-Speed
```

## Artifact Layout

- Trunk: MLX affine 4-bit, group size 64
- MTP sidecar: official Qwen3.5-4B MTP tensors
- MTP sidecar quantization: body-int4
- Runtime contract: `mtplx_runtime.json`
- MTPLX default: depth 2, target temperature 0.6, draft temperature 0.6
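
To make the depth and temperature defaults above more concrete, here is a generic sketch of the standard speculative-sampling accept/reject rule applied to a short draft. It is not MTPLX's implementation: the function names are hypothetical, and the probability lists stand in for target and draft distributions that have already been temperature-scaled and filtered.

```python
import random


def accept_or_resample(p_target, q_draft, draft_token, rng):
    """Apply the speculative-sampling rule to one drafted token.

    p_target and q_draft are probability lists over the same vocabulary.
    Returns (accepted, token).
    """
    p, q = p_target[draft_token], q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, draft_token
    # On rejection, resample from the residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    weights = residual if sum(residual) > 0 else p_target
    token = rng.choices(range(len(p_target)), weights=weights, k=1)[0]
    return False, token


def verify_draft(target_dists, draft_dists, draft_tokens, rng):
    """Verify a draft of len(draft_tokens) tokens (depth 2 by default here).

    target_dists needs one more entry than the draft: the extra one is
    used to sample a bonus token when every drafted token is accepted.
    """
    emitted = []
    for p, q, tok in zip(target_dists, draft_dists, draft_tokens):
        accepted, token = accept_or_resample(p, q, tok, rng)
        emitted.append(token)
        if not accepted:
            return emitted  # stop at the first rejection
    bonus = target_dists[len(draft_tokens)]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus, k=1)[0])
    return emitted


rng = random.Random(0)
# Toy 2-token vocabulary where the draft matches the target exactly.
target = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
draft = [[0.0, 1.0], [1.0, 0.0]]
print(verify_draft(target, draft, draft_tokens=[1, 0], rng=rng))
```

A verification step emits between 1 and depth + 1 tokens, so a deeper draft only pays off when the later drafted tokens are still accepted often enough to cover their cost.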

## Local Smoke Result

On the local Apple Silicon MTPLX workstation, the depth-2 speed path measured
**120.06 tok/s** versus **108.41 tok/s** for plain autoregressive (AR) decoding,
about 1.11× faster, on the warm-code prompt (`max_tokens=48`,
`temperature=0.6`, `top_p=0.95`, `top_k=20`). Depth 3 is intentionally not the
default for this 4B artifact because it over-drafts the small native MTP head.
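
The two throughput figures above can be turned into a speedup ratio and a per-token latency with a few lines of arithmetic. This is only a convenience sketch; the helper names are made up for illustration.

```python
def speedup(spec_tok_s: float, ar_tok_s: float) -> float:
    """Ratio of speculative throughput to autoregressive throughput."""
    return spec_tok_s / ar_tok_s


def ms_per_token(tok_s: float) -> float:
    """Average decode latency per token in milliseconds."""
    return 1000.0 / tok_s


MTP_TOK_S = 120.06  # depth-2 speed path, from the smoke result
AR_TOK_S = 108.41   # plain autoregressive baseline, same prompt and sampler

ratio = speedup(MTP_TOK_S, AR_TOK_S)
print(f"speedup: {ratio:.3f}x ({(ratio - 1) * 100:.1f}% more tokens per second)")
print(f"latency: {ms_per_token(MTP_TOK_S):.2f} ms/token "
      f"vs {ms_per_token(AR_TOK_S):.2f} ms/token")
```

Because the run is capped at 48 tokens, the ratio is best read as a smoke check that the speed path is active, not as a benchmark.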

## Build Stats

```json
{
  "bits": 4,
  "group_size": 64,
  "mode": "affine",
  "output_size_bytes": 86701040,
  "output_tensor_count": 29,
  "policy": "cyankiwi",
  "quantization": "body-int4",
  "quantized_linears": {
    "mtp.layers.0.mlp.down_proj":   {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.mlp.gate_proj":   {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.mlp.up_proj":     {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.k_proj":{"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.o_proj":{"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.q_proj":{"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.v_proj":{"bits": 4, "group_size": 64, "mode": "affine"}
  },
  "source_tensor_count": 15
}
```
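
As a quick sanity check on these stats, the sketch below parses them and confirms that every listed MTP linear uses the same top-level quantization settings. The JSON is copied inline because the card does not say which file stores it; the variable names are illustrative.

```python
import json

# Build stats copied from the block above.
BUILD_STATS = """
{
  "bits": 4, "group_size": 64, "mode": "affine",
  "output_size_bytes": 86701040, "output_tensor_count": 29,
  "policy": "cyankiwi", "quantization": "body-int4",
  "quantized_linears": {
    "mtp.layers.0.mlp.down_proj":    {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.mlp.gate_proj":    {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.mlp.up_proj":      {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.k_proj": {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.o_proj": {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.q_proj": {"bits": 4, "group_size": 64, "mode": "affine"},
    "mtp.layers.0.self_attn.v_proj": {"bits": 4, "group_size": 64, "mode": "affine"}
  },
  "source_tensor_count": 15
}
"""

stats = json.loads(BUILD_STATS)
linears = stats["quantized_linears"]

# The settings every quantized linear is expected to share.
expected = {key: stats[key] for key in ("bits", "group_size", "mode")}
mismatched = [name for name, cfg in linears.items() if cfg != expected]

size_mb = stats["output_size_bytes"] / 1e6  # decimal megabytes
print(f"{len(linears)} quantized linears, {len(mismatched)} with non-default settings")
print(f"MTP sidecar size: {size_mb:.1f} MB across "
      f"{stats['output_tensor_count']} tensors")
```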

## Links

- **MTPLX**: [github.com/youssofal/MTPLX](https://github.com/youssofal/MTPLX)  ·  `pip install mtplx`
- **Base model**: [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)
- **Trunk source**: [mlx-community/Qwen3.5-4B-MLX-4bit](https://huggingface.co/mlx-community/Qwen3.5-4B-MLX-4bit)