---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
base_model:
- google/gemma-4-31B-it
- google/gemma-4-31B-it-assistant
library_name: mlx
tags:
- mlx
- gemma4
- mtplx
- speculative-decoding
- apple-silicon
- text-generation
pipeline_tag: text-generation
---

# Gemma4 MTPLX Optimized Speed

This is an **MTPLX pair bundle** for Gemma 4 31B speculative decoding on Apple Silicon.

It is not a single vanilla Transformers model directory. The repository contains two MLX-format artifacts:

- `target/` - Gemma 4 31B IT target, MLX Q4 affine group-size 64
- `assistant/` - official Gemma 4 31B assistant drafter, MLX Q6 affine group-size 64

Use this pair when absolute throughput is the priority.

## Source

- Target source: `google/gemma-4-31B-it`
- Target revision: `145dc2508c480a64b47242f160d286cff94a2343`
- Assistant source: `google/gemma-4-31B-it-assistant`
- Assistant revision: `cffbbd2cea41ea56a0fa5b0487e0d445121fd204`

Both artifacts were converted locally to MLX format.

## Quantization

Target:

```text
bits: 4
group_size: 64
mode: affine
```

Assistant:

```text
bits: 6
group_size: 64
mode: affine
```
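As a rough guide to what these settings cost in memory, the sketch below estimates the effective bits per weight for affine group quantization. It assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the packed values. That storage width is an assumption, not something stated in this card, and `effective_bits` is a hypothetical helper name.

```python
def effective_bits(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Approximate bits per weight for affine group quantization.

    Assumes every group stores one scale and one bias, each `meta_bits` wide
    (16 is an assumption, not a value read from this repository).
    """
    return bits + 2 * meta_bits / group_size


target_bpw = effective_bits(bits=4, group_size=64)
assistant_bpw = effective_bits(bits=6, group_size=64)
print(f"target:    {target_bpw} bits/weight")
print(f"assistant: {assistant_bpw} bits/weight")
```

Under these assumptions the target costs about 4.5 bits per weight and the assistant about 6.5. Any layers left unquantized would raise the real figure.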

## MTPLX Usage

After downloading this repository, point MTPLX at the two subdirectories:

```bash
mtplx bench gemma-mtp \
  --target-model ./target \
  --assistant-model ./assistant \
  --prompt-suite mtplx/benchmarks/prompts/flappy.jsonl \
  --max-tokens 1000 \
  --draft-block-sizes 6 \
  --allow-unverified-gemma
```

The Gemma 4 assistant is a separate drafter model. MTPLX uses exact speculative sampling with target verification and residual correction.
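To make "target verification and residual correction" concrete, here is a minimal sketch of the standard speculative sampling rule for a single drafted token. It is not MTPLX's code: `residual` and `verify_token` are hypothetical names, and the toy distributions are made up for illustration. A drafted token is accepted with probability min(1, p/q), where p is the target probability and q the drafter probability. On rejection, a replacement is drawn from the normalized positive part of p - q, which keeps the overall output distributed according to the target.

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q the residual is empty. Rejection cannot happen in that case,
    # so falling back to p is only a guard.
    return [d / total for d in diff] if total > 0 else list(p)


def verify_token(token, p, q, rng):
    """Accept a drafted token with probability min(1, p/q), else resample.

    p: target probabilities, q: drafter probabilities (same vocabulary).
    `token` must have been sampled from q, so q[token] > 0.
    Returns (token, accepted).
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    corrected = rng.choices(range(len(p)), weights=residual(p, q), k=1)[0]
    return corrected, False


rng = random.Random(0)
p = [0.5, 0.3, 0.2]  # toy target distribution (illustrative only)
q = [0.2, 0.3, 0.5]  # toy drafter distribution (illustrative only)
draft = rng.choices(range(len(q)), weights=q, k=1)[0]
print(verify_token(draft, p, q, rng))
```

In a full decoder this check runs over each token of a drafted block in order, and the block is cut at the first rejection. A high acceptance rate means the drafter's distribution is close to the target's.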

## Local Benchmark

Prompt: a request for a single-file HTML5 Canvas Flappy Bird game, with generation capped at 1000 tokens.

Sampler:

```text
temperature: 1.0
top_p: 0.95
top_k: 64
seed: 0
```
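For readers unfamiliar with these settings, the sketch below shows one common way temperature, top-k, and top-p are combined into a filtered distribution. The order used here (temperature, then top-k, then top-p) and the helper name `filter_probs` are assumptions for illustration. MTPLX's own sampler may apply them differently. The seed only fixes the random number generator and does not change the distribution.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return a renormalized distribution after temperature, top-k, and top-p.

    Assumes temperature > 0. Filters are applied in the order listed, which is
    an illustrative choice rather than a documented MTPLX behavior.
    """
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep only the k most probable token ids.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; everything else gets zero mass.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


toy_logits = [math.log(x) for x in (0.6, 0.3, 0.08, 0.02)]  # illustrative only
print(filter_probs(toy_logits))
```

With the toy logits, the least likely token falls outside the 0.95 nucleus and receives zero probability, while the other three are rescaled to sum to one.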

Best observed block size:

```text
block_size: 6
acceptance: 830 / 846 = 98.11%
```

Observed MTPLX throughput samples:

```text
43.56 tok/s
44.46 tok/s
44.07 tok/s
```
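The figures above can be cross-checked with a few lines of arithmetic. The snippet recomputes the acceptance percentage from the reported counts and summarizes the three throughput samples. The variable names are illustrative, and reading 830 / 846 as accepted over proposed draft tokens is an assumption.

```python
from statistics import mean

accepted, proposed = 830, 846      # counts reported for block_size 6
samples = [43.56, 44.46, 44.07]    # tok/s, copied from the list above

acceptance = accepted / proposed
print(f"acceptance: {acceptance:.2%}")
print(f"mean: {mean(samples):.2f} tok/s")
print(f"range: {min(samples)} to {max(samples)} tok/s")
```

The three samples average about 44.03 tok/s and span less than 1 tok/s.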

The bundled benchmark JSON files are in `benchmarks/`.

## Notes

This release is optimized for MTPLX speed experiments. For a higher-precision target, use `Youssofal/Gemma4-MTPLX-Optimized-Quality`.

Gemma 4 is released by Google under the Gemma 4 license terms linked above.