---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
base_model:
- google/gemma-4-31B-it
- google/gemma-4-31B-it-assistant
library_name: mlx
tags:
- mlx
- gemma4
- mtplx
- speculative-decoding
- apple-silicon
- text-generation
pipeline_tag: text-generation
---
# Gemma4 MTPLX Optimized Quality
This is an **MTPLX pair bundle** for Gemma 4 31B speculative decoding on Apple Silicon.
It is not a single vanilla Transformers model directory. The repository contains two MLX-format artifacts:
- `target/` - Gemma 4 31B IT target, MLX Q8 affine group-size 64
- `assistant/` - official Gemma 4 31B assistant drafter, MLX Q8 affine group-size 64
Use this pair when target precision and high acceptance are the priority.
## Source
- Target source: `google/gemma-4-31B-it`
- Target revision: `145dc2508c480a64b47242f160d286cff94a2343`
- Assistant source: `google/gemma-4-31B-it-assistant`
- Assistant revision: `cffbbd2cea41ea56a0fa5b0487e0d445121fd204`
Both artifacts were converted locally to MLX format.
## Quantization
Target:
```text
bits: 8
group_size: 64
mode: affine
```
Assistant:
```text
bits: 8
group_size: 64
mode: affine
```
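As a rough illustration of what these settings mean, the sketch below applies 8-bit affine quantization to groups of 64 weights in plain Python. It is not the MLX implementation; the function names are hypothetical, and the 16-bit size of the per-group scale and bias is an assumption used only to estimate storage cost.

```python
import random

def quantize_group(values, bits=8):
    """Affine-quantize one group of floats; returns (codes, scale, bias)."""
    levels = (1 << bits) - 1             # 255 steps for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0    # constant group: any nonzero scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo              # bias is the group minimum

def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]

def quantize(weights, bits=8, group_size=64):
    """Split a flat weight list into groups and quantize each one."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]

def bits_per_weight(bits=8, group_size=64, meta_bits=16):
    """Storage cost per weight, assuming one scale and one bias of meta_bits each per group."""
    return bits + 2 * meta_bits / group_size

# Small demo on random weights (illustrative values only).
demo = [random.gauss(0.0, 0.02) for _ in range(128)]
codes, scale, bias = quantize(demo)[0]
worst = max(abs(r - w) for r, w in zip(dequantize_group(codes, scale, bias), demo[:64]))
print(f"max round-trip error in first group: {worst:.2e} (scale {scale:.2e})")
print(f"effective bits per weight: {bits_per_weight()}")
```

Under these assumptions the round-trip error of each weight is bounded by half the group's scale, and the per-group metadata adds 0.5 bits on top of the 8 code bits.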
## MTPLX Usage
After downloading this repository, point MTPLX at the two subdirectories:
```bash
mtplx bench gemma-mtp \
--target-model ./target \
--assistant-model ./assistant \
--prompt-suite mtplx/benchmarks/prompts/flappy.jsonl \
--max-tokens 1000 \
--draft-block-sizes 6 \
--allow-unverified-gemma
```
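If the benchmark is launched from a script, a small helper can check the bundle layout before building the same command. This is a minimal sketch: `build_bench_command` is a hypothetical helper, not part of MTPLX, and it only assumes the `target/` and `assistant/` subdirectories described above plus the flags shown in the command.

```python
from pathlib import Path

def build_bench_command(bundle_dir, block_size=6, max_tokens=1000):
    """Return the mtplx argv for a downloaded bundle, failing early if a subdirectory is missing."""
    bundle = Path(bundle_dir)
    target, assistant = bundle / "target", bundle / "assistant"
    missing = [str(p) for p in (target, assistant) if not p.is_dir()]
    if missing:
        raise FileNotFoundError(f"missing bundle subdirectories: {missing}")
    return [
        "mtplx", "bench", "gemma-mtp",
        "--target-model", str(target),
        "--assistant-model", str(assistant),
        "--prompt-suite", "mtplx/benchmarks/prompts/flappy.jsonl",
        "--max-tokens", str(max_tokens),
        "--draft-block-sizes", str(block_size),
        "--allow-unverified-gemma",
    ]

# Usage (requires mtplx on PATH):
#   import subprocess
#   subprocess.run(build_bench_command("."), check=True)
```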
The Gemma 4 assistant is a separate drafter model. MTPLX uses exact speculative sampling with target verification and residual correction.
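
For readers unfamiliar with the terms, the following sketch shows the textbook accept/reject rule for one drafted token in speculative sampling. It is not MTPLX's code; `verify_draft_token` and its arguments are hypothetical names, and the probability lists stand in for the target and assistant distributions at a single position.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng=random):
    """Verify one drafted token; returns (emitted_token, accepted).

    p_target and q_draft are probability lists over the same vocabulary,
    and token must have been sampled from q_draft.
    """
    # Target verification: accept with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Residual correction: on rejection, resample from max(p - q, 0).
    # The residual has positive mass here because rejection implies p(x) < q(x).
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    vocab = range(len(residual))
    return rng.choices(vocab, weights=residual)[0], False

# Demo: emitted tokens follow the target distribution even with a mismatched drafter.
rng = random.Random(0)
p, q = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
counts = [0, 0, 0]
for _ in range(20000):
    drafted = rng.choices(range(3), weights=q)[0]
    emitted, _accepted = verify_draft_token(drafted, p, q, rng)
    counts[emitted] += 1
print([round(c / 20000, 3) for c in counts])
```

The design point is that the emitted tokens are distributed according to the target model alone, so the drafter affects speed (through the acceptance rate) but not output quality.
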
## Local Benchmark
Prompt: single-file HTML5 Canvas Flappy Bird game, capped at 1000 generated tokens.
Sampler:
```text
temperature: 1.0
top_p: 0.95
top_k: 64
seed: 0
```
Best observed block size:
```text
block_size: 6
acceptance: 833 / 835 = 99.76%
speedup_vs_ar: 2.49x
```
Observed MTPLX throughput samples:
```text
34.22 tok/s
32.88 tok/s
33.12 tok/s
```
The bundled benchmark JSON file is in `benchmarks/`.
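
The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers reported in this section; the implied autoregressive baseline is an estimate that assumes the throughput samples come from the same block-size-6 run as the reported speedup, which this card does not state explicitly.

```python
from statistics import mean

accepted, total = 833, 835                  # from "acceptance: 833 / 835"
throughput_samples = [34.22, 32.88, 33.12]  # tok/s, as listed above
speedup_vs_ar = 2.49

acceptance_rate = accepted / total
mean_tps = mean(throughput_samples)
# Assumption: samples and speedup describe the same configuration.
implied_ar_tps = mean_tps / speedup_vs_ar

print(f"acceptance: {acceptance_rate:.2%}")
print(f"mean throughput: {mean_tps:.2f} tok/s")
print(f"implied autoregressive baseline: {implied_ar_tps:.2f} tok/s")
```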
## Notes
This release is optimized for target precision and high draft acceptance. It is not the pair with the highest absolute throughput (tokens per second); if raw speed matters more, use `Youssofal/Gemma4-MTPLX-Optimized-Speed`.
Gemma 4 is released by Google under the Gemma 4 license terms linked above.