---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
base_model:
- google/gemma-4-31B-it
- google/gemma-4-31B-it-assistant
library_name: mlx
tags:
- mlx
- gemma4
- mtplx
- speculative-decoding
- apple-silicon
- text-generation
pipeline_tag: text-generation
---

# Gemma4 MTPLX Optimized Quality

This is an **MTPLX pair bundle** for Gemma 4 31B speculative decoding on Apple Silicon.

It is not a single vanilla Transformers model directory. The repository contains two MLX-format artifacts:

- `target/` - Gemma 4 31B IT target, MLX Q8 affine group-size 64
- `assistant/` - official Gemma 4 31B assistant drafter, MLX Q8 affine group-size 64

Use this pair when target precision and high acceptance are the priority.

## Source

- Target source: `google/gemma-4-31B-it`
- Target revision: `145dc2508c480a64b47242f160d286cff94a2343`
- Assistant source: `google/gemma-4-31B-it-assistant`
- Assistant revision: `cffbbd2cea41ea56a0fa5b0487e0d445121fd204`

Both artifacts were converted locally to MLX format.

## Quantization

Target:

```text
bits: 8
group_size: 64
mode: affine
```

A:

```text
bits: 8
group_size: 64
mode: affine
```
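
To make the `bits: 8`, `group_size: 64`, `mode: affine` settings concrete, here is a minimal pure-Python sketch of affine group quantization. It illustrates the general technique only and is not MLX's actual kernel; the function names and the assumption that each group stores one 16-bit scale and one 16-bit bias are hypothetical.

```python
# Illustrative sketch of 8-bit affine group quantization.
# NOT MLX's implementation; names and the 16-bit metadata size are assumptions.
BITS = 8
GROUP_SIZE = 64


def quantize_group(weights, bits=BITS):
    """Map one group of floats to integer codes in [0, 2**bits - 1] plus (scale, bias)."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # A constant group has zero range; use scale 1.0 to avoid dividing by zero.
    scale = (w_max - w_min) / levels or 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


def bits_per_weight(bits=BITS, group_size=GROUP_SIZE, meta_bits=16):
    """Storage cost per weight, assuming one scale and one bias per group."""
    return bits + 2 * meta_bits / group_size
```

Under the 16-bit metadata assumption, `bits_per_weight()` comes to 8.5 bits per weight, and the reconstruction error of any weight in a group is bounded by half of that group's `scale`.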

## MTPLX Usage

After downloading this repository, point MTPLX at the two subdirectories:

```bash
mtplx bench gemma-mtp \
  --target-model ./target \
  --assistant-model ./assistant \
  --prompt-suite mtplx/benchmarks/prompts/flappy.jsonl \
  --max-tokens 1000 \
  --draft-block-sizes 6 \
  --allow-unverified-gemma
```
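
The command above expects the repository to be on disk already. One possible way to fetch it is `snapshot_download` from the `huggingface_hub` package; the sketch below assumes that package is installed, and the helper names and the `gemma4-mtplx-quality` directory name are hypothetical.

```python
from pathlib import Path

REPO_ID = "Youssofal/Gemma4-MTPLX-Optimized-Quality"


def pair_paths(root):
    """Return the (target, assistant) subdirectories of a downloaded bundle."""
    root = Path(root)
    return root / "target", root / "assistant"


def download_pair(local_dir="gemma4-mtplx-quality"):
    """Download the whole repository, then return the two model directories."""
    # Imported lazily so pair_paths() works without the dependency installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    root = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return pair_paths(root)
```

Calling `download_pair()` fetches both Q8 artifacts, so expect a large download; the two returned paths are what `--target-model` and `--assistant-model` should point at.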
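
The Gemma 4 assistant is a separate drafter model. MTPLX uses exact speculative sampling with target verification and residual correction: the assistant proposes tokens, the target checks them, and any rejected token is resampled so that the output follows the target's distribution.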
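
The verification and residual-correction step can be sketched as follows. This is the textbook speculative-sampling rule for a single drafted token, written in plain Python for illustration; it is not MTPLX's code, and `verify_draft_token` and its arguments are hypothetical names. `p_target` and `q_draft` are assumed to be the two models' probability vectors over the same vocabulary, after any sampler filtering.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Verify one drafted token; return (final_token, accepted).

    Accept `token` with probability min(1, p/q). On rejection, resample
    from the residual distribution proportional to max(0, p - q).
    """
    p, q = p_target[token], q_draft[token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return token, True

    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Only reachable if the draft token had zero draft probability
        # and the distributions otherwise agree; fall back to the target.
        residual = list(p_target)
    corrected = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return corrected, False


# Example: the draft model proposes token 1, the target decides.
rng = random.Random(0)
final_token, accepted = verify_draft_token(1, [0.7, 0.2, 0.1], [0.5, 0.3, 0.2], rng)
```

Tokens produced this way follow `p_target`, whatever `q_draft` is; a closer drafter only raises the acceptance rate, which is where the speedup comes from.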

## Local Benchmark

Prompt: single-file HTML5 Canvas Flappy Bird game, capped at 1000 generated tokens.

Sampler:

```text
temperature: 1.0
top_p: 0.95
top_k: 64
seed: 0
```

Best observed block size:

```text
block_size: 6
acceptance: 833 / 835 = 99.76%
speedup_vs_ar: 2.49x
```

Observed MTPLX throughput samples:

```text
34.22 tok/s
32.88 tok/s
33.12 tok/s
```
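
As a quick sanity check on the figures above, the short script below recomputes the acceptance percentage and summarizes the three throughput samples. The variable names are illustrative, and reading 833 / 835 as accepted over proposed draft tokens is an assumption.

```python
from statistics import mean

# Figures copied from the benchmark section above.
accepted, proposed = 833, 835  # assumed meaning: accepted / proposed draft tokens
samples_tok_s = [34.22, 32.88, 33.12]

acceptance_pct = 100 * accepted / proposed
mean_tok_s = mean(samples_tok_s)
spread_tok_s = max(samples_tok_s) - min(samples_tok_s)

print(f"acceptance: {acceptance_pct:.2f}%")    # acceptance: 99.76%
print(f"mean:       {mean_tok_s:.2f} tok/s")   # mean:       33.41 tok/s
print(f"spread:     {spread_tok_s:.2f} tok/s") # spread:     1.34 tok/s
```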

The bundled benchmark JSON file is in `benchmarks/`.

## Notes

This release is optimized for target precision and high acceptance. It is not the fastest absolute-TPS pair; for speed, use `Youssofal/Gemma4-MTPLX-Optimized-Speed`.

Gemma 4 is released by Google under the Gemma 4 license terms linked above.