Youssofal
/

Gemma4-MTPLX-Optimized-Quality

Text Generation

speculative-decoding

Model card Files Files and versions

Gemma4-MTPLX-Optimized-Quality / README.md

Youssofal's picture

Add files using upload-large-folder tool

77ba53c verified 18 days ago

|

history blame contribute delete

2.31 kB

	---
	license: apache-2.0
	license_link: https://ai.google.dev/gemma/docs/gemma_4_license
	base_model:
	- google/gemma-4-31B-it
	- google/gemma-4-31B-it-assistant
	library_name: mlx
	tags:
	- mlx
	- gemma4
	- mtplx
	- speculative-decoding
	- apple-silicon
	- text-generation
	pipeline_tag: text-generation
	---

	# Gemma4 MTPLX Optimized Quality

	This is an MTPLX pair bundle for Gemma 4 31B speculative decoding on Apple Silicon.

	It is not a single vanilla Transformers model directory. The repository contains two MLX-format artifacts:

	- `target/` - Gemma 4 31B IT target, MLX Q8 affine group-size 64
	- `assistant/` - official Gemma 4 31B assistant drafter, MLX Q8 affine group-size 64

	Use this pair when target precision and high acceptance are the priority.

	## Source

	- Target source: `google/gemma-4-31B-it`
	- Target revision: `145dc2508c480a64b47242f160d286cff94a2343`
	- Assistant source: `google/gemma-4-31B-it-assistant`
	- Assistant revision: `cffbbd2cea41ea56a0fa5b0487e0d445121fd204`

	Both artifacts were converted locally to MLX format.

	## Quantization

	Target:

	```text
	bits: 8
	group_size: 64
	mode: affine
	```

	Assistant:

	```text
	bits: 8
	group_size: 64
	mode: affine
	```

	## MTPLX Usage

	After downloading this repository, point MTPLX at the two subdirectories:

	```bash
	mtplx bench gemma-mtp \
	--target-model ./target \
	--assistant-model ./assistant \
	--prompt-suite mtplx/benchmarks/prompts/flappy.jsonl \
	--max-tokens 1000 \
	--draft-block-sizes 6 \
	--allow-unverified-gemma
	```

	The Gemma 4 assistant is a separate drafter model. MTPLX uses exact speculative sampling with target verification and residual correction.

	## Local Benchmark

	Prompt: single-file HTML5 Canvas Flappy Bird game, capped at 1000 generated tokens.

	Sampler:

	```text
	temperature: 1.0
	top_p: 0.95
	top_k: 64
	seed: 0
	```

	Best observed block size:

	```text
	block_size: 6
	acceptance: 833 / 835 = 99.76%
	speedup_vs_ar: 2.49x
	```

	Observed MTPLX throughput samples:

	```text
	34.22 tok/s
	32.88 tok/s
	33.12 tok/s
	```

	The bundled benchmark JSON file is in `benchmarks/`.

	## Notes

	This release is optimized for target precision and high acceptance. It is not the fastest absolute-TPS pair; for speed, use `Youssofal/Gemma4-MTPLX-Optimized-Speed`.

	Gemma 4 is released by Google under the Gemma 4 license terms linked above.