Gemma-4 Assistant (MTP)
This model was converted to MLX format from google/gemma-4-E2B-it-assistant using mlx-vlm version 0.4.5.
Refer to the original model card for more details on the model.
```shell
pip install -U mlx-vlm
```
Single request (`--draft-block-size 6`):

```shell
python -m mlx_vlm generate \
  --model mlx-community/gemma-4-E2B-it-bf16 \
  --draft-model mlx-community/gemma-4-E2B-it-assistant-bf16 \
  --draft-kind mtp \
  --draft-block-size 6 \
  --prompt "Explain speculative decoding in 3 sentences." \
  --max-tokens 256 --temperature 0
```
Batched generation (`--draft-block-size 3`), using `batch_generate`:

```python
from mlx_vlm.utils import load
from mlx_vlm.generate import batch_generate
from mlx_vlm.speculative.drafters import load_drafter

model, processor = load("mlx-community/gemma-4-E2B-it-bf16")
drafter = load_drafter("mlx-community/gemma-4-E2B-it-assistant-bf16", kind="mtp")

prompts = [
    "Explain speculative decoding in 3 sentences.",
    "What is MLX?",
    "Summarize attention in one paragraph.",
    "List three prime numbers.",
]

response = batch_generate(
    model,
    processor,
    prompts=prompts,
    max_tokens=256,
    temperature=0.0,
    draft_model=drafter,
    draft_kind="mtp",
    draft_block_size=3,
)

for text in response.texts:
    print(text)
```
MLX port of Google's Gemma 4 Multi-Token Prediction (MTP) drafter for speculative decoding. A small 4-layer assistant model drafts several candidate tokens per round; the full Gemma 4 target model verifies them in a single forward pass. At temperature=0, output is byte-identical to generation without a drafter.
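The draft-then-verify loop can be sketched in plain Python as a toy token-level simulation (a sketch only: `speculative_step`, `target_next`, and `draft_next` are illustrative names, not part of the mlx-vlm API):

```python
def speculative_step(target_next, draft_next, context, block_size):
    """One round of greedy speculative decoding over integer token IDs."""
    # 1. The small drafter proposes block_size tokens autoregressively.
    draft = []
    ctx = list(context)
    for _ in range(block_size):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target verifies the whole block: accept the longest prefix it
    #    agrees with, then emit its own token at the first disagreement.
    accepted = []
    ctx = list(context)
    for t in draft:
        want = target_next(ctx)
        if want != t:
            accepted.append(want)  # target's correction replaces the miss
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3. Every drafted token matched: take one bonus token from the target.
    accepted.append(target_next(ctx))
    return accepted
```

Because every emitted token is one the greedy target would have produced anyway, this rule is why output at temperature=0 matches plain decoding exactly; the drafter only changes how many tokens each target pass yields.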
Recommended --draft-block-size: 6 for single requests, 3 for batched generation.
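The tradeoff behind a block-size choice can be illustrated with a back-of-envelope model (an assumption-laden sketch: the per-token acceptance probability `p` and draft-to-target cost ratio `r` are illustrative parameters, not measured numbers for this pairing):

```python
def tokens_per_target_pass(p, k):
    # Expected tokens emitted per round with block size k: the accepted
    # prefix plus one target token, i.e. 1 + p + p^2 + ... + p^k.
    return sum(p ** i for i in range(k + 1))

def speedup(p, k, r):
    # One round costs k draft steps (each r of a target forward pass) plus
    # one target pass, normalized against plain decoding's 1 token per pass.
    return tokens_per_target_pass(p, k) / (k * r + 1)
```

Under this toy model, larger blocks help while acceptance stays high but give diminishing returns as misses waste drafted work; batching shifts the balance further because each verification pass is already amortized across requests, which is consistent with the smaller recommended block size for batched generation.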
See the drafter docs for architecture, supported pairings, performance numbers, and caveats.
Base model: google/gemma-4-E2B-it-assistant