mlx-community/gemma-4-E4B-it-assistant-bf16

This model was converted to MLX format from google/gemma-4-E4B-it-assistant using mlx-vlm version 0.4.5. Refer to the original model card for more details on the model.

Use with mlx

pip install -U mlx-vlm

Single request — --draft-block-size 6:

python -m mlx_vlm generate \
    --model mlx-community/gemma-4-E4B-it-bf16 \
    --draft-model mlx-community/gemma-4-E4B-it-assistant-bf16 \
    --draft-kind mtp \
    --draft-block-size 6 \
    --prompt "Explain speculative decoding in 3 sentences." \
    --max-tokens 256 --temperature 0

Batched generation — --draft-block-size 3, via batch_generate:

from mlx_vlm.utils import load
from mlx_vlm.generate import batch_generate
from mlx_vlm.speculative.drafters import load_drafter

model, processor = load("mlx-community/gemma-4-E4B-it-bf16")
drafter = load_drafter("mlx-community/gemma-4-E4B-it-assistant-bf16", kind="mtp")

prompts = [
    "Explain speculative decoding in 3 sentences.",
    "What is MLX?",
    "Summarize attention in one paragraph.",
    "List three prime numbers.",
]

response = batch_generate(
    model,
    processor,
    prompts=prompts,
    max_tokens=256,
    temperature=0.0,
    draft_model=drafter,
    draft_kind="mtp",
    draft_block_size=3,
)
for text in response.texts:
    print(text)

About

MLX port of Google's Gemma 4 Multi-Token Prediction (MTP) drafter for speculative decoding. A small 4-layer assistant model drafts several candidate tokens per round; the full Gemma 4 target model verifies them in a single forward pass. At temperature=0, output is byte-identical to generation without a drafter.
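The draft-and-verify loop can be sketched in plain Python. The target and drafter below are toy stand-ins (simple deterministic functions, not real models, and not the mlx-vlm implementation); the point is that greedy speculative decoding reproduces plain greedy decoding exactly:

```python
def greedy_decode(target, prompt, max_tokens):
    # Baseline greedy decoding: one target call per emitted token.
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(target(tokens))
    return tokens[len(prompt):]

def speculative_decode(target, drafter, prompt, max_tokens, block_size):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # Drafter proposes up to `block_size` candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(block_size):
            draft.append(drafter(ctx))
            ctx.append(draft[-1])
        # Target verifies the candidates (one forward pass in the real
        # system; simulated per position here) and accepts the longest
        # prefix that matches its own greedy choice.
        accepted = 0
        for i, t in enumerate(draft):
            if target(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        if accepted < len(draft):
            # First mismatch: emit the target's own token instead.
            tokens.append(target(tokens))
    return tokens[len(prompt) : len(prompt) + max_tokens]
```

Because every accepted token equals the target's greedy choice, the output matches greedy_decode token for token; the speedup comes from the target verifying a whole block per forward pass instead of generating one token at a time.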

Recommended --draft-block-size: 6 for single requests, 3 for batched generation.
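One way to see the block-size tradeoff: if the target accepts each drafted token independently with probability p (an i.i.d. simplification, and the p below is illustrative, not a measured number for this model), each round yields one token from the target plus the accepted draft prefix. Larger blocks raise the expected tokens per verify pass, which helps a latency-bound single request; but each extra draft position is more likely to be wasted work, which costs more when a batch already keeps the verify pass busy. A back-of-envelope sketch:

```python
def expected_tokens_per_round(p, k):
    # One token per round comes from the target (correction on mismatch,
    # bonus on full acceptance), plus the expected accepted-prefix length:
    # the first i drafts are all accepted with probability p**i.
    return 1 + sum(p ** i for i in range(1, k + 1))

# Illustrative acceptance probability, not a measured number.
for k in (3, 6):
    print(k, round(expected_tokens_per_round(0.8, k), 3))
```

With p = 0.8, going from a block of 3 to a block of 6 raises the expected yield per verify pass, but the marginal gain per extra draft position shrinks quickly.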

See the drafter docs for architecture, supported pairings, performance numbers, and caveats.

Downloads last month: 2,003
Model size: 78.8M params (Safetensors)
Tensor types: I64 · BF16
Format: MLX