⚠️ Existing MLX quantized Gemma 4 models (mlx-community, unsloth) produce garbage output due to quantizing PLE (Per-Layer Embedding) layers.

by Alkd (MLX Community org), opened

Fixed 2 days ago during the release window: https://github.com/Blaizzy/mlx-vlm/pull/893

This works perfectly fine for me:

```shell
pip install --upgrade mlx-vlm
mlx_vlm.generate --model mlx-community/gemma-4-e2b-it-bf16 --prompt "Who are you?"
```

Works for bf16; quantization initially didn't work for me (see http://github.com/jundot/omlx/issues/534) — never mind, it seems to work now!

MLX Community org

Be careful: it took 50 GB of RAM for me.

MLX Community org

Also converted the REAP-pruned variants (21B and 19B) to PLE-safe MLX 4-bit; both are validated, with vision and multilingual chat working correctly.
REAP-21B (13.9 GB) actually outscores the full 26B 4-bit on several benchmarks despite being smaller.
You can also convert any Gemma 4 variant yourself using the scripts in the repo: just point convert_gemma4.py at the source model. Built on FakeRocket543's PLE-safe quantization work.
Models: https://huggingface.co/ukint-vs/gemma-4-21b-a4b-it-REAP-MLX-4bit
Conversion scripts + benchmark results: https://github.com/ukint-vs/mlx-gemma4-reap
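The core idea behind a "PLE-safe" conversion is to exclude the Per-Layer Embedding modules from quantization so they stay in full precision. A minimal sketch of such a filter is below; the layer-name patterns and the `should_quantize` helper are illustrative assumptions, not the actual API of mlx-vlm or the linked scripts.

```python
# Sketch of a PLE-safe quantization filter (illustrative assumptions only):
# the substring patterns below are guesses at how PLE modules might be named,
# not the real naming scheme used by mlx-vlm or convert_gemma4.py.

PLE_PATTERNS = ("per_layer", "ple")

def should_quantize(layer_path: str) -> bool:
    """Return False for Per-Layer Embedding (PLE) modules so they keep
    full precision; quantizing them is what produces garbage output."""
    name = layer_path.lower()
    return not any(pattern in name for pattern in PLE_PATTERNS)

# Example: filter a (hypothetical) list of layer paths before quantizing.
layers = [
    "model.layers.0.self_attn.q_proj",
    "model.embed_tokens",
    "model.per_layer_embeddings.3",
]
quantizable = [layer for layer in layers if should_quantize(layer)]
print(quantizable)
```

A conversion pipeline would pass a predicate like this to its quantizer so everything except the PLE layers gets the 4-bit treatment.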
