# Gemma 3 4B Instruct – MLX 5-bit (Apple Silicon)
This repository provides a 5-bit quantized MLX version of Gemma 3 4B Instruct, optimized for efficient local inference on Apple Silicon (M1–M5).
## Highlights
- 5-bit quantization (higher fidelity than standard 4-bit at a modest size cost)
- Fast inference on Apple Silicon (MLX backend)
- Good reasoning and instruction-following
- Low memory usage (~2.7 GB peak)
## Performance (M3 Pro, 18 GB RAM)
- Generation speed: ~46 tokens/sec
- Peak memory: ~2.7 GB
- Model size: ~2.5 GB
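The reported model size follows from a simple back-of-envelope estimate: parameter count times bits per weight. The sketch below assumes roughly 4 billion parameters and ignores quantization-scale overhead and any layers kept at higher precision, so it is an approximation, not an exact accounting.

```python
def quantized_size_gb(num_params: float, bits: int) -> float:
    """Rough on-disk size of a quantized model in GB (decimal).

    Ignores group-quantization scales/biases and mixed-precision layers,
    so treat the result as an order-of-magnitude estimate.
    """
    return num_params * bits / 8 / 1e9

# ~4B parameters at 5 bits per weight ≈ 2.5 GB, matching the figure above.
print(quantized_size_gb(4e9, 5))  # 2.5
# For comparison, a 4-bit variant of the same model would be ~2.0 GB.
print(quantized_size_gb(4e9, 4))  # 2.0
```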
## Usage

Install MLX and the mlx-lm package:

```bash
pip install mlx mlx-lm
```

Then generate with the CLI:

```bash
mlx_lm.generate \
  --model ./gemma-3-4b-it-mlx-5bit \
  --prompt "Explain HVAC airflow calculation in simple terms."
```
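The same generation can be driven from Python via the mlx-lm API. This is a minimal sketch: the local model path and `max_tokens` value are assumptions, and running it requires Apple Silicon with the model weights downloaded.

```python
from mlx_lm import load, generate

# Path is an assumption -- point it at wherever you placed the weights.
model, tokenizer = load("./gemma-3-4b-it-mlx-5bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain HVAC airflow calculation in simple terms.",
    max_tokens=256,
)
print(text)
```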
## Example

Input:
A room needs 12000 BTU/h cooling. If a system uses about 400 CFM per ton, estimate the airflow needed.
Expected reasoning:
12000 BTU/h = 1 ton → airflow ≈ 400 CFM
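The reasoning above can be worked as a two-step calculation: convert BTU/h to tons of cooling (12,000 BTU/h per ton), then apply the ~400 CFM-per-ton rule of thumb. A minimal sketch:

```python
def airflow_cfm(btu_per_hour: float, cfm_per_ton: float = 400.0) -> float:
    """Estimate required airflow from a cooling load.

    12,000 BTU/h = 1 ton of cooling; ~400 CFM per ton is a common
    rule of thumb, not a substitute for a proper duct calculation.
    """
    tons = btu_per_hour / 12_000
    return tons * cfm_per_ton

print(airflow_cfm(12_000))  # 400.0 -- matches the expected reasoning above
```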
## License and Attribution
This model is a derivative work based on:
Google Gemma 3 4B Instruct
- Original model: https://huggingface.co/google/gemma-3-4b-it
- License: Gemma Terms of Use (https://ai.google.dev/gemma/terms)
## Modifications

This repository includes the following modifications:

- Converted to MLX format
- Quantized to 5-bit precision
- Optimized for Apple Silicon inference
## Notice
Gemma is provided under and subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms
## Disclaimer
This is an independently modified version of the original model. Google is not responsible for this version or its outputs.
## Credits

- Google – for the Gemma model
- MLX team – for the Apple Silicon inference framework