Gemma 3 4B Instruct – MLX 5-bit (Apple Silicon)

This repository provides a 5-bit quantized MLX version of Gemma 3 4B Instruct, optimized for efficient local inference on Apple Silicon (M1–M5).


Highlights

  • 5-bit quantization (better output quality than standard 4-bit, at a slightly larger footprint)
  • Fast inference on Apple Silicon (MLX backend)
  • Good reasoning and instruction-following
  • Low memory usage (~2.7 GB peak)

Performance (M3 Pro, 18GB RAM)

  • Generation speed: ~46 tokens/sec
  • Peak memory: ~2.7 GB
  • Model size: ~2.5 GB

Usage

Install MLX and the mlx-lm package:

pip install mlx mlx-lm

Run generation from the command line:

mlx_lm.generate \
  --model ./gemma-3-4b-it-mlx-5bit \
  --prompt "Explain HVAC airflow calculation in simple terms."
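If you prefer the Python API, here is a minimal sketch using mlx_lm's load/generate helpers. It assumes the model folder sits at the same local path used in the CLI example above; max_tokens is an arbitrary choice.

from mlx_lm import load, generate

# Load the quantized model and its tokenizer from the local folder
model, tokenizer = load("./gemma-3-4b-it-mlx-5bit")

# Wrap the user message in the chat template Gemma expects
messages = [{"role": "user", "content": "Explain HVAC airflow calculation in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generate up to 256 new tokens and print the result
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)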

Example

Input

A room needs 12000 BTU/h cooling. If a system uses about 400 CFM per ton, estimate the airflow needed.

Expected reasoning:

12000 BTU/h ÷ 12000 BTU/h per ton = 1 ton → 1 ton × 400 CFM/ton ≈ 400 CFM
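For reference, the arithmetic the model is expected to reproduce, written out as a quick check (the 400 CFM-per-ton figure comes from the prompt; 12000 BTU/h per ton is the standard conversion):

# 1 ton of cooling = 12000 BTU/h (standard conversion)
btu_per_hour = 12000
tons = btu_per_hour / 12000   # = 1.0 ton
cfm = tons * 400              # 400 CFM per ton, per the prompt
print(f"{cfm:.0f} CFM")       # -> 400 CFM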

License and Attribution

This model is a derivative work based on:

Google Gemma 3 4B Instruct

Original model: https://huggingface.co/google/gemma-3-4b-it
License: Gemma Terms of Use (https://ai.google.dev/gemma/terms)


Modifications

This repository includes the following modifications:

  • Converted to MLX format
  • Quantized to 5-bit precision
  • Optimized for Apple Silicon inference


Notice

Gemma is provided under and subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms


Disclaimer

This is an independently modified version of the original model. Google is not responsible for this version or its outputs.


Credits

  • Google – for the Gemma model
  • MLX team – for the Apple Silicon inference framework
