Gemma-4-31B-it — 31GB (MLX)

Mixed-precision quantized version of google/gemma-4-31B-it, optimised by baa.ai using a proprietary Black Sheep AI method.

  • Base model: google/gemma-4-31B-it
  • Quantized size: 28.6 GiB (6 shards)
  • Average bits / weight: 8.0
  • Runtime: MLX (Apple Silicon)
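The quoted size follows from the average bits per weight; a back-of-envelope sketch (the small gap to the quoted 28.6 GiB is plausibly metadata and non-quantized tensors, which the card does not break down):

```python
# Rough size check: parameter count times average bits per weight, in GiB.
params = 31e9          # 31B parameters
bits_per_weight = 8.0  # average, per the card

size_bytes = params * bits_per_weight / 8
size_gib = size_bytes / 2**30
print(f"{size_gib:.1f} GiB")  # ~28.9 GiB, close to the quoted 28.6 GiB
```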

Evaluation

All benchmarks were run in thinking mode with the Gemma 4 chat template, using max_tokens=2048 per question.

Headline

| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro (12,032 Q) | 85.2% | 10,247 correct, 10-choice, thinking mode |
| WikiText-2 perplexity | 1362.7 mean / 1444.9 median | 128 sequences × 2048 tokens |

At 85.2%, this matches Google's official score for the base model, reported on its Hugging Face model card.
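Perplexity here is the exponential of the mean per-token negative log-likelihood. A minimal sketch of the formula (the log-probs below are illustrative only, not values from this evaluation run):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs for a short sequence.
logprobs = [-0.5, -2.0, -1.0, -0.1]
print(round(perplexity(logprobs), 2))  # 2.46
```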

MMLU-Pro Per-Category Breakdown

| Category | Correct | Total | Accuracy |
|---|---|---|---|
| math | 1,274 | 1,351 | 94.3% |
| biology | 665 | 717 | 92.7% |
| physics | 1,167 | 1,299 | 89.8% |
| business | 706 | 789 | 89.5% |
| chemistry | 1,012 | 1,132 | 89.4% |
| economics | 752 | 844 | 89.1% |
| computer science | 362 | 410 | 88.3% |
| psychology | 678 | 798 | 85.0% |
| philosophy | 402 | 499 | 80.6% |
| health | 655 | 818 | 80.1% |
| engineering | 771 | 969 | 79.6% |
| other | 721 | 924 | 78.0% |
| history | 290 | 381 | 76.1% |
| law | 792 | 1,101 | 71.9% |

STEM categories (math, biology, physics, chemistry, computer science) all score ≥88% on MMLU-Pro in thinking mode.
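The per-category counts pool to the headline number; a quick consistency check using the figures from the table:

```python
# (correct, total) per MMLU-Pro category, copied from the breakdown above.
categories = {
    "math": (1274, 1351), "biology": (665, 717), "physics": (1167, 1299),
    "business": (706, 789), "chemistry": (1012, 1132), "economics": (752, 844),
    "computer science": (362, 410), "psychology": (678, 798),
    "philosophy": (402, 499), "health": (655, 818), "engineering": (771, 969),
    "other": (721, 924), "history": (290, 381), "law": (792, 1101),
}
correct = sum(c for c, _ in categories.values())
total = sum(t for _, t in categories.values())
print(correct, total, f"{100 * correct / total:.1f}%")  # 10247 12032 85.2%
```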

Usage

from mlx_lm import load, generate
from transformers import AutoTokenizer

model, tokenizer = load("baa-ai/Gemma-4-31B-it-RAM-31GB-MLX")

# Use the Gemma 4 chat template for best results (enables thinking mode):
chat_tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
messages = [{"role": "user", "content": "Explain why the sky appears blue."}]
formatted = chat_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=formatted, max_tokens=2048)
print(response)

Requires a recent mlx-lm build that includes the gemma4 model module:

pip install git+https://github.com/ml-explore/mlx-lm.git

License

Inherits the Gemma Terms of Use from the base model. See the original google/gemma-4-31B-it model card for usage restrictions.


Quantized by baa.ai
