Gemma-4-31B-it — 31GB (MLX)

Mixed-precision quantized version of google/gemma-4-31B-it, optimised by baa.ai using a proprietary Black Sheep AI method.

  • Base model: google/gemma-4-31B-it
  • Quantized size: 28.6 GiB (6 shards)
  • Average bits / weight: 8.0
  • Runtime: MLX (Apple Silicon)
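The quoted size follows from the average bits per weight; a back-of-envelope sketch (the small gap to the quoted 28.6 GiB is plausibly metadata and non-quantized tensors, which the card does not break down):

```python
# Rough size check: parameter count times average bits per weight, in GiB.
params = 31e9          # 31B parameters
bits_per_weight = 8.0  # average, per the card

size_bytes = params * bits_per_weight / 8
size_gib = size_bytes / 2**30
print(f"{size_gib:.1f} GiB")  # ~28.9 GiB, close to the quoted 28.6 GiB
```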

Evaluation

All benchmarks were run in thinking mode with the Gemma 4 chat template, using max_tokens=2048 per question.

Headline

| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro (12,032 Q) | 85.2% | 10,247 correct, 10-choice, thinking mode |
| WikiText-2 perplexity | 1362.7 mean / 1444.9 median | 128 sequences × 2048 tokens |

At 85.2%, this matches Google's official score for the base model, reported on its Hugging Face model card.
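Perplexity here is the exponential of the mean per-token negative log-likelihood. A minimal sketch of the formula (the log-probs below are illustrative only, not values from this evaluation run):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs for a short sequence.
logprobs = [-0.5, -2.0, -1.0, -0.1]
print(round(perplexity(logprobs), 2))  # 2.46
```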

MMLU-Pro Per-Category Breakdown

| Category | Correct | Total | Accuracy |
|---|---|---|---|
| math | 1,274 | 1,351 | 94.3% |
| biology | 665 | 717 | 92.7% |
| physics | 1,167 | 1,299 | 89.8% |
| business | 706 | 789 | 89.5% |
| chemistry | 1,012 | 1,132 | 89.4% |
| economics | 752 | 844 | 89.1% |
| computer science | 362 | 410 | 88.3% |
| psychology | 678 | 798 | 85.0% |
| philosophy | 402 | 499 | 80.6% |
| health | 655 | 818 | 80.1% |
| engineering | 771 | 969 | 79.6% |
| other | 721 | 924 | 78.0% |
| history | 290 | 381 | 76.1% |
| law | 792 | 1,101 | 71.9% |

STEM categories (math, biology, physics, chemistry, computer science) all score ≥88% on MMLU-Pro in thinking mode.
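The per-category counts pool to the headline number; a quick consistency check using the figures from the table:

```python
# (correct, total) per MMLU-Pro category, copied from the breakdown above.
categories = {
    "math": (1274, 1351), "biology": (665, 717), "physics": (1167, 1299),
    "business": (706, 789), "chemistry": (1012, 1132), "economics": (752, 844),
    "computer science": (362, 410), "psychology": (678, 798),
    "philosophy": (402, 499), "health": (655, 818), "engineering": (771, 969),
    "other": (721, 924), "history": (290, 381), "law": (792, 1101),
}
correct = sum(c for c, _ in categories.values())
total = sum(t for _, t in categories.values())
print(correct, total, f"{100 * correct / total:.1f}%")  # 10247 12032 85.2%
```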

Usage

from mlx_lm import load, generate
from transformers import AutoTokenizer

model, tokenizer = load("baa-ai/Gemma-4-31B-it-RAM-31GB-MLX")

# Use the Gemma 4 chat template for best results (enables thinking mode):
chat_tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
messages = [{"role": "user", "content": "Explain why the sky appears blue."}]
formatted = chat_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=formatted, max_tokens=2048)
print(response)

Requires a recent mlx-lm build that includes the gemma4 model module:

pip install git+https://github.com/ml-explore/mlx-lm.git

License

Inherits the Gemma Terms of Use from the base model. See the original google/gemma-4-31B-it model card for usage restrictions.


Quantized by baa.ai
