Gemma 4
RAM-optimised Gemma 4 models by baa.ai
Mixed-precision quantized version of google/gemma-4-31B-it,
optimised by baa.ai using a proprietary Black Sheep AI method.
All benchmarks were run in thinking mode with the Gemma 4 chat template and max_tokens=2048 per question.
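Scoring a 10-choice MMLU-Pro question in thinking mode means extracting the model's final answer letter from a long response. A minimal sketch of such an extractor; the `"answer is (X)"` ending convention and the `extract_choice` helper are illustrative assumptions, not the card's actual evaluation harness:

```python
import re

def extract_choice(response: str, n_choices: int = 10):
    """Pull the final answer letter (A-J for 10 choices) from a
    thinking-mode response.

    Hypothetical convention: the model ends its reasoning with a
    phrase like 'the answer is (C)'. The last such match wins, so
    earlier mentions inside the reasoning trace are ignored.
    """
    letters = "ABCDEFGHIJ"[:n_choices]
    matches = re.findall(r"answer is \(?([%s])\)?" % letters, response)
    return matches[-1] if matches else None

print(extract_choice("Let me think step by step... the answer is (C)."))
```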
| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro (12,032 Q) | 85.2% | 10,247 correct, 10-choice, thinking mode |
| WikiText-2 perplexity | 1362.7 mean / 1444.9 median | 128 sequences × 2048 tokens |
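For reference, per-sequence perplexity is the exponential of the mean per-token negative log-likelihood, and the mean/median columns above aggregate those per-sequence values. A framework-agnostic sketch, assuming you already have per-token NLLs (natural log) for each evaluated sequence:

```python
import math

def perplexity(nlls_per_sequence):
    """Mean and median perplexity across sequences.

    `nlls_per_sequence` is a list of lists: the per-token negative
    log-likelihoods for each evaluated sequence, e.g. 128 sequences
    of 2048 tokens each as in the table above.
    """
    # Per-sequence perplexity: exp of the mean per-token NLL.
    ppls = sorted(
        math.exp(sum(nlls) / len(nlls)) for nlls in nlls_per_sequence
    )
    mean_ppl = sum(ppls) / len(ppls)
    mid = len(ppls) // 2
    median_ppl = (
        ppls[mid] if len(ppls) % 2 else (ppls[mid - 1] + ppls[mid]) / 2
    )
    return mean_ppl, median_ppl

# Toy example: two sequences with constant per-token NLL
print(perplexity([[math.log(4.0)] * 8, [math.log(9.0)] * 8]))  # mean/median ≈ 6.5
```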
At 85.2%, this matches Google's official score for this model, as reported on their Hugging Face model card. Per-category breakdown:
| Category | Correct | Total | Accuracy |
|---|---|---|---|
| math | 1,274 | 1,351 | 94.3% |
| biology | 665 | 717 | 92.7% |
| physics | 1,167 | 1,299 | 89.8% |
| business | 706 | 789 | 89.5% |
| chemistry | 1,012 | 1,132 | 89.4% |
| economics | 752 | 844 | 89.1% |
| computer science | 362 | 410 | 88.3% |
| psychology | 678 | 798 | 85.0% |
| philosophy | 402 | 499 | 80.6% |
| health | 655 | 818 | 80.1% |
| engineering | 771 | 969 | 79.6% |
| other | 721 | 924 | 78.0% |
| history | 290 | 381 | 76.1% |
| law | 792 | 1,101 | 71.9% |
STEM categories (math, biology, physics, chemistry, computer science) all score ≥88% on MMLU-Pro in thinking mode.
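The per-category table above is a straightforward group-by over per-question results. A minimal sketch of that aggregation; the `(category, is_correct)` record format and the `category_accuracy` helper are hypothetical, not the card's actual evaluation code:

```python
from collections import defaultdict

def category_accuracy(results):
    """Aggregate per-question results into per-category accuracy.

    `results` is an iterable of (category, is_correct) pairs,
    one entry per benchmark question. Returns a dict mapping
    category -> (correct, total, accuracy).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {
        cat: (correct[cat], total[cat], correct[cat] / total[cat])
        for cat in total
    }

# Toy example (not real benchmark data):
rows = [("math", True), ("math", True), ("law", False), ("law", True)]
print(category_accuracy(rows))
```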
```python
from mlx_lm import load, generate
from transformers import AutoTokenizer

model, tokenizer = load("baa-ai/Gemma-4-31B-it-RAM-31GB-MLX")

# Use the Gemma 4 chat template for best results (enables thinking mode):
chat_tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
messages = [{"role": "user", "content": "Explain why the sky appears blue."}]
formatted = chat_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=formatted, max_tokens=2048)
print(response)
```
Requires a recent mlx-lm build that includes the gemma4 model module:

```bash
pip install git+https://github.com/ml-explore/mlx-lm.git
```
Inherits the Gemma Terms of Use from the base model. See the original google/gemma-4-31B-it model card for usage restrictions.
- Quantized by: baa.ai
- Quantization: 4-bit
- Base model: google/gemma-4-31B-it