# Gemma 4 31B Mix-Quant Q3 GGUF

## Files

- Model: `gemma-4-31b-mixq-q3.gguf`
- Multimodal projector: `mmproj-gemma-4-31b-f16.gguf`
## What This Is

This is a conservative mixed-quant GGUF build of Gemma 4 31B for llama.cpp. It is not a pure uniform quant. It was built with:

- importance-guided quantization using `imatrix`
- higher precision on more sensitive tensors
- lower precision on less sensitive tensors
## Quantization Type

This release is a GGUF quantized model for llama.cpp.

Quantization family:

- GGUF
- llama.cpp
- imatrix-guided quantization
- mixed tensor quantization (Mix-Quant)
- Q3-centered mixed recipe

This means the model is not stored with one single quant type everywhere. Instead, different tensor groups are assigned different precision levels according to sensitivity.
## Importance Matrix (imatrix)

This build uses llama.cpp importance matrix calibration.

Core formula:

I_j = Σ_t x_{t,j}^2

Where:

- `x_{t,j}` is the activation value of channel `j` at token/sample step `t`
- `I_j` is the accumulated importance score of that channel across the calibration text
Practical meaning:
- channels that activate more often and with larger magnitude get larger importance values
- more important directions are better preserved during quantization
- less important directions can be compressed more aggressively
imatrix does not use benchmark scores directly.
It estimates sensitivity from activations collected on calibration data.
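The accumulation above can be sketched with toy numbers (plain `awk`, not the real llama.cpp implementation). Each input line stands for one token step `t`, each column for one channel `j`:

```shell
# Toy activations, not real model data: two token steps, two channels.
# Accumulate I_j = sum over t of x_{t,j}^2 per column.
printf '1 2\n3 0\n' | awk '
  { for (j = 1; j <= NF; j++) imp[j] += $j * $j }          # square and accumulate
  END { for (j = 1; j <= NF; j++) printf "I_%d = %d\n", j, imp[j] }'
# prints:
# I_1 = 10
# I_2 = 4
```

Channel 1 (values 1 and 3) ends up with importance 1 + 9 = 10; channel 2 (values 2 and 0) with 4. The larger-magnitude, more frequently active channel gets the larger score.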
## Multimodal Support

Yes. Multimodal input remains supported when the model is used together with `mmproj-gemma-4-31b-f16.gguf`.

Notes:

- the text model and the projector are separate files
- the text GGUF alone is not enough for vision input
- for image support, load both the main model and the `mmproj` file
## Quantization Road

The practical road was:

- Start from the original HF Gemma 4 31B model.
- Export the text model to F16 GGUF.
- Build an importance matrix from calibration text.
- Use mixed quantization instead of a pure uniform Q3.
- Test with local smoke checks and benchmark samples.
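The road above roughly maps onto the stock llama.cpp tooling. The calibration text and intermediate file names below are placeholders, not artifacts from this release:

```shell
# 1. Export the HF checkpoint to F16 GGUF (placeholder input directory).
python convert_hf_to_gguf.py --outtype f16 --outfile gemma-4-31b-f16.gguf ./gemma-4-31b-hf

# 2. Collect the importance matrix from calibration text.
llama-imatrix -m gemma-4-31b-f16.gguf -f calibration.txt -o imatrix.dat

# 3. Quantize with the imatrix guiding a Q3-centered recipe.
llama-quantize --imatrix imatrix.dat gemma-4-31b-f16.gguf gemma-4-31b-mixq-q3.gguf Q3_K_M
```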
## Self Tests

Observed checks during the project:

- model loads successfully in `llama.cpp`
- dual GPU CUDA loading works
- multimodal chain remains available when `mmproj` is present

Local benchmark references:

- F16 MMLU-Pro 1400: 0.6421428571
- Q3 MMLU-Pro 1400: 0.6450000000
- Q3 HellaSwag 200: 0.8650000000
- Q3 HellaSwag 1400: 0.8771428571

Evaluation files:

- `evals/mmlu_pro_choose_f16_seed_20260411_n1400.json`
- `evals/mmlu_pro_choose_q3_seed_20260411_n1400.json`
- `evals/hellaswag_choose_q3_seed_20260411_n200_new.json`
- `evals/hellaswag_choose_q3_seed_20260411_n1400_new.json`
Note:

- the strongest directly saved local F16 baseline from this project is MMLU-Pro 1400
- HellaSwag F16 was not preserved as a finalized release reference file
## Environment Build

This line was built and tested with:

- Ubuntu 20.04
- NVIDIA driver 580.95.05
- 2x RTX 5090
- CUDA 12.8 toolkit
- `llama.cpp` build with CUDA support

Minimal environment steps:

- Install the CUDA toolkit.
- Build `llama.cpp` with CUDA enabled.
- Keep `mmproj-gemma-4-31b-f16.gguf` next to the main GGUF if vision is needed.
- Run with `-ngl 999 -fa on`.
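The build step can be sketched as follows, assuming the CUDA toolkit is already installed; `GGML_CUDA` is the current llama.cpp CMake switch for CUDA support:

```shell
# Fetch and build llama.cpp with CUDA offload enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```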
Example text+vision server:

```shell
/home/kasm-user/src/llama.cpp-b8756/build/bin/llama-server \
  -m 'gemma-4-31b-mixq-q3.gguf' \
  --mmproj 'mmproj-gemma-4-31b-f16.gguf' \
  -ngl 999 -fa on --ctx-size 4096 -np 1 --port 18081
```
## Datasets Used In The Project

These datasets were used in the broader tuning and testing workflow around this model line:

- TeichAI/glm-4.7-2000x
- Farseen0/opus-4.6-reasoning-sft-12k
Apache note:
- the project workflow treated these datasets as Apache-licensed sources
- if you redistribute publicly, verify the upstream dataset pages and their source lineage again
## Practical Summary

This version is the larger and more conservative mixed-quant line.

Use this version if you want:

- stronger quality retention than the smaller 14G build
- multimodal compatibility with the same `mmproj`
- a safer baseline for comparison
Base model: `google/gemma-4-31B-it` (repository: `keyuan01/Gemma-4-31B-it-MixQ-Q3-16G-GGUF`)