# Gemma 4 31B Mix-Quant Q3 GGUF

## Files

- Model: `gemma-4-31b-mixq-q3.gguf`
- Multimodal projector: `mmproj-gemma-4-31b-f16.gguf`
## What This Is

This is a conservative mixed-quant GGUF build of Gemma 4 31B for llama.cpp. It is not a pure uniform quant. It was built with:

- importance-guided quantization using `imatrix`
- higher precision on more sensitive tensors
- lower precision on less sensitive tensors
## Quantization Type

This release is a GGUF quantized model for llama.cpp.

Quantization family:

- GGUF
- llama.cpp
- imatrix-guided quantization
- mixed tensor quantization (Mix-Quant)
- Q3-centered mixed recipe

This means the model is not stored with one single quant type everywhere. Instead, different tensor groups are assigned different precision levels according to sensitivity.
## Importance Matrix (imatrix)

This build uses llama.cpp importance matrix calibration.

Core formula:

I_j = Σ_t x_{t,j}^2

Where:

- `x_{t,j}` is the activation value of channel `j` at token/sample step `t`
- `I_j` is the accumulated importance score of that channel across the calibration text
Practical meaning:
- channels that activate more often and with larger magnitude get larger importance values
- more important directions are better preserved during quantization
- less important directions can be compressed more aggressively
imatrix does not use benchmark scores directly.
It estimates sensitivity from activations collected on calibration data.
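The accumulation above can be sketched with toy numbers (plain `awk`, not the real llama.cpp implementation). Each input line stands for one token step `t`, each column for one channel `j`:

```shell
# Toy activations, not real model data: two token steps, two channels.
# Accumulate I_j = sum over t of x_{t,j}^2 per column.
printf '1 2\n3 0\n' | awk '
  { for (j = 1; j <= NF; j++) imp[j] += $j * $j }          # square and accumulate
  END { for (j = 1; j <= NF; j++) printf "I_%d = %d\n", j, imp[j] }'
# prints:
# I_1 = 10
# I_2 = 4
```

Channel 1 (values 1 and 3) ends up with importance 1 + 9 = 10; channel 2 (values 2 and 0) with 4. The larger-magnitude, more frequently active channel gets the larger score.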
## Multimodal Support

Yes. Multimodal input remains supported when the model is used together with `mmproj-gemma-4-31b-f16.gguf`.

Notes:

- the text model and the projector are separate files
- the text GGUF alone is not enough for vision input
- for image support, load both the main model and the `mmproj` file
## Quantization Road

The practical road was:

- Start from the original HF Gemma 4 31B model.
- Export the text model to F16 GGUF.
- Build an importance matrix from calibration text.
- Use mixed quantization instead of a pure uniform Q3.
- Test with local smoke checks and benchmark samples.
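The road above roughly maps onto the stock llama.cpp tooling. The calibration text and intermediate file names below are placeholders, not artifacts from this release:

```shell
# 1. Export the HF checkpoint to F16 GGUF (placeholder input directory).
python convert_hf_to_gguf.py --outtype f16 --outfile gemma-4-31b-f16.gguf ./gemma-4-31b-hf

# 2. Collect the importance matrix from calibration text.
llama-imatrix -m gemma-4-31b-f16.gguf -f calibration.txt -o imatrix.dat

# 3. Quantize with the imatrix guiding a Q3-centered recipe.
llama-quantize --imatrix imatrix.dat gemma-4-31b-f16.gguf gemma-4-31b-mixq-q3.gguf Q3_K_M
```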
## Self Tests

Observed checks during the project:

- model loads successfully in `llama.cpp`
- dual GPU CUDA loading works
- multimodal chain remains available when `mmproj` is present

Local benchmark references:

- F16 MMLU-Pro 1400: 0.6421428571
- Q3 MMLU-Pro 1400: 0.6450000000
- Q3 HellaSwag 200: 0.8650000000
- Q3 HellaSwag 1400: 0.8771428571

Evaluation files:

- `evals/mmlu_pro_choose_f16_seed_20260411_n1400.json`
- `evals/mmlu_pro_choose_q3_seed_20260411_n1400.json`
- `evals/hellaswag_choose_q3_seed_20260411_n200_new.json`
- `evals/hellaswag_choose_q3_seed_20260411_n1400_new.json`
Note:

- the strongest directly saved local F16 baseline from this project is MMLU-Pro 1400
- HellaSwag F16 was not preserved as a finalized release reference file
## Environment Build

This line was built and tested with:

- Ubuntu 20.04
- NVIDIA driver 580.95.05
- 2x RTX 5090
- CUDA 12.8 toolkit
- `llama.cpp` build with CUDA support

Minimal environment steps:

- Install the CUDA toolkit.
- Build `llama.cpp` with CUDA enabled.
- Keep `mmproj-gemma-4-31b-f16.gguf` next to the main GGUF if vision is needed.
- Run with `-ngl 999 -fa on`.
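The build step can be sketched as follows, assuming the CUDA toolkit is already installed; `GGML_CUDA` is the current llama.cpp CMake switch for CUDA support:

```shell
# Fetch and build llama.cpp with CUDA offload enabled.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```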
Example text+vision server:

```shell
/home/kasm-user/src/llama.cpp-b8756/build/bin/llama-server \
  -m 'gemma-4-31b-mixq-q3.gguf' \
  --mmproj 'mmproj-gemma-4-31b-f16.gguf' \
  -ngl 999 -fa on --ctx-size 4096 -np 1 --port 18081
```
## Datasets Used In The Project

These datasets were used in the broader tuning and testing workflow around this model line:

- TeichAI/glm-4.7-2000x
- Farseen0/opus-4.6-reasoning-sft-12k
Apache note:
- the project workflow treated these datasets as Apache-licensed sources
- if you redistribute publicly, verify the upstream dataset pages and their source lineage again
## Practical Summary

This version is the larger and more conservative mixed-quant line.

Use this version if you want:

- stronger quality retention than the smaller 14G build
- multimodal compatibility with the same `mmproj`
- a safer baseline for comparison
Base model: `google/gemma-4-31B-it` (repository: `keyuan01/Gemma-4-31B-it-MixQ-Q3-16G-GGUF`)