Gemma 4 31B Mix-Quant Q3 GGUF

File

  • Model: gemma-4-31b-mixq-q3.gguf
  • Multimodal projector: mmproj-gemma-4-31b-f16.gguf

What This Is

This is a conservative mixed-quant GGUF build of Gemma 4 31B for llama.cpp.

It is not a pure uniform quant. It was built with:

  • importance-guided quantization using imatrix
  • higher precision on more sensitive tensors
  • lower precision on less sensitive tensors

Quantization Type

This release is a GGUF quantized model for llama.cpp.

Quantization family:

  • GGUF
  • llama.cpp
  • imatrix-guided quantization
  • mixed tensor quantization (Mix-Quant)
  • Q3-centered mixed recipe

This means the model is not stored with one single quant type everywhere. Instead, different tensor groups are assigned different precision levels according to sensitivity.
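As a toy sketch of this idea (not the actual recipe used for this release; the tensor names, scores, and split fraction below are hypothetical), sensitivity-driven type assignment can look like:

```python
# Illustrative only: give the most sensitive fraction of tensors a
# higher-precision quant type, and the rest a lower-precision one.

def assign_quant_types(sensitivity, high="Q4_K", low="Q3_K", keep_top=0.25):
    """Map tensor name -> quant type, ranked by descending sensitivity."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = max(1, int(len(ranked) * keep_top))
    return {name: (high if i < n_high else low) for i, name in enumerate(ranked)}

scores = {
    "blk.0.attn_q.weight":  9.1,  # hypothetical importance scores
    "blk.0.ffn_down.weight": 4.2,
    "blk.1.attn_q.weight":  8.7,
    "blk.1.ffn_up.weight":  1.3,
}
plan = assign_quant_types(scores)
print(plan)
```

With these made-up scores, only the single most sensitive tensor gets the higher-precision type; a real recipe would also pin structurally sensitive tensors (embeddings, output head) regardless of rank.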

Importance Matrix (imatrix)

This build uses llama.cpp importance matrix calibration.

Core formula:

I_j = Σ_t x_{t,j}^2

Where:

  • x_{t,j} is the activation value of channel j for token/sample step t
  • I_j is the accumulated importance score of that channel across calibration text

Practical meaning:

  • channels that activate more often and with larger magnitude get larger importance values
  • more important directions are better preserved during quantization
  • less important directions can be compressed more aggressively

imatrix does not use benchmark scores directly. It estimates sensitivity from activations collected on calibration data.
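The accumulation above can be sketched in a few lines. This is a toy illustration of the formula, not the llama.cpp implementation; the activation values are made up.

```python
import numpy as np

# Toy illustration of I_j = sum_t x_{t,j}^2:
# rows are token steps t, columns are channels j of one tensor's input.
activations = np.array([
    [ 0.5, -2.0, 0.1],
    [ 1.5,  0.0, 0.2],
    [-0.5,  1.0, 0.1],
])

# Square each activation and sum over the token axis.
importance = (activations ** 2).sum(axis=0)
print(importance)
```

Channel 2 dominates here (large magnitudes on two steps), while channel 3 barely registers, so a quantizer guided by these scores would spend precision on channel 2 first.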

Multimodal Support

Yes. Multimodal input remains supported when the text model is used together with:

  • mmproj-gemma-4-31b-f16.gguf

Notes:

  • the text model and the projector are separate files
  • the text GGUF alone is not enough for vision input
  • for image support, load both the main model and mmproj

Quantization Pipeline

The practical pipeline was:

  1. Start from the original HF Gemma 4 31B model.
  2. Export the text model to an F16 GGUF.
  3. Build an importance matrix from calibration text.
  4. Use mixed quantization instead of a pure uniform Q3.
  5. Test with local smoke checks and benchmark samples.
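Steps 2-4 roughly correspond to the standard llama.cpp tool chain. This is a sketch, not the exact commands used for this release; paths, calibration file, and output names are placeholders, and flag availability (especially per-tensor overrides) varies between llama.cpp builds.

```shell
# 2. Export the HF checkpoint to an F16 GGUF (placeholder paths).
python convert_hf_to_gguf.py /path/to/gemma-4-31b \
  --outtype f16 --outfile gemma-4-31b-f16.gguf

# 3. Collect the importance matrix from calibration text.
./llama-imatrix -m gemma-4-31b-f16.gguf -f calibration.txt -o imatrix.dat

# 4. Quantize with the imatrix guiding a Q3-centered mixed recipe.
./llama-quantize --imatrix imatrix.dat \
  gemma-4-31b-f16.gguf gemma-4-31b-mixq-q3.gguf Q3_K_M
```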

Self Tests

Observed checks during the project:

  • model loads successfully in llama.cpp
  • dual GPU CUDA loading works
  • multimodal chain remains available when mmproj is present

Local benchmark references:

  • F16 MMLU-Pro 1400: 0.6421428571
  • Q3 MMLU-Pro 1400: 0.6450000000
  • Q3 HellaSwag 200: 0.8650000000
  • Q3 HellaSwag 1400: 0.8771428571
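
The MMLU-Pro gap between the quant and the F16 baseline can be read directly off the numbers above; a quick check of the delta:

```python
# Local benchmark references from this model card.
f16_mmlu = 0.6421428571  # F16 MMLU-Pro 1400
q3_mmlu  = 0.6450000000  # Q3 MMLU-Pro 1400

# Positive means the Q3 build scored above the F16 baseline on this
# sample, i.e. the difference is within sampling noise.
delta = q3_mmlu - f16_mmlu
print(f"Q3 vs F16 MMLU-Pro 1400 delta: {delta:+.4f}")
```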

Evaluation files:

  • evals/mmlu_pro_choose_f16_seed_20260411_n1400.json
  • evals/mmlu_pro_choose_q3_seed_20260411_n1400.json
  • evals/hellaswag_choose_q3_seed_20260411_n200_new.json
  • evals/hellaswag_choose_q3_seed_20260411_n1400_new.json

Note:

  • the only finalized local F16 baseline preserved from this project is the MMLU-Pro 1400 run
  • HellaSwag F16 was not preserved as a finalized release reference file

Environment Build

This model line was built and tested with:

  • Ubuntu 20.04
  • NVIDIA driver 580.95.05
  • 2x RTX 5090
  • CUDA 12.8 toolkit
  • llama.cpp build with CUDA support

Minimal environment steps:

  1. Install CUDA toolkit.
  2. Build llama.cpp with CUDA enabled.
  3. Keep mmproj-gemma-4-31b-f16.gguf next to the main GGUF if vision is needed.
  4. Run with -ngl 999 -fa on to offload all layers to the GPUs and enable flash attention.

Example text+vision server:

/home/kasm-user/src/llama.cpp-b8756/build/bin/llama-server \
  -m 'gemma-4-31b-mixq-q3.gguf' \
  --mmproj 'mmproj-gemma-4-31b-f16.gguf' \
  -ngl 999 -fa on --ctx-size 4096 -np 1 --port 18081
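
Once the server above is up, it exposes an OpenAI-compatible /v1/chat/completions endpoint. A sketch of a vision request payload (the image bytes here are a placeholder; in practice you would read and base64-encode a real file):

```python
import base64
import json

# Placeholder bytes; normally: open("example.png", "rb").read()
image_b64 = base64.b64encode(b"\x89PNG...").decode()

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 128,
}
print(json.dumps(payload, indent=2))
# POST this JSON to http://localhost:18081/v1/chat/completions
```

Vision requests only work when the server was started with --mmproj as shown above; the text GGUF alone will reject image content.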

Datasets Used In The Project

These datasets were used in the broader tuning and testing workflow around this model line:

  • TeichAI/glm-4.7-2000x
  • Farseen0/opus-4.6-reasoning-sft-12k

Apache note:

  • the project workflow treated these datasets as Apache-licensed sources
  • if you redistribute publicly, verify the upstream dataset pages and their source lineage again

Practical Summary

This version is the larger and more conservative mixed-quant line.

Use this version if you want:

  • stronger quality retention than the smaller 14G build
  • multimodal compatibility with the same mmproj
  • a safer baseline for comparison