# Gemma-4-31B-it GGUF Quantization for AMD RDNA3 (gfx1100)
This is a GGUF quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for AMD Radeon RX 7900 XTX (gfx1100 / RDNA3) GPUs using llama.cpp.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Format | GGUF V3 (latest) |
| Quantization Type | multiple |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 262K tokens |
| Vision Tower | SigLIP (27 layers, F16/Q8_0 mmproj) |
| Quantization Library | llama.cpp b8703 |
| ROCm Version | 6.4.4 |
| Hardware Target | AMD Radeon RX 7900 XTX (gfx1100, 24GB VRAM) |
## Available Variants

This repository contains quantization variants optimized for different performance/quality tradeoffs:

### 4-bit Variant L (Quality-Optimized) — `gemma-4-31B-it-IQ4_NL_L_AMD.gguf`
| Metric | Value |
|---|---|
| File Size | ~20.11 GB |
| bpw | ~4.5 bits per weight |
| KLD vs F16 | 0.647 (lower is better) |
| PPL (wikitext-2) | ~12565 |
| Quant Strategy | IQ4_NL |
**Best for:** maximum quality where VRAM allows (~20GB): a single 24GB GPU with reduced context, or a 2-GPU setup.
### Variant M (Size-Optimized) — `gemma-4-31B-it-IQ4_NL_M_AMD.gguf`
| Metric | Value |
|---|---|
| File Size | ~18.70 GB |
| bpw | ~4.25 bits per weight |
| KLD vs F16 | 0.684 (lower is better) |
| PPL (wikitext-2) | ~12197 |
| Quant Strategy | IQ4_NL |
**Best for:** a single 7900 XTX (24GB) with headroom for the KV cache.
## Vision Projector Files

| File | Size | Description |
|---|---|---|
| `mmproj-gemma-4-31B-it-f16.gguf` | ~1.12 GB | Vision encoder projector (F16) |
| `mmproj-gemma-4-31B-it-q8_0.gguf` | ~0.75 GB | Vision encoder projector (Q8_0) |
## Performance Benchmarks

### 2x AMD Radeon RX 7900 XTX (48GB Total VRAM)

#### Variant L (20.11 GB) — llama.cpp b8703, `-fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 256`
| Test | Tokens/sec |
|---|---|
| pp1024 | 1381.37 ± 2.36 |
| pp2048 | 1494.88 ± 1.42 |
| pp4096 | 1530.94 ± 0.97 |
| pp8192 | 1493.17 ± 0.76 |
| pp16384 | 1377.77 ± 0.66 |
| pp32768 | 1177.68 ± 0.54 |
| tg128 | 24.14 ± 0.01 |
| tg512 | 23.82 ± 0.01 |
| tg1024 | 23.49 ± 0.00 |
#### Variant M (18.68 GB) — same configuration
| Test | Tokens/sec |
|---|---|
| pp1024 | 1361.17 ± 1.42 |
| pp2048 | 1473.53 ± 1.80 |
| pp4096 | 1510.83 ± 1.21 |
| pp8192 | 1476.50 ± 0.61 |
| pp16384 | 1365.21 ± 0.38 |
| pp32768 | 1169.50 ± 0.16 |
| tg128 | 25.05 ± 0.01 |
| tg512 | 24.66 ± 0.02 |
| tg1024 | 24.30 ± 0.01 |
### Single AMD Radeon RX 7900 XTX (24GB VRAM)

#### Variant L (20.11 GB) — `-fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024`
| Test | Tokens/sec |
|---|---|
| pp1024 | 962.09 ± 0.75 |
| pp2048 | 925.67 ± 0.24 |
| pp4096 | 886.89 ± 0.39 |
| pp8192 | 834.71 ± 0.27 |
| pp16384 | 752.85 ± 0.22 |
| pp32768 | OOM crash |
| tg128 | 26.73 ± 0.02 |
| tg512 | 26.33 ± 0.01 |
| tg1024 | 25.94 ± 0.00 |
#### Variant M (18.68 GB) — same configuration
| Test | Tokens/sec |
|---|---|
| pp1024 | 955.37 ± 0.72 |
| pp2048 | 919.04 ± 0.27 |
| pp4096 | 880.58 ± 0.30 |
| pp8192 | 829.50 ± 0.12 |
| pp16384 | 749.10 ± 0.16 |
| pp32768 | 629.07 ± 0.17 |
| tg128 | 27.98 ± 0.01 |
| tg512 | 27.53 ± 0.01 |
| tg1024 | 27.08 ± 0.00 |
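The tables above can be reproduced with `llama-bench` from the same llama.cpp build. A sketch (single-GPU configuration; use `-ub 256` for the 2-GPU runs, and note it needs the model file locally):

```shell
# Sketch: reproduce the pp/tg benchmark sweep above with llama-bench
./llama-bench -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
  -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024 \
  -p 1024,2048,4096,8192,16384,32768 \
  -n 128,512,1024
```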
## Quality Metrics

Evaluated on wikitext-2-raw against an F16 baseline:

| Variant | PPL | KLD vs F16 | Notes |
|---|---|---|---|
| F16 Baseline | 12250.49 | 0.000 | Reference |
| Variant L | ~12565 | 0.647 | Q8_0 on more attn/ffn_down layers |
| Variant M | ~12197 | 0.684 | Q8_0 on fewer layers; ffn_down in Q4_1 |

KLD (Kullback–Leibler divergence) measures how far the quantized model's output distribution diverges from the F16 baseline; lower is better.
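As a toy illustration of the metric (not the evaluation script used for the numbers above), the KL divergence between two discrete distributions P and Q is the sum of p·ln(p/q) over all outcomes; a minimal sketch with awk:

```shell
# Toy KL divergence D(P||Q) for two hypothetical 3-outcome distributions.
# log() in awk is the natural logarithm, matching the definition above.
kld=$(awk 'BEGIN {
  split("0.5 0.3 0.2", p, " ");   # distribution P
  split("0.4 0.4 0.2", q, " ");   # distribution Q
  d = 0;
  for (i = 1; i <= 3; i++) d += p[i] * log(p[i] / q[i]);
  printf "%.4f", d;
}')
echo "$kld"
```

Identical distributions give a KLD of exactly 0, which is why the F16 baseline row reads 0.000.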
## Usage with llama.cpp
### Basic Chat (Single GPU)

```shell
# Load model with full context on a single 7900 XTX
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
  -c 32768 -ub 1024 -t 5
```
### Multi-GPU (2x 7900 XTX)

```shell
# Split across 2 GPUs; the smaller micro-batch (-ub 256) gives higher
# prompt-processing throughput in the multi-GPU configuration
./llama-server -m gemma-4-31B-it-IQ4_NL_L_AMD.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
  -c 32768 -ub 256 -b 1024 -t 5 \
  -ts 1,1
```
### API Server

```shell
# Start OpenAI-compatible API
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
  -c 32768 -ub 1024 -t 5 \
  --port 8080
```
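Once the server is up, it can be queried like any OpenAI-compatible endpoint. A sketch (port per the command above; llama-server serves whatever model it loaded regardless of the `model` field):

```shell
# Query the chat completions endpoint of the server started above
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128
  }'
```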
## Hardware Requirements

| Variant | Minimum VRAM | Recommended |
|---|---|---|
| Variant L (20GB) | 24GB (single GPU, limited context) | 48GB (2x 7900 XTX) |
| Variant M (18GB) | 24GB (single 7900 XTX) | 48GB (2x 7900 XTX) |

Note: Both variants were tested on AMD ROCm 6.4.4 with llama.cpp b8703+. NVIDIA GPUs should also work, but they were not the focus of testing.
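For RDNA3 cards other than the 7900 XTX that ROCm does not ship kernels for (e.g. gfx1101/gfx1102), a commonly used workaround — not something validated by this repository — is to override the reported GPU architecture so gfx1100 builds are used:

```shell
# Assumption: a non-gfx1100 RDNA3 card. Force ROCm to treat it as gfx1100.
# Not needed on the 7900 XTX itself. Set before launching llama-server.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
echo "$HSA_OVERRIDE_GFX_VERSION"
```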
## Quantization Method

### Variant L (IQ4_NL_L) — Tensor Composition

| Tensor Group | Quantization | Layers |
|---|---|---|
| `attn_k` + `attn_output` | IQ4_NL | All 60 layers |
| `attn_q` | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| `attn_q` | IQ4_NL | Remaining layers |
| `attn_v` | Q8_0 | Layers 0-4, 6, 9, 13, 16, 20, 24, 27, 31, 34, 38, 42, 45, 49, 51, 52, 54-58 (23 layers) |
| `attn_v` | IQ4_NL | Remaining layers |
| `ffn_down` | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| `ffn_down` | IQ4_NL | Remaining layers |
| `ffn_gate` + `ffn_up` | Q4_1 | All 60 layers |
| `token_embd` | Q8_0 | - |
| `output_norm` | F32 | - |
### Variant M (IQ4_NL_M) — Tensor Composition

| Tensor Group | Quantization | Layers |
|---|---|---|
| `attn_k` + `attn_output` | IQ4_NL | All 60 layers |
| `attn_q` | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| `attn_q` | IQ4_NL | Remaining layers |
| `attn_v` | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| `attn_v` | IQ4_NL | Remaining layers |
| `ffn_down` | Q4_1 | All 60 layers |
| `ffn_gate` + `ffn_up` | Q4_1 | All 60 layers |
| `token_embd` | Q8_0 | - |
| `output_norm` | F32 | - |
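Mixed compositions like the tables above can be produced with per-tensor overrides in `llama-quantize`. A sketch (hypothetical F16 input filename; the `--tensor-type` override requires a recent llama.cpp build):

```shell
# Sketch: quantize to an IQ4_NL base with per-tensor-group overrides,
# guided by the importance matrix shipped in this repository
./llama-quantize --imatrix imatrix.dat \
  --tensor-type ffn_gate=Q4_1 --tensor-type ffn_up=Q4_1 \
  gemma-4-31B-it-f16.gguf gemma-4-31B-it-IQ4_NL_M_AMD.gguf IQ4_NL
```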
## Calibration
- Dataset: Mixed task-specific calibration (Magicoder + APIGen-MT + When2Call)
- Method: imatrix with task-formatted data
- Group Size: 128
- imatrix entries: 410
- imatrix chunks: 100
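The importance matrix is generated with llama.cpp's `llama-imatrix` tool. A sketch under the settings listed above (the calibration text filename is hypothetical):

```shell
# Sketch: compute the importance matrix over the mixed calibration corpus
./llama-imatrix -m gemma-4-31B-it-f16.gguf \
  -f calibration-mixed.txt -o imatrix.dat --chunks 100
```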
## Files in This Repository

| File | Description |
|---|---|
| `gemma-4-31B-it-IQ4_NL_L_AMD.gguf` | Variant L (20GB, Q8_0 on select attn/FFN layers) |
| `gemma-4-31B-it-IQ4_NL_M_AMD.gguf` | Variant M (18GB, mostly IQ4_NL) |
| `mmproj-gemma-4-31B-it-f16.gguf` | Vision projector (F16) |
| `mmproj-gemma-4-31B-it-q8_0.gguf` | Vision projector (Q8_0) |
| `imatrix.dat` | Calibration data (optional) |
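Individual files can be fetched without cloning the whole repository; a sketch using the `huggingface_hub` CLI (repo id taken from the citation URL below):

```shell
# Sketch: download Variant M plus the F16 vision projector
huggingface-cli download ebircak/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-IQ4_NL_M_AMD.gguf mmproj-gemma-4-31B-it-f16.gguf \
  --local-dir .
```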
## License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{gemma4-31b-gguf-amd,
  title        = {Gemma-4-31B-it GGUF Quantization for AMD RDNA3},
  author       = {ebircak},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-GGUF}},
  note         = {Quantized with llama.cpp b8703 for AMD gfx1100}
}
```
## Disclaimer

This is a community quantization of the Google Gemma-4-31B-it model, optimized for AMD RDNA3 GPUs. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.

These builds are specifically optimized for the AMD Radeon RX 7900 XTX (gfx1100) under ROCm. They will also run on other AMD or NVIDIA GPUs without modification, but other GGUF quantizations may offer a better size/KLD tradeoff on that hardware.
## Model Card Version

This model card follows the Model Cards for Model Reporting standard.

- Original Model: google/gemma-4-31B-it
- Quantization Tool: llama.cpp
- Quantization Format: GGUF V3
- Target Hardware: AMD Radeon RX 7900 XTX (gfx1100, RDNA3)