Gemma-4-31B-it GGUF Quantization for AMD RDNA3 (gfx1100)

This is a GGUF quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for AMD Radeon RX 7900 XTX (gfx1100 / RDNA3) GPUs using llama.cpp.

Model Details

| Property | Value |
|----------|-------|
| Base Model | google/gemma-4-31B-it |
| Quantization Format | GGUF V3 (latest) |
| Quantization Type | multiple |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 262K tokens |
| Vision Tower | SigLIP (27 layers, F16/Q8_0 mmproj) |
| Quantization Library | llama.cpp b8703 |
| ROCm Version | 6.4.4 |
| Hardware Target | AMD Radeon RX 7900 XTX (gfx1100, 24GB VRAM) |

Available Variants

This repository contains quantization variants optimized for different performance/quality tradeoffs:

Variant L (Quality-Optimized) — gemma-4-31B-it-IQ4_NL_L_AMD.gguf

| Metric | Value |
|--------|-------|
| File Size | ~20.11 GB |
| bpw | ~4.5 bits per weight |
| KLD vs F16 | 0.647 (lower is better) |
| PPL (wikitext-2) | ~12565 |
| Quant Strategy | IQ4_NL |

Best for: maximum-quality deployment where ~20 GB of VRAM is available for the weights alone, i.e. a single GPU with a reduced context window or a 2-GPU setup.

Variant M (Size-Optimized) — gemma-4-31B-it-IQ4_NL_M_AMD.gguf

| Metric | Value |
|--------|-------|
| File Size | ~18.70 GB |
| bpw | ~4.25 bits per weight |
| KLD vs F16 | 0.684 (lower is better) |
| PPL (wikitext-2) | ~12197 |
| Quant Strategy | IQ4_NL |

Best for: Single 7900 XTX (24GB) with headroom for KV cache.

Vision Projector Files

| File | Size | Description |
|------|------|-------------|
| mmproj-gemma-4-31B-it-f16.gguf | ~1.12 GB | Vision encoder projector (F16) |
| mmproj-gemma-4-31B-it-q8_0.gguf | ~0.75 GB | Vision encoder projector (Q8_0) |

Performance Benchmarks

2x AMD Radeon RX 7900 XTX (48GB Total VRAM)

Variant L (20.11 GB) — llama.cpp-b8703, -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 256

| Test | Tokens/sec |
|------|------------|
| pp1024 | 1381.37 ± 2.36 |
| pp2048 | 1494.88 ± 1.42 |
| pp4096 | 1530.94 ± 0.97 |
| pp8192 | 1493.17 ± 0.76 |
| pp16384 | 1377.77 ± 0.66 |
| pp32768 | 1177.68 ± 0.54 |
| tg128 | 24.14 ± 0.01 |
| tg512 | 23.82 ± 0.01 |
| tg1024 | 23.49 ± 0.00 |

Variant M (18.68 GB) — Same configuration

| Test | Tokens/sec |
|------|------------|
| pp1024 | 1361.17 ± 1.42 |
| pp2048 | 1473.53 ± 1.80 |
| pp4096 | 1510.83 ± 1.21 |
| pp8192 | 1476.50 ± 0.61 |
| pp16384 | 1365.21 ± 0.38 |
| pp32768 | 1169.50 ± 0.16 |
| tg128 | 25.05 ± 0.01 |
| tg512 | 24.66 ± 0.02 |
| tg1024 | 24.30 ± 0.01 |

Single AMD Radeon RX 7900 XTX (24GB VRAM)

Variant L (20.11 GB) — -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024

| Test | Tokens/sec |
|------|------------|
| pp1024 | 962.09 ± 0.75 |
| pp2048 | 925.67 ± 0.24 |
| pp4096 | 886.89 ± 0.39 |
| pp8192 | 834.71 ± 0.27 |
| pp16384 | 752.85 ± 0.22 |
| pp32768 | OOM crash |
| tg128 | 26.73 ± 0.02 |
| tg512 | 26.33 ± 0.01 |
| tg1024 | 25.94 ± 0.00 |

Variant M (18.68 GB) — Same configuration

| Test | Tokens/sec |
|------|------------|
| pp1024 | 955.37 ± 0.72 |
| pp2048 | 919.04 ± 0.27 |
| pp4096 | 880.58 ± 0.30 |
| pp8192 | 829.50 ± 0.12 |
| pp16384 | 749.10 ± 0.16 |
| pp32768 | 629.07 ± 0.17 |
| tg128 | 27.98 ± 0.01 |
| tg512 | 27.53 ± 0.01 |
| tg1024 | 27.08 ± 0.00 |
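As a back-of-the-envelope latency estimate from these numbers, the time for one request splits into prefill (prompt processing) and decode (token generation). The sketch below plugs in the single-GPU Variant M rates for pp4096 and tg512; it is an approximation that ignores sampling and server overhead:

```python
def request_seconds(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Rough end-to-end estimate: prefill time plus decode time."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate

# Rates from the Variant M single-GPU table (pp4096 and tg512).
t = request_seconds(4096, 512, pp_rate=880.58, tg_rate=27.53)
print(f"{t:.1f} s")  # ~4.65 s prefill + ~18.60 s decode -> prints "23.2 s"
```

Note that prefill throughput drops as context grows, so long-context requests will be somewhat slower than this linear estimate suggests.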

Quality Metrics

Evaluated on wikitext-2-raw with F16 baseline:

| Variant | PPL | KLD vs F16 | Notes |
|---------|-----|------------|-------|
| F16 Baseline | 12250.49 | 0.000 | Reference |
| Variant L | ~12565 | 0.647 | Q8_0 upcasts on more attn/ffn_down layers |
| Variant M | ~12197 | 0.684 | Lighter Q8_0 mix; Q4_1 ffn_down |

KLD (Kullback-Leibler divergence) measures how far the quantized model's next-token distribution diverges from the F16 baseline's; lower is better.
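As a toy illustration of the metric (not the actual evaluation code), the per-position KL divergence between two next-token distributions can be computed like this; the probabilities below are made up:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary:
p_f16   = [0.70, 0.20, 0.10]  # F16 baseline
p_quant = [0.65, 0.22, 0.13]  # quantized model
print(round(kl_divergence(p_f16, p_quant), 4))  # small value -> close match
```

The reported KLD is this quantity averaged over every token position in the evaluation corpus, which is why it discriminates quantization damage more sensitively than perplexity alone.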

Usage with llama.cpp

Basic Chat (Single GPU)

```shell
# Serve the model with a 32K context on a single 7900 XTX
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
    --mmproj mmproj-gemma-4-31B-it-f16.gguf \
    -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
    -c 32768 -ub 1024 -t 5
```

Multi-GPU (2x 7900 XTX)

```shell
# Split across 2 GPUs with a smaller micro-batch for higher throughput
./llama-server -m gemma-4-31B-it-IQ4_NL_L_AMD.gguf \
    --mmproj mmproj-gemma-4-31B-it-f16.gguf \
    -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
    -c 32768 -ub 256 -b 1024 -t 5 \
    -ts 1,1
```

API Server

```shell
# Start an OpenAI-compatible API server
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
    --mmproj mmproj-gemma-4-31B-it-f16.gguf \
    -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
    -c 32768 -ub 1024 -t 5 \
    --port 8080
```
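Once the server is up, any OpenAI-compatible client can talk to it. A minimal Python sketch using only the standard library (the host and port are assumptions matching the `--port 8080` command above):

```python
import json
import urllib.request

def build_request(prompt, host="http://localhost:8080", max_tokens=256):
    """Build an OpenAI-style chat completions request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        host + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt):
    # Blocks until the server responds; requires a running llama-server.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For image inputs, the same endpoint accepts OpenAI-style multimodal `content` arrays when the server was started with `--mmproj`.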

Hardware Requirements

| Variant | Minimum VRAM | Recommended |
|---------|--------------|-------------|
| Variant L (20GB) | 24GB (single GPU, limited context) | 48GB (2x 7900 XTX) |
| Variant M (18GB) | 24GB (single 7900 XTX) | 48GB (2x 7900 XTX) |

Note: Both variants were tested on AMD ROCm 6.4.4 with llama.cpp b8703+. NVIDIA GPUs should also work, but they were not the focus of testing.

Quantization Method

Variant L (IQ4_NL_L) — Tensor Composition

| Tensor Group | Quantization | Layers |
|--------------|--------------|--------|
| attn_k + attn_output | IQ4_NL | All 60 layers |
| attn_q | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| attn_q | IQ4_NL | Remaining layers |
| attn_v | Q8_0 | Layers 0-4, 6, 9, 13, 16, 20, 24, 27, 31, 34, 38, 42, 45, 49, 51, 52, 54-58 (23 layers) |
| attn_v | IQ4_NL | Remaining layers |
| ffn_down | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| ffn_down | IQ4_NL | Remaining layers |
| ffn_gate + ffn_up | Q4_1 | All 60 layers |
| token_embd | Q8_0 | - |
| output_norm | F32 | - |

Variant M (Pure IQ4_NL) — Tensor Composition

| Tensor Group | Quantization | Layers |
|--------------|--------------|--------|
| attn_k + attn_output | IQ4_NL | All 60 layers |
| attn_q | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| attn_q | IQ4_NL | Remaining layers |
| attn_v | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| attn_v | IQ4_NL | Remaining layers |
| ffn_down | Q4_1 | All 60 layers |
| ffn_gate + ffn_up | Q4_1 | All 60 layers |
| token_embd | Q8_0 | - |
| output_norm | F32 | - |

Calibration

  • Dataset: Mixed task-specific calibration (Magicoder + APIGen-MT + When2Call)
  • Method: imatrix with task-formatted data
  • Group Size: 128
  • imatrix entries: 410
  • imatrix chunks: 100
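The workflow above can be reproduced with the stock llama.cpp tools. A sketch, with hypothetical file names (the calibration text itself is not included in this repository), guarded so it degrades gracefully when the binaries are not on hand:

```shell
# Run from a llama.cpp build directory; file names here are placeholders.
if [ -x ./llama-imatrix ] && [ -x ./llama-quantize ]; then
    # 1. Collect the importance matrix over the calibration text.
    ./llama-imatrix -m gemma-4-31B-it-f16.gguf -f calibration.txt -o imatrix.dat
    # 2. Quantize, letting the imatrix guide per-tensor rounding decisions.
    ./llama-quantize --imatrix imatrix.dat \
        gemma-4-31B-it-f16.gguf gemma-4-31B-it-IQ4_NL_M_AMD.gguf IQ4_NL
    status=done
else
    echo "llama.cpp tools not found in the current directory"
    status=skipped
fi
```

The shipped imatrix.dat can be reused directly with `llama-quantize --imatrix` to produce other quantization types from an F16 conversion of the base model.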

Files in This Repository

| File | Description |
|------|-------------|
| gemma-4-31B-it-IQ4_NL_L_AMD.gguf | Variant L (20 GB, quality-optimized) |
| gemma-4-31B-it-IQ4_NL_M_AMD.gguf | Variant M (18 GB, size-optimized) |
| mmproj-gemma-4-31B-it-f16.gguf | Vision projector (F16) |
| mmproj-gemma-4-31B-it-q8_0.gguf | Vision projector (Q8_0) |
| imatrix.dat | Calibration data (optional) |

License

This quantization is released under the Apache 2.0 License.

The base model google/gemma-4-31B-it is also licensed under Apache 2.0.

See LICENSE for the full text.

Citation

If you use this model in your research, please cite:

```bibtex
@misc{gemma4-31b-gguf-amd,
  title = {Gemma-4-31B-it GGUF Quantization for AMD RDNA3},
  author = {ebircak},
  year = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-GGUF}},
  note = {Quantized with llama.cpp b8703 for AMD gfx1100}
}
```

Disclaimer

This is a community quantization of the Google Gemma-4-31B-it model optimized for AMD RDNA3 GPUs. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.

These builds are specifically optimized for the AMD Radeon RX 7900 XTX (gfx1100) using ROCm. They will also run unmodified on other AMD GPUs and on NVIDIA GPUs, but other GGUF quantizations may offer a better size/quality (KLD) tradeoff on that hardware.

Model Card Version

This model card follows the Model Cards for Model Reporting standard.


Original Model: google/gemma-4-31B-it
Quantization Tool: llama.cpp
Quantization Format: GGUF V3
Target Hardware: AMD Radeon RX 7900 XTX (gfx1100, RDNA3)
