# Gemma-4-31B-it GGUF Quantization for AMD RDNA3 (gfx1100)
This is a GGUF quantization of Google's Gemma-4-31B-it instruction-tuned multimodal model, optimized for AMD Radeon RX 7900 XTX (gfx1100 / RDNA3) GPUs using llama.cpp.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Quantization Format | GGUF V3 (latest) |
| Quantization Type | multiple |
| Architecture | Gemma4ForConditionalGeneration |
| Layers | 60 decoder layers |
| Hidden Size | 5376 |
| Context Window | 262K tokens |
| Vision Tower | SigLIP (27 layers, F16/Q8_0 mmproj) |
| Quantization Library | llama.cpp b8703 |
| ROCm Version | 6.4.4 |
| Hardware Target | AMD Radeon RX 7900 XTX (gfx1100, 24GB VRAM) |
## Available Variants

This repository contains quantization variants optimized for different performance/quality tradeoffs:

### 4-bit Variant L (Quality-Optimized) — `gemma-4-31B-it-IQ4_NL_L_AMD.gguf`
| Metric | Value |
|---|---|
| File Size | ~20.11 GB |
| bpw | ~4.5 bits per weight |
| KLD vs F16 | 0.647 (lower is better) |
| PPL (wikitext-2) | ~12565 |
| Quant Strategy | IQ4_NL |
**Best for:** maximum quality where VRAM allows (~20GB): a single 24GB GPU with reduced context, or a 2-GPU setup.
### Variant M (Size-Optimized) — `gemma-4-31B-it-IQ4_NL_M_AMD.gguf`
| Metric | Value |
|---|---|
| File Size | ~18.70 GB |
| bpw | ~4.25 bits per weight |
| KLD vs F16 | 0.684 (lower is better) |
| PPL (wikitext-2) | ~12197 |
| Quant Strategy | IQ4_NL |
**Best for:** a single 7900 XTX (24GB) with headroom for the KV cache.
## Vision Projector Files

| File | Size | Description |
|---|---|---|
| `mmproj-gemma-4-31B-it-f16.gguf` | ~1.12 GB | Vision encoder projector (F16) |
| `mmproj-gemma-4-31B-it-q8_0.gguf` | ~0.75 GB | Vision encoder projector (Q8_0) |
## Performance Benchmarks

### 2x AMD Radeon RX 7900 XTX (48GB Total VRAM)

#### Variant L (20.11 GB) — llama.cpp b8703, `-fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 256`
| Test | Tokens/sec |
|---|---|
| pp1024 | 1381.37 ± 2.36 |
| pp2048 | 1494.88 ± 1.42 |
| pp4096 | 1530.94 ± 0.97 |
| pp8192 | 1493.17 ± 0.76 |
| pp16384 | 1377.77 ± 0.66 |
| pp32768 | 1177.68 ± 0.54 |
| tg128 | 24.14 ± 0.01 |
| tg512 | 23.82 ± 0.01 |
| tg1024 | 23.49 ± 0.00 |
#### Variant M (18.68 GB) — same configuration
| Test | Tokens/sec |
|---|---|
| pp1024 | 1361.17 ± 1.42 |
| pp2048 | 1473.53 ± 1.80 |
| pp4096 | 1510.83 ± 1.21 |
| pp8192 | 1476.50 ± 0.61 |
| pp16384 | 1365.21 ± 0.38 |
| pp32768 | 1169.50 ± 0.16 |
| tg128 | 25.05 ± 0.01 |
| tg512 | 24.66 ± 0.02 |
| tg1024 | 24.30 ± 0.01 |
### Single AMD Radeon RX 7900 XTX (24GB VRAM)

#### Variant L (20.11 GB) — `-fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024`
| Test | Tokens/sec |
|---|---|
| pp1024 | 962.09 ± 0.75 |
| pp2048 | 925.67 ± 0.24 |
| pp4096 | 886.89 ± 0.39 |
| pp8192 | 834.71 ± 0.27 |
| pp16384 | 752.85 ± 0.22 |
| pp32768 | OOM crash |
| tg128 | 26.73 ± 0.02 |
| tg512 | 26.33 ± 0.01 |
| tg1024 | 25.94 ± 0.00 |
#### Variant M (18.68 GB) — same configuration
| Test | Tokens/sec |
|---|---|
| pp1024 | 955.37 ± 0.72 |
| pp2048 | 919.04 ± 0.27 |
| pp4096 | 880.58 ± 0.30 |
| pp8192 | 829.50 ± 0.12 |
| pp16384 | 749.10 ± 0.16 |
| pp32768 | 629.07 ± 0.17 |
| tg128 | 27.98 ± 0.01 |
| tg512 | 27.53 ± 0.01 |
| tg1024 | 27.08 ± 0.00 |
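The tables above can be reproduced with `llama-bench` from the same llama.cpp build. A sketch (single-GPU configuration; use `-ub 256` for the 2-GPU runs, and note it needs the model file locally):

```shell
# Sketch: reproduce the pp/tg benchmark sweep above with llama-bench
./llama-bench -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
  -fa 1 -ctk q4_1 -ctv q4_1 -ngl 999 -ub 1024 \
  -p 1024,2048,4096,8192,16384,32768 \
  -n 128,512,1024
```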
## Quality Metrics

Evaluated on wikitext-2-raw against an F16 baseline:

| Variant | PPL | KLD vs F16 | Notes |
|---|---|---|---|
| F16 Baseline | 12250.49 | 0.000 | Reference |
| Variant L | ~12565 | 0.647 | Q8_0 on more attn/ffn_down layers |
| Variant M | ~12197 | 0.684 | Q8_0 on fewer layers; ffn_down in Q4_1 |

KLD (Kullback–Leibler divergence) measures how far the quantized model's output distribution diverges from the F16 baseline; lower is better.
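As a toy illustration of the metric (not the evaluation script used for the numbers above), the KL divergence between two discrete distributions P and Q is the sum of p·ln(p/q) over all outcomes; a minimal sketch with awk:

```shell
# Toy KL divergence D(P||Q) for two hypothetical 3-outcome distributions.
# log() in awk is the natural logarithm, matching the definition above.
kld=$(awk 'BEGIN {
  split("0.5 0.3 0.2", p, " ");   # distribution P
  split("0.4 0.4 0.2", q, " ");   # distribution Q
  d = 0;
  for (i = 1; i <= 3; i++) d += p[i] * log(p[i] / q[i]);
  printf "%.4f", d;
}')
echo "$kld"
```

Identical distributions give a KLD of exactly 0, which is why the F16 baseline row reads 0.000.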
## Usage with llama.cpp
### Basic Chat (Single GPU)

```shell
# Load model with full context on a single 7900 XTX
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
  -c 32768 -ub 1024 -t 5
```
### Multi-GPU (2x 7900 XTX)

```shell
# Split across 2 GPUs; the smaller micro-batch (-ub 256) gives higher
# prompt-processing throughput in the multi-GPU configuration
./llama-server -m gemma-4-31B-it-IQ4_NL_L_AMD.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
  -c 32768 -ub 256 -b 1024 -t 5 \
  -ts 1,1
```
### API Server

```shell
# Start OpenAI-compatible API
./llama-server -m gemma-4-31B-it-IQ4_NL_M_AMD.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  -ngl 999 -fa 1 -ctk q4_1 -ctv q4_1 \
  -c 32768 -ub 1024 -t 5 \
  --port 8080
```
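Once the server is up, it can be queried like any OpenAI-compatible endpoint. A sketch (port per the command above; llama-server serves whatever model it loaded regardless of the `model` field):

```shell
# Query the chat completions endpoint of the server started above
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 128
  }'
```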
## Hardware Requirements

| Variant | Minimum VRAM | Recommended |
|---|---|---|
| Variant L (20GB) | 24GB (single GPU, limited context) | 48GB (2x 7900 XTX) |
| Variant M (18GB) | 24GB (single 7900 XTX) | 48GB (2x 7900 XTX) |

Note: Both variants were tested on AMD ROCm 6.4.4 with llama.cpp b8703+. NVIDIA GPUs should also work, but they were not the focus of testing.
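For RDNA3 cards other than the 7900 XTX that ROCm does not ship kernels for (e.g. gfx1101/gfx1102), a commonly used workaround — not something validated by this repository — is to override the reported GPU architecture so gfx1100 builds are used:

```shell
# Assumption: a non-gfx1100 RDNA3 card. Force ROCm to treat it as gfx1100.
# Not needed on the 7900 XTX itself. Set before launching llama-server.
export HSA_OVERRIDE_GFX_VERSION=11.0.0
echo "$HSA_OVERRIDE_GFX_VERSION"
```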
## Quantization Method

### Variant L (IQ4_NL_L) — Tensor Composition

| Tensor Group | Quantization | Layers |
|---|---|---|
| `attn_k` + `attn_output` | IQ4_NL | All 60 layers |
| `attn_q` | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| `attn_q` | IQ4_NL | Remaining layers |
| `attn_v` | Q8_0 | Layers 0-4, 6, 9, 13, 16, 20, 24, 27, 31, 34, 38, 42, 45, 49, 51, 52, 54-58 (23 layers) |
| `attn_v` | IQ4_NL | Remaining layers |
| `ffn_down` | Q8_0 | Layers 0-6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51-59 (24 layers) |
| `ffn_down` | IQ4_NL | Remaining layers |
| `ffn_gate` + `ffn_up` | Q4_1 | All 60 layers |
| `token_embd` | Q8_0 | - |
| `output_norm` | F32 | - |
### Variant M (IQ4_NL_M) — Tensor Composition

| Tensor Group | Quantization | Layers |
|---|---|---|
| `attn_k` + `attn_output` | IQ4_NL | All 60 layers |
| `attn_q` | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| `attn_q` | IQ4_NL | Remaining layers |
| `attn_v` | Q8_0 | Layers 0-6, 12, 15, 18, 21, 24, 30, 36, 39, 42, 45, 48, 51-59 (20 layers) |
| `attn_v` | IQ4_NL | Remaining layers |
| `ffn_down` | Q4_1 | All 60 layers |
| `ffn_gate` + `ffn_up` | Q4_1 | All 60 layers |
| `token_embd` | Q8_0 | - |
| `output_norm` | F32 | - |
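Mixed compositions like the tables above can be produced with per-tensor overrides in `llama-quantize`. A sketch (hypothetical F16 input filename; the `--tensor-type` override requires a recent llama.cpp build):

```shell
# Sketch: quantize to an IQ4_NL base with per-tensor-group overrides,
# guided by the importance matrix shipped in this repository
./llama-quantize --imatrix imatrix.dat \
  --tensor-type ffn_gate=Q4_1 --tensor-type ffn_up=Q4_1 \
  gemma-4-31B-it-f16.gguf gemma-4-31B-it-IQ4_NL_M_AMD.gguf IQ4_NL
```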
## Calibration
- Dataset: Mixed task-specific calibration (Magicoder + APIGen-MT + When2Call)
- Method: imatrix with task-formatted data
- Group Size: 128
- imatrix entries: 410
- imatrix chunks: 100
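The importance matrix is generated with llama.cpp's `llama-imatrix` tool. A sketch under the settings listed above (the calibration text filename is hypothetical):

```shell
# Sketch: compute the importance matrix over the mixed calibration corpus
./llama-imatrix -m gemma-4-31B-it-f16.gguf \
  -f calibration-mixed.txt -o imatrix.dat --chunks 100
```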
## Files in This Repository

| File | Description |
|---|---|
| `gemma-4-31B-it-IQ4_NL_L_AMD.gguf` | Variant L (20GB, Q8_0 on select attn/FFN layers) |
| `gemma-4-31B-it-IQ4_NL_M_AMD.gguf` | Variant M (18GB, mostly IQ4_NL) |
| `mmproj-gemma-4-31B-it-f16.gguf` | Vision projector (F16) |
| `mmproj-gemma-4-31B-it-q8_0.gguf` | Vision projector (Q8_0) |
| `imatrix.dat` | Calibration data (optional) |
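Individual files can be fetched without cloning the whole repository; a sketch using the `huggingface_hub` CLI (repo id taken from the citation URL below):

```shell
# Sketch: download Variant M plus the F16 vision projector
huggingface-cli download ebircak/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-IQ4_NL_M_AMD.gguf mmproj-gemma-4-31B-it-f16.gguf \
  --local-dir .
```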
## License
This quantization is released under the Apache 2.0 License.
The base model google/gemma-4-31B-it is also licensed under Apache 2.0.
See LICENSE for the full text.
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{gemma4-31b-gguf-amd,
  title        = {Gemma-4-31B-it GGUF Quantization for AMD RDNA3},
  author       = {ebircak},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/ebircak/gemma-4-31B-it-GGUF}},
  note         = {Quantized with llama.cpp b8703 for AMD gfx1100}
}
```
## Disclaimer

This is a community quantization of the Google Gemma-4-31B-it model, optimized for AMD RDNA3 GPUs. While efforts have been made to ensure quality, this model is provided "as is" without warranty of any kind. Users should evaluate the model for their specific use cases.

These builds are specifically optimized for the AMD Radeon RX 7900 XTX (gfx1100) under ROCm. They will also run on other AMD or NVIDIA GPUs without modification, but other GGUF quantizations may offer a better size/KLD tradeoff on that hardware.
## Model Card Version

This model card follows the Model Cards for Model Reporting standard.

- Original Model: google/gemma-4-31B-it
- Quantization Tool: llama.cpp
- Quantization Format: GGUF V3
- Target Hardware: AMD Radeon RX 7900 XTX (gfx1100, RDNA3)