gemma-4-26B-AWQ / README.md

mattbucci

Update vision status: untestable due to server crash

d47e216 verified 26 days ago

preview code

raw

history blame contribute delete

1.99 kB

metadata

base_model: google/gemma-4-26b-a4b-it
tags:
  - awq
  - 4-bit
  - rdna4
  - gfx1201
  - rocm
  - sglang
  - quantized
license: apache-2.0

Gemma 4 26B MoE AWQ 4-bit

AWQ 4-bit quantization of Gemma 4 26B-A4B-it optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details


Base model	google/gemma-4-26b-a4b-it
Architecture	MoE (128 experts, top-8)
Parameters	26B total / 4B active
Layers	30
Context	4K (tested)
Quantization	AWQ 4-bit, group_size=32. Forced-routing GPTQ calibration covers all 128 experts (standard GPTQ only calibrates ~1/128).

Performance (2x AMD Radeon AI PRO R9700, TP=2)

Decode speed: 30 tok/s single-user on 2x R9700
Launch: scripts/launch.sh gemma4

Notes

Standard community GPTQ under-calibrates rare experts due to routing imbalance. This model uses forced-routing calibration to ensure all 128 experts are properly quantized.

Known Limitations

Vision: UNTESTABLE — Vision encoder layers (embed_vision.*) were quantized to INT4, which likely degrades vision quality. Server crashes on first request (pre-existing RDNA4 triton issue with this model's SWA configuration, not vision-specific). Text-only inference recommended. A future version should add vision layers to modules_to_not_convert.

Usage with SGLang

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh gemma4

See the RDNA4 Inference Repository for full setup instructions, patches, and benchmarks.

Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.