mattbucci
/

gemma-4-26B-AWQ

+---
+base_model: google/gemma-4-26b-a4b-it
+tags:
+- awq
+- 4-bit
+- rdna4
+- gfx1201
+- rocm
+- sglang
+- quantized
+license: apache-2.0
+---
+# Gemma 4 26B MoE AWQ 4-bit
+AWQ 4-bit quantization of [Gemma 4 26B-A4B-it](https://huggingface.co/google/gemma-4-26b-a4b-it) optimized for AMD RDNA4 (gfx1201) inference with [SGLang](https://github.com/sgl-project/sglang).
+## Model Details
+| | |
+|---|---|
+| **Base model** | [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it) |
+| **Architecture** | MoE (128 experts, top-8) |
+| **Parameters** | 26B total / 4B active |
+| **Layers** | 30 |
+| **Context** | 4K (tested) |
+| **Quantization** | AWQ 4-bit, group_size=32. Forced-routing GPTQ calibration covers all 128 experts (standard GPTQ only calibrates ~1/128). |
+## Performance (2x AMD Radeon AI PRO R9700, TP=2)
+- **Decode speed**: 30 tok/s single-user on 2x R9700
+- **Launch**: `scripts/launch.sh gemma4`
+## Notes
+Standard community GPTQ under-calibrates rare experts due to routing imbalance. This model uses forced-routing calibration to ensure all 128 experts are properly quantized.
+## Usage with SGLang
+```bash
+git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
+cd 2x-R9700-RDNA4-GFX1201-sglang-inference
+./scripts/setup.sh
+scripts/launch.sh gemma4
+```
+See the [RDNA4 Inference Repository](https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference) for full setup instructions, patches, and benchmarks.
+## Hardware
+Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.