---
base_model: google/gemma-4-26b-a4b-it
tags:
- awq
- 4-bit
- rdna4
- gfx1201
- rocm
- sglang
- quantized
license: apache-2.0
---

# Gemma 4 26B MoE AWQ 4-bit

AWQ 4-bit quantization of [Gemma 4 26B-A4B-it](https://huggingface.co/google/gemma-4-26b-a4b-it), optimized for AMD RDNA4 (gfx1201) inference with [SGLang](https://github.com/sgl-project/sglang).

## Model Details

| | |
|---|---|
| **Base model** | [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it) |
| **Architecture** | MoE (128 experts, top-8 routing) |
| **Parameters** | 26B total / 4B active |
| **Layers** | 30 |
| **Context** | 4K (tested) |
| **Quantization** | AWQ 4-bit, group_size=32. Forced-routing calibration covers all 128 experts (standard calibration reaches only ~1/128). |

## Performance (2x AMD Radeon AI PRO R9700, TP=2)

- **Decode speed**: 30 tok/s, single user
- **Launch**: `scripts/launch.sh gemma4`

## Notes

Standard community GPTQ quants under-calibrate rarely-routed experts: token routing is imbalanced, so experts the router seldom selects see almost no calibration data. This model instead uses forced-routing calibration, which guarantees that all 128 experts receive calibration activations and are quantized properly. A minimal sketch of the idea appears at the end of this card.

## Known Limitations

- **Vision: untestable.** The vision encoder layers (`embed_vision.*`) were quantized to INT4, which likely degrades vision quality, and the server crashes on the first request (a pre-existing RDNA4 Triton issue with this model's SWA configuration, not vision-specific), so vision quality could not be evaluated. **Text-only inference is recommended.** A future version should add the vision layers to `modules_to_not_convert`.

## Usage with SGLang

```bash
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh gemma4
```

Once the server is up, you can query it over SGLang's OpenAI-compatible HTTP API (see the client example below). See the [RDNA4 Inference Repository](https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference) for full setup instructions, patches, and benchmarks.

## Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+32 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 plus RDNA4 patches.
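
## Querying the Server

SGLang exposes an OpenAI-compatible HTTP API once the server is running. Below is a minimal client sketch, assuming SGLang's default port 30000 and a placeholder model name; `scripts/launch.sh` may configure a different port, so check it for your actual settings.

```python
# Hypothetical client example: talks to a running SGLang server over its
# OpenAI-compatible API. Port 30000 is SGLang's default; adjust base_url
# (and the model name) to match what scripts/launch.sh actually starts.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # SGLang serves one model; a placeholder name usually works
    messages=[{"role": "user", "content": "Explain MoE routing in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```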
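
## Appendix: Forced-Routing Calibration Sketch

An illustrative sketch of the forced-routing idea described in the Notes, not the exact pipeline used to produce this checkpoint; the shapes, the coverage threshold, and the random calibration data are all stand-ins.

```python
# Illustrative sketch of forced-routing calibration for one MoE layer.
# With standard calibration, each expert only sees the tokens the router
# sends it, so rarely-selected experts collect almost no statistics.
import torch

num_experts, top_k, hidden = 128, 8, 2048
router = torch.nn.Linear(hidden, num_experts)
calib = torch.randn(4096, hidden)          # stand-in for real calibration text

scores = router(calib)
topk = scores.topk(top_k, dim=-1).indices  # router's top-8 experts per token

expert_scales = []
for e in range(num_experts):
    x = calib[(topk == e).any(dim=-1)]     # tokens routed to expert e
    # Forced routing: if the router starved this expert, feed it the
    # whole calibration batch so it still gets usable statistics.
    if x.shape[0] < 256:                   # arbitrary coverage threshold
        x = calib
    # Per-input-channel activation scale, the statistic AWQ uses to
    # decide which weight channels to protect before 4-bit quantization.
    expert_scales.append(x.abs().mean(dim=0))
```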