# Gemma 4 26B A4B IT Heretic (FP8 Static)
This repository contains an offline statically quantized FP8 version of the coder3101/gemma-4-26B-A4B-it-heretic model. The quantization procedure was designed to maximize inference throughput on NVIDIA hardware equipped with FP8 tensor cores (e.g., Hopper, Blackwell architectures) while preserving conversational accuracy.
## Quantization Methodology
The model was quantized utilizing the AutoFP8 library. To ensure optimal runtime performance without the overhead of dynamic activation scaling, a strict static activation scheme was employed.
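To make "static activation scheme" concrete, here is an illustrative, dependency-free sketch of the idea (not the AutoFP8 internals): the per-tensor activation scale is fixed ahead of time from calibration data, so inference never pays for a per-batch max-reduction. The function names are hypothetical; 448 is the largest finite value in the FP8 E4M3 format.

```python
# Illustrative sketch of static activation scaling (not AutoFP8's code).
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def static_activation_scale(calibration_batches):
    """Fix one scale from the absolute max observed during calibration."""
    observed_max = max(abs(x) for batch in calibration_batches for x in batch)
    return observed_max / FP8_E4M3_MAX

def quantize(x, scale):
    """Scale into FP8 range and clamp; real FP8 casting happens in hardware."""
    q = x / scale
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q))

# Calibration data spanning [-224, 224] yields a scale of 0.5; any activation
# that later exceeds the calibrated range is clamped rather than rescaled.
scale = static_activation_scale([[-224.0, 10.0], [224.0, -3.0]])
```

The trade-off is that out-of-range activations at inference time saturate, which is why a representative calibration set (here, 512 UltraChat samples) matters.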
- **Quantization Format:** FP8 (W8A8)
- **Activation Scheme:** Static
- **Calibration Dataset:** 512 samples from the `mgoin/ultrachat_2k` dataset

### Exclusions for Multimodality & Architecture Stability

The Gemma 4 architecture utilizes separate pathways for vision (the Vision Tower) and text generation. `AutoFP8` implicitly attempts to quantize all linear layers indiscriminately. To ensure the model remains fully multimodal and compatible with vLLM, a surgical restoration process was applied post-quantization:

- The `config.json` was patched to explicitly add the `vision_tower` and `embed_vision` modules, alongside the MoE `router.proj` layers, to the `ignored_layers` list.
- The original, unquantized `bfloat16` weights for all vision layers (e.g., `model.vision_tower.encoder.layers...`) and MoE routers were extracted from the base model.
- These `bfloat16` weights were manually injected back into the quantized FP8 `model.safetensors`, overwriting the corrupted FP8 layers.
- All residual scale factors (`weight_scale`, `input_scale`) generated by AutoFP8 for these specific modules were deleted to prevent vLLM from attempting to load them as quantized layers.

This hybrid-precision approach guarantees that the heavy LLM feedforward and attention layers benefit from FP8 speedups, while the sensitive vision encoders and expert routing mechanisms maintain their native precision, preserving full multimodal capabilities without crashing the inference engine.
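The restoration steps above can be sketched as follows, using plain dicts in place of real safetensors I/O. The tensor-name prefixes mirror the card's examples; the helper itself is hypothetical, not a script shipped with this repository.

```python
# Sketch of the post-quantization surgery: drop AutoFP8 scale factors for
# ignored modules and re-inject the original bfloat16 weights over them.
SCALE_SUFFIXES = ("weight_scale", "input_scale")

def is_ignored_module(name: str) -> bool:
    """Vision-tower, vision-embedding, and MoE router tensors stay bfloat16."""
    return (
        name.startswith(("model.vision_tower.", "model.embed_vision."))
        or "router.proj" in name
    )

def restore_ignored_layers(fp8_state: dict, bf16_state: dict) -> dict:
    restored = {}
    for name, tensor in fp8_state.items():
        if is_ignored_module(name):
            if name.endswith(SCALE_SUFFIXES):
                continue  # delete residual AutoFP8 scale factors for these modules
            # overwrite the corrupted FP8 tensor with the original bf16 weight
            restored[name] = bf16_state.get(name, tensor)
        else:
            restored[name] = tensor  # keep quantized LLM layers untouched
    return restored

fp8 = {
    "model.vision_tower.encoder.layers.0.fc1.weight": "fp8-corrupted",
    "model.vision_tower.encoder.layers.0.fc1.weight_scale": 0.02,
    "model.layers.0.mlp.gate_proj.weight": "fp8",
    "model.layers.0.mlp.gate_proj.weight_scale": 0.01,
}
bf16 = {"model.vision_tower.encoder.layers.0.fc1.weight": "bf16-original"}
fixed = restore_ignored_layers(fp8, bf16)
```

In the real procedure the same filtering runs over the `model.safetensors` shards with tensor data instead of placeholder strings.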
## Deployment with vLLM
This model is pre-configured for deployment with vLLM. The static scaling factors are embedded in the model tensors, allowing vLLM to automatically load and execute the model with native hardware acceleration.
### Prerequisites
Ensure you are using a recent version of vLLM (v0.19.0 or later) that supports the Gemma 4 MoE architecture and FP8 loading.
### Launch Command
Execute the following command to serve the model:
```shell
python -m vllm.entrypoints.openai.api_server \
    --model cloud19/gemma-4-26B-A4B-it-heretic-FP8-Static \
    --served-model-name "gemma-rp-uncensored" \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --port 8080
```
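Once the server is up, it speaks the OpenAI-compatible chat completions API. A minimal stdlib-only client sketch, assuming the server is reachable on localhost at the port from the launch command:

```python
# Minimal client sketch for the OpenAI-compatible endpoint started above.
# Assumes the server listens on localhost:8080 (per the --port flag).
import json
import urllib.request

def build_chat_request(prompt: str) -> urllib.request.Request:
    payload = {
        "model": "gemma-rp-uncensored",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with urllib.request.urlopen(req) once the server is running.
req = build_chat_request("Hello!")
```

Note that requests must address the served alias (`gemma-rp-uncensored`), not the repository path passed to `--model`.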
## Model Tree for cloud19/gemma-4-26B-A4B-it-heretic-FP8-Static

Base model: google/gemma-4-26B-A4B-it