Gemma 4 26B A4B IT Heretic (FP8 Static)

This repository contains an offline statically quantized FP8 version of the coder3101/gemma-4-26B-A4B-it-heretic model. The quantization procedure was designed to maximize inference throughput on NVIDIA hardware equipped with FP8 tensor cores (e.g., Hopper, Blackwell architectures) while preserving conversational accuracy.

Quantization Methodology

The model was quantized using the AutoFP8 library. To avoid the runtime overhead of dynamic activation scaling, a static per-tensor activation scheme was employed.
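The distinction matters at inference time: a dynamic scheme recomputes activation scales on every forward pass, while a static scheme fixes one scale per tensor from calibration data ahead of time. The following is a minimal, illustrative sketch of how a static FP8 (E4M3) activation scale can be derived; the values and helper names are hypothetical, not the actual AutoFP8 implementation:

```python
# Toy illustration of static FP8 activation scaling (E4M3 format).
# The scale is computed once from calibration activations and reused at
# inference time; a dynamic scheme would recompute it for every batch.

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def static_scale(calibration_batches):
    """Derive a single per-tensor scale from the calibration set's max magnitude."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / E4M3_MAX

def quantize(values, scale):
    """Map real activations into the FP8 dynamic range, clamping overflow."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]

# Hypothetical calibration activations standing in for the 512 UltraChat samples.
calib = [[-3.0, 1.5, 0.25], [2.0, -0.5, 1.0]]
scale = static_scale(calib)
print(scale)                        # amax / E4M3_MAX = 3.0 / 448
print(quantize([1.5, -6.0], scale))  # -6.0 overflows and is clamped to -448
```

Because the scale is frozen, any activation at inference time that exceeds the calibration maximum is clamped rather than rescaled, which is the accuracy/throughput trade-off static schemes accept.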

  • Quantization Format: FP8 (W8A8)

  • Activation Scheme: Static

  • Calibration Dataset: 512 samples from the mgoin/ultrachat_2k dataset.

  • Exclusions for Multimodality & Architecture Stability: The Gemma 4 architecture uses separate pathways for vision (the Vision Tower) and text generation. By default, AutoFP8 attempts to quantize all linear layers indiscriminately. To keep the model fully multimodal and compatible with vLLM, a surgical restoration process was applied post-quantization:

    1. The config.json was patched to explicitly add the vision_tower and embed_vision modules, alongside the MoE router.proj layers, to the ignored_layers list.
    2. The original, unquantized bfloat16 weights for all vision layers (e.g., model.vision_tower.encoder.layers...) and MoE routers were extracted from the base model.
    3. These bfloat16 weights were manually injected back into the quantized FP8 model.safetensors, overwriting the corrupted FP8 layers.
    4. All residual scale factors (weight_scale, input_scale) generated by AutoFP8 for these specific modules were deleted to prevent vLLM from attempting to load them as quantized layers.

    This hybrid-precision approach ensures that the heavy LLM feed-forward and attention layers benefit from FP8 speedups, while the sensitive vision encoders and expert-routing mechanisms retain their native precision, preserving full multimodal capability without crashing the inference engine.
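The four restoration steps above can be sketched over plain dictionaries standing in for config.json and the safetensors state dict. This is not the exact script that was used, only the logic it follows; the module names and tensor placeholders are illustrative:

```python
# Illustrative sketch of the post-quantization restoration (steps 1-4).
# Plain dicts stand in for config.json and the model.safetensors state dict.

def restore_sensitive_modules(config, quantized, original, prefixes):
    # Step 1: patch the ignore list so vLLM skips these modules entirely.
    ignored = config.setdefault("quantization_config", {}) \
                    .setdefault("ignored_layers", [])
    for p in prefixes:
        if p not in ignored:
            ignored.append(p)

    def is_sensitive(name):
        return any(name.startswith(p) for p in prefixes)

    # Steps 2-3: overwrite the corrupted FP8 tensors with the original
    # bfloat16 weights extracted from the base model.
    for name, tensor in original.items():
        if is_sensitive(name):
            quantized[name] = tensor

    # Step 4: drop residual AutoFP8 scale factors for those modules so
    # vLLM does not try to load them as quantized layers.
    for name in list(quantized):
        if is_sensitive(name) and name.endswith(("weight_scale", "input_scale")):
            del quantized[name]
    return config, quantized

# Hypothetical miniature state dict demonstrating the transformation.
config = {"quantization_config": {"ignored_layers": []}}
quantized = {
    "model.vision_tower.encoder.layers.0.q_proj.weight": "fp8 tensor",
    "model.vision_tower.encoder.layers.0.q_proj.weight_scale": 0.02,
    "model.language_model.layers.0.mlp.gate_proj.weight": "fp8 tensor",
}
original = {"model.vision_tower.encoder.layers.0.q_proj.weight": "bf16 tensor"}
restore_sensitive_modules(config, quantized, original, ["model.vision_tower"])
print(config["quantization_config"]["ignored_layers"])
print(sorted(quantized))
```

After the call, the vision weight is back in bfloat16, its scale factor is gone, and the language-model layer is left untouched in FP8.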

Deployment with vLLM

This model is pre-configured for deployment using vLLM. The static scaling factors are embedded within the model tensors, allowing vLLM to load and execute the model with native hardware acceleration automatically.

Prerequisites

Ensure you are using a recent version of vLLM (v0.19.0 or later) that supports the Gemma 4 MoE architecture and FP8 loading.

Launch Command

Execute the following command to serve the model:

python -m vllm.entrypoints.openai.api_server \
    --model cloud19/gemma-4-26B-A4B-it-heretic-FP8-Static \
    --served-model-name "gemma-rp-uncensored" \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --port 8080
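Once the server is up, it exposes the standard OpenAI-compatible API on port 8080 under the served model name from the launch command. A minimal client sketch using only the Python standard library (the endpoint path and model name mirror the flags above; the actual request is left commented out since it requires a running server):

```python
import json
import urllib.request

# Build a chat-completion request against the vLLM OpenAI-compatible server
# launched above (served model name "gemma-rp-uncensored", port 8080).
def build_request(prompt, base_url="http://localhost:8080"):
    payload = {
        "model": "gemma-rp-uncensored",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Summarize FP8 static quantization in one sentence.")
print(req.full_url)

# Against a live server, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (the official `openai` SDK, curl, etc.) works the same way by pointing its base URL at the server.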