gemma-4-26B-A4B-it-assistant-fp8
NOTE: I'm still trying to figure out whether additional tweaks are needed to get this working with MTP. In my first tests I got a 0% token acceptance rate against both the Google BF16 base and my FP8 quant, so for now consider this experimental and possibly not working. I'll keep tweaking and see if I can get it all running in FP8. Forcing fp8 with the quantization flag at runtime in vLLM works, so it has to be possible.
Format: FP8_DYNAMIC (weights quantized to FP8 statically; activations scaled dynamically at runtime).
Base model: google/gemma-4-26B-A4B-it-assistant
How it was made: one-shot, data-free quantization with LLM Compressor (FP8_DYNAMIC recipe) on a DGX Spark (GB10 Grace Blackwell). No calibration data is required: activations are scaled dynamically at runtime.
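For reference, an FP8_DYNAMIC recipe for LLM Compressor looks roughly like this. This is a sketch, not the exact recipe used for this quant; the stage name and the layers listed under `ignore` are illustrative (check the LLM Compressor docs for the current recipe schema):

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: "FP8_DYNAMIC"
      ignore: ["lm_head"]
```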
Notes:
`lm_head` and multimodal projection layers are kept in high precision. Blackwell (GB10/B100/B200) has native FP8 hardware support, and Hopper (H100/H200) also supports FP8 natively. Older architectures fall back to BF16 compute while still benefiting from the reduced model size.
Check the original model card for information about this model.
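To make the "dynamic activation scaling" mentioned above concrete, here is a simplified per-token sketch of what happens at inference time. It is illustrative only: a real kernel would also round each scaled value to the 8-bit e4m3 float grid, which this sketch skips, and the example values are made up.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def dynamic_activation_scale(token_row):
    """Per-token scale computed from the live activation values at runtime.
    Because the scale comes from the data itself, no calibration set is needed."""
    amax = max(abs(v) for v in token_row)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

def to_fp8_range(token_row):
    """Scale one token's activations so their max lands on the FP8 e4m3 ceiling.
    (Simplified: real kernels then round each value to the nearest e4m3 float.)"""
    scale = dynamic_activation_scale(token_row)
    return [v / scale for v in token_row], scale

row = [0.5, -896.0, 3.2]          # pretend activations for one token
scaled, scale = to_fp8_range(row)  # scale == 2.0; max(|scaled|) == 448.0
```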
Running the model with vLLM in Docker
```shell
sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model Firworks/gemma-4-26B-A4B-it-assistant-fp8 \
    --dtype auto \
    --max-model-len 32768
```
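Once the container is up, the server exposes the OpenAI-compatible API on port 8000. A minimal Python client sketch (the prompt and `max_tokens` value are placeholders):

```python
import json
from urllib import request

payload = {
    "model": "Firworks/gemma-4-26B-A4B-it-assistant-fp8",
    "messages": [{"role": "user", "content": "Explain FP8 quantization in one sentence."}],
    "max_tokens": 128,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```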
Tested on a DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory).
If there are other models you'd like quantized to FP8, let me know.