gemma-4-26B-A4B-it-assistant-fp8

NOTE: I'm still trying to figure out whether additional tweaks are needed to get this working with MTP. In my first tests I got a 0% token acceptance rate against both the Google BF16 base and my FP8 quant, so for now consider this experimental and possibly not working. I'll keep tweaking and see if I can get it all running in FP8. Using the quantization flag at run time in vLLM to force FP8 works, so it has to be possible.
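
For context, the runtime path I mean is vLLM's quantization="fp8" option, which quantizes the BF16 checkpoint on the fly at load time. A minimal sketch (prompt, max_model_len, and generation settings are placeholders):

from vllm import LLM, SamplingParams

# On-the-fly FP8: load the BF16 base and let vLLM quantize it at load time.
llm = LLM(
    model="google/gemma-4-26B-A4B-it-assistant",
    quantization="fp8",
    max_model_len=32768,
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)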

Format: FP8_DYNAMIC (weights quantized to FP8 statically; activations scaled dynamically at runtime).
Base model: google/gemma-4-26B-A4B-it-assistant
How it was made: One-shot, data-free quantization with LLM Compressor (FP8_DYNAMIC recipe) on a DGX Spark (GB10 Grace Blackwell). No calibration data is required because activations are scaled dynamically at runtime.
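
A sketch of what a one-shot FP8_DYNAMIC run with LLM Compressor looks like (not the exact script used here; the save directory and ignore list are illustrative, and older llmcompressor releases import oneshot from llmcompressor.transformers instead):

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-26B-A4B-it-assistant"
SAVE_DIR = "gemma-4-26B-A4B-it-assistant-fp8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static FP8 weight scales, dynamic per-token activation scales,
# so no calibration dataset is needed. lm_head (and, for multimodal checkpoints,
# the projector modules) stay in high precision via the ignore list.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)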

Notes: lm_head and the multimodal projection layers are kept in high precision. Blackwell (GB10/B100/B200) has native FP8 hardware support, as does Hopper (H100/H200). Older architectures will fall back to BF16 compute while still benefiting from the reduced model size.
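
If you're not sure whether your GPU has native FP8, the CUDA compute capability tells you (8.9 and up, i.e. Ada, Hopper, and Blackwell, expose FP8 tensor cores). A quick check:

import torch

# Compute capability >= 8.9 means native FP8 tensor cores; older GPUs run this
# checkpoint with BF16 compute but still load the smaller FP8 weights.
major, minor = torch.cuda.get_device_capability()
print("native FP8" if (major, minor) >= (8, 9) else "BF16 fallback")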

Check the original model card for information about this model.

Running the model with vLLM in Docker

sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Firworks/gemma-4-26B-A4B-it-assistant-fp8 \
  --dtype auto \
  --max-model-len 32768

Tested on a DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory).
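
Once the container is up, the server exposes the usual OpenAI-compatible API on port 8000. A quick smoke test (prompt and token limit are placeholders):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Firworks/gemma-4-26B-A4B-it-assistant-fp8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])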

If there are other models you'd like quantized to FP8, let me know.
