FP8 Dynamic/W8A16 Quants Please

#44
by rjmehta - opened


You can use this model for FP8 with the latest vLLM nightly https://huggingface.co/nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic

The chat template is broken in the nm-testing repo. See also https://github.com/vllm-project/vllm/pull/15505#issuecomment-2768873223.

It has been updated now, thanks!

Thanks! It seems that with the nm-testing repo, one can only use the default settings to host it on vLLM 0.8.3: "vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic --tool-call-parser mistral --enable-auto-tool-choice". However, "--tokenizer_mode mistral --config_format mistral --load_format mistral" is not allowed, since params.json is missing from this version. The difference is that nm-testing uses the transformers-based tokenizer, while Mistral-Small-3.1-24B-Instruct-2503 uses V7-Tekken. Will there be a significant performance difference in terms of function calling between the two versions?
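For reference, the two launch configurations being compared, written out as commands (flag spellings as in vLLM 0.8.x; the second form assumes a repo that ships params.json, which the nm-testing FP8 repo does not):

```shell
# Transformers-format checkpoint: works for the nm-testing FP8 repo
vllm serve nm-testing/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic \
    --tool-call-parser mistral --enable-auto-tool-choice

# Mistral-native format: requires params.json, so only the original repo
vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --tool-call-parser mistral --enable-auto-tool-choice
```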

Good timing on this request: FP8 dynamic quants for Mistral-Small-3.1-24B would be genuinely useful. The 24B parameter count sits in an awkward spot where BF16 is borderline for single-GPU hardware (an A100 80GB is fine, but anything smaller gets tight), so W8A16 in particular makes a lot of sense: you get meaningful memory reduction with minimal accuracy degradation compared to full FP8 activation quantization. For this model specifically, activation outliers in the attention layers can cause headaches with naive FP8 dynamic scaling, which is why weight-only quantization schemes tend to hold up well on the 3.1 series.

If anyone is working on producing these quants, llm-compressor with the FP8_DYNAMIC or W8A16 schemes is a reasonable starting point. It is worth checking how the attention implementation in this variant interacts with per-tensor vs per-channel scaling choices; I'd recommend per-channel weight quantization if you're doing W8A16, to preserve quality at longer context windows (this model supports 128k context, and quantization errors compound over long sequences in ways that per-tensor schemes handle poorly).
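To make the per-tensor vs per-channel point concrete, here is a small self-contained sketch (plain Python, no llm-compressor; the toy weight matrix and its outlier row are invented for illustration) comparing INT8 weight-only quantization error under the two scaling granularities:

```python
import random

random.seed(0)

def quantize(w, scale):
    """Round-to-nearest symmetric INT8 quantization of a single weight."""
    q = max(-127, min(127, round(w / scale)))
    return q * scale

# Toy weight matrix: 63 small-magnitude rows plus one outlier row, mimicking
# the outlier channels that hurt per-tensor scaling in real LLM layers.
normal_rows = [[random.gauss(0.0, 0.02) for _ in range(64)] for _ in range(63)]
outlier_row = [random.gauss(0.0, 1.0) for _ in range(64)]
W = [outlier_row] + normal_rows

def mae(rows, scales):
    """Mean absolute quantization error given one scale per row."""
    total = n = 0
    for row, s in zip(rows, scales):
        for w in row:
            total += abs(w - quantize(w, s))
            n += 1
    return total / n

# Per-tensor: a single scale for the whole matrix, dominated by the outlier.
s_tensor = max(abs(w) for row in W for w in row) / 127
err_tensor = mae(W, [s_tensor] * len(W))

# Per-channel: one scale per row, so small-magnitude rows keep resolution.
s_channel = [max(abs(w) for w in row) / 127 for row in W]
err_channel = mae(W, s_channel)

print(f"per-tensor  MAE: {err_tensor:.6f}")
print(f"per-channel MAE: {err_channel:.6f}")
```

The per-channel error comes out much lower because the outlier row no longer inflates the quantization step used for every other channel; that gap is what compounds over long sequences.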

On a tangentially related note: we've been running Mistral-Small-3.1 variants as inference backends in AgentGraph for identity verification pipelines, and the quantized versions matter a lot for deployment economics when you're doing high-frequency agent-to-agent calls. The W8A16 format in particular plays nicely with vLLM's paged attention, which is what most people are actually deploying behind these agent workloads anyway. Would strongly second this request; hoping someone from the community or the Mistral team picks it up.

Seconding this: FP8 dynamic quants for Mistral-Small-3.1-24B would be genuinely useful. The 24B parameter count sits in an awkward spot where BF16 is too heavy for most single-GPU consumer setups (the weights alone need roughly 48GB), but aggressive INT4 quants start visibly degrading the instruction-following quality that makes this model variant worth using. FP8 dynamic quantization tends to preserve that quality much better, particularly in the attention layers, which are sensitive to precision loss. W8A16 is also a reasonable middle ground if you're targeting inference on cards like the 3090/4090, where memory bandwidth is the real bottleneck.
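The memory arithmetic behind that claim, as a quick back-of-envelope sketch (weights only, ignoring KV cache and runtime overhead; 24e9 is an approximation of the true parameter count):

```python
PARAMS = 24e9  # approximate parameter count for the 24B model

# Bytes per parameter for the formats discussed in this thread.
bytes_per_param = {
    "BF16": 2.0,
    "FP8 / W8A16 (8-bit weights)": 1.0,
    "INT4 (W4A16)": 0.5,
}

for fmt, b in bytes_per_param.items():
    gib = PARAMS * b / 1024**3
    print(f"{fmt:30s} ~{gib:5.1f} GiB of weights")
```

BF16 lands around 45 GiB of weights alone, which is why it fits an 80GB A100 comfortably but not a 24GB 3090/4090, while 8-bit weights roughly halve that.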

For anyone looking to roll their own in the meantime, llm-compressor from Neural Magic handles FP8 dynamic calibration reasonably well for this model family, and there are existing Mistral-7B/22B FP8 recipes you can adapt. The main calibration concern with 3.1 specifically is the extended 128k context: make sure your calibration dataset includes longer sequences, or you'll get skewed activation ranges that hurt quality at the context lengths where this model actually differentiates itself.
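A minimal sketch of that calibration advice, assuming you have pre-tokenized samples on hand (the pools, lengths, and `long_fraction` knob here are all invented for illustration; the output would be fed to whatever quantization tool you use):

```python
import random

random.seed(0)

# Hypothetical pools of pre-tokenized calibration samples (lists of token ids).
# In practice these come from your own corpus; lengths here are made up.
short_pool = [[0] * random.randint(256, 2_048) for _ in range(500)]
long_pool = [[0] * random.randint(16_000, 64_000) for _ in range(100)]

def build_calibration_set(n_samples=128, long_fraction=0.25):
    """Mix long sequences into the set so activation ranges at long
    context are actually observed during calibration."""
    n_long = int(n_samples * long_fraction)
    samples = random.sample(long_pool, n_long)
    samples += random.sample(short_pool, n_samples - n_long)
    random.shuffle(samples)
    return samples

calib = build_calibration_set()
print(len(calib), max(len(s) for s in calib))
```

If every calibration sample is a few hundred tokens, the observed activation ranges never reflect long-context behavior, which is exactly the skew described above.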

One thing worth noting if you're deploying these quants in agentic pipelines: quantized models can exhibit subtle behavioral drift compared to the base-precision version, especially in tool calling and structured-output fidelity. If you're running this in a multi-agent setup where model outputs feed into downstream trust decisions, that drift matters more than raw benchmark numbers suggest. It's something we've had to account for in AgentGraph when verifying that a quantized model instance is behaving consistently with its registered identity profile: a W4A16 and an FP8 version of the same model are not interchangeable from a trust perspective, even if they share the same model card.
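One way to quantify that kind of drift is to replay the same tool-calling prompts through both variants and measure agreement on the parsed calls. A small sketch (the helper names and the canned output strings are hypothetical, standing in for real BF16 vs FP8 responses):

```python
import json

def parse_tool_call(raw):
    """Parse a model's raw tool-call string; return None if malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    # Normalize argument ordering so semantically equal calls compare equal.
    return (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))

def tool_call_agreement(outputs_a, outputs_b):
    """Fraction of prompts where two variants produce the same parsed call."""
    matches = sum(
        1 for a, b in zip(outputs_a, outputs_b)
        if (pa := parse_tool_call(a)) is not None and pa == parse_tool_call(b)
    )
    return matches / len(outputs_a)

# Canned outputs standing in for two variants' responses to the same prompts.
bf16 = ['{"name": "get_weather", "arguments": {"city": "Paris"}}',
        '{"name": "search", "arguments": {"q": "vllm fp8"}}']
fp8 = ['{"name": "get_weather", "arguments": {"city": "Paris"}}',
       '{"name": "search", "arguments": {"q": "vllm fp8", "limit": 5}}']

print(tool_call_agreement(bf16, fp8))  # 0.5: one call matches, one drifted
```

An agreement score like this catches exactly the structured-output drift that aggregate benchmarks tend to average away.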
