FP8 version for running on vLLM with the hardware optimizations of Ada+ generation GPUs

#14
by AQLabs - opened

Thanks for the distillation. The GGUF works fine on llama.cpp, but performance and tool compatibility leave much to be desired. Qwen themselves release an FP8 version of their models as well. Would you please release an FP8 version?

I'll try to look into producing an FP8 version for vLLM, thanks for the suggestion.

@Jackrong I can help with that. I posted an NVFP4 version of qwen3.5-9b and it works really well, especially on Blackwell. FP8 follows the same quantization pipeline.

https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/qwen3_next_example.py

Basically, anyone with enough VRAM can do it, since FP8 quantization does not require a calibration dataset.
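For reference, here's a minimal sketch of that flow, adapted from the linked llm-compressor FP8 example. The model ID below is just a placeholder (swap in the actual checkpoint you want to quantize), and import paths may differ slightly across llm-compressor versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholder -- point this at the distilled checkpoint you want to quantize.
MODEL_ID = "Qwen/Qwen3-8B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: weights are quantized to FP8 offline, activations are
# quantized on the fly at runtime, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format, which vLLM can load directly.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```

The output directory should then load in vLLM directly, e.g. `vllm serve ./Qwen3-8B-FP8-Dynamic`.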

@CHNtentes do you know what the main differences or pros/cons are between using llm-compressor and NVIDIA's ModelOpt?
