FP8 version for running on vLLM with the hardware optimizations of Ada+ generation GPUs

#14
by AQLabs - opened

Thanks for the distillation. The GGUF works fine on llama.cpp, but performance and tool compatibility leave much to be desired. Qwen themselves release an FP8 version of their models as well. Would you please release an FP8 version?

I'll try to look into producing an FP8 version for vLLM, thanks for the suggestion.

@Jackrong I can help with that. I posted an NVFP4 version of qwen3.5-9b and it works really well, especially on Blackwell. FP8 follows the same quantization pipeline.

https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w8a8_fp8/qwen3_next_example.py

Basically, anyone with enough VRAM can do it, since FP8 quantization does not require a calibration dataset.
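For reference, here's a minimal sketch of that flow, adapted from the linked llm-compressor FP8 example. The model ID below is just a placeholder (swap in the actual checkpoint you want to quantize), and import paths may differ slightly across llm-compressor versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Placeholder -- point this at the distilled checkpoint you want to quantize.
MODEL_ID = "Qwen/Qwen3-8B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: weights are quantized to FP8 offline, activations are
# quantized on the fly at runtime, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format, which vLLM can load directly.
save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```

The output directory should then load in vLLM directly, e.g. `vllm serve ./Qwen3-8B-FP8-Dynamic`.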

@CHNtentes do you know what the main differences or pros/cons are between using llm-compressor and NVIDIA's ModelOpt?
