FP8 Version for running on vLLM with hardware optimizations from Ada+ generation GPUs
#14
by AQLabs - opened
Thanks for the distillation. The GGUF works fine on llama.cpp, but performance and tool compatibility leave much to be desired. Qwen themselves released an FP8 version of their model as well. Would you please release an FP8 version?
I'll try to look into producing an FP8 version for vLLM, thanks for the suggestion.
Basically anyone with enough VRAM can do it, since FP8 quantization does not require a calibration dataset.
@CHNtentes do you know what's the main difference, or the pros/cons, between using llm-compressor vs NVIDIA's ModelOpt?