Which framework was used for FP8 quantization? LLM-compressor?
No, unfortunately it had to be custom, but it's 100% faithful to llm-compressor. The story is a little convoluted tbh. Are you looking to repro, or to understand?
TL;DR: llm-compressor has some issues around its transformers version (it's pinned way back), so you can't use llm-compressor with Qwen 3.5 (at least the MoE variant, if I'm remembering right). So I had Claude carefully rebuild and verify llm-compressor's FP8 algorithm against Qwen's known, released FP8 checkpoints.
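For anyone trying to repro: the core of FP8 (E4M3) weight quantization, as llm-compressor does it, is per-channel symmetric scaling so each channel's absolute max maps to E4M3's largest finite value (448), then rounding to the E4M3 grid. The sketch below is mine, not the actual code from my pipeline or llm-compressor; it simulates E4M3 rounding in pure Python (a real implementation would use e.g. `torch.float8_e4m3fn`), and the example weights are made up:

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest FP8 E4M3 value (simulation sketch)."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    a = min(abs(x), E4M3_MAX)          # clamp to E4M3's finite range
    e = math.floor(math.log2(a))
    e = max(min(e, 8), -6)             # E4M3 normal exponent range
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
    return s * min(round(a / step) * step, E4M3_MAX)

def quantize_row(row):
    """Per-channel symmetric FP8: amax maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [round_to_e4m3(v / scale) for v in row]
    return q, scale

row = [0.5, -1.25, 3.0, 0.01]          # toy weight channel
q, scale = quantize_row(row)
deq = [v * scale for v in q]           # dequantized values for error check
```

With 3 mantissa bits the worst-case relative rounding error for normal values is 2^-4 (6.25%), which is a quick sanity check you can run on any channel after quantizing.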
I did test its JS divergence (Jensen-Shannon, which I prefer over KL), but it's a long story*. Anyway, I'm pretty confident it's working as expected.
*I was doing FP16 -> NVFP4 quants. Being lazy, I tested my FP8 -> NVFP4 divergence instead, and the FP8 checks out as basically being ground truth for FP16. I'll try to run a more formal JSD comparison if I find the time.
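For reference, the Jensen-Shannon divergence mentioned above is symmetric and bounded (in [0, 1] with base-2 logs), which is why it's a nicer fidelity metric than raw KL when comparing a quantized model's output distributions against the original. A minimal sketch with toy next-token distributions (the numbers here are illustrative, not from my runs):

```python
import math

def kl(p, q):
    """KL divergence in bits; skips zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (base 2): symmetric, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.70, 0.20, 0.10]   # e.g. reference (fp16) next-token probs, toy values
q = [0.65, 0.25, 0.10]   # e.g. quantized (fp8) next-token probs, toy values
d = jsd(p, q)
```

In practice you'd average this over many prompts/positions, comparing the full softmax outputs of the FP16 and quantized models.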
LMK if it would be helpful to release. It's a bear to get FP8 & NVFP4 up tbh; lots of dependency hell trying to get vLLM/SGLang running with these models.
Details here:
https://gist.github.com/nikdavis/ed443d8bfce82a720a88556e11332741
Releasing this quantized model would be very helpful.
I am looking to repro.
I'll carefully read your article first:
https://gist.github.com/nikdavis/ed443d8bfce82a720a88556e11332741