Which framework was used for FP8 quantization? LLM-compressor?
No, unfortunately it had to be custom, but it's 100% faithful to llm-compressor. The story is a little convoluted tbh. Are you looking to repro, or to understand?
TL;DR: llm-compressor has some issues around its transformers version (it's pinned way back), so you can't use llm-compressor with Qwen 3.5 (at least the MoE variant, if I'm remembering right). So I had Claude carefully rebuild and verify llm-compressor's FP8 algorithm against Qwen's known, released FP8 checkpoints.
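For anyone trying to repro: the core of FP8 (E4M3) weight quantization, as llm-compressor does it, is per-channel symmetric scaling so each channel's absolute max maps to E4M3's largest finite value (448), then rounding to the E4M3 grid. The sketch below is mine, not the actual code from my pipeline or llm-compressor; it simulates E4M3 rounding in pure Python (a real implementation would use e.g. `torch.float8_e4m3fn`), and the example weights are made up:

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest FP8 E4M3 value (simulation sketch)."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    a = min(abs(x), E4M3_MAX)          # clamp to E4M3's finite range
    e = math.floor(math.log2(a))
    e = max(min(e, 8), -6)             # E4M3 normal exponent range
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
    return s * min(round(a / step) * step, E4M3_MAX)

def quantize_row(row):
    """Per-channel symmetric FP8: amax maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [round_to_e4m3(v / scale) for v in row]
    return q, scale

row = [0.5, -1.25, 3.0, 0.01]          # toy weight channel
q, scale = quantize_row(row)
deq = [v * scale for v in q]           # dequantized values for error check
```

With 3 mantissa bits the worst-case relative rounding error for normal values is 2^-4 (6.25%), which is a quick sanity check you can run on any channel after quantizing.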
I did test its JS divergence (Jensen-Shannon, which I prefer over KL), but it's a long story*. Anyway, I'm pretty confident it's working as expected.
*I was doing FP16 -> NVFP4 quants. Being lazy, I tested my FP8 -> NVFP4 divergence instead, and the FP8 checks out as basically being ground truth for FP16. I'll try to run a more formal JSD comparison if I find the time.
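For reference, the Jensen-Shannon divergence mentioned above is symmetric and bounded (in [0, 1] with base-2 logs), which is why it's a nicer fidelity metric than raw KL when comparing a quantized model's output distributions against the original. A minimal sketch with toy next-token distributions (the numbers here are illustrative, not from my runs):

```python
import math

def kl(p, q):
    """KL divergence in bits; skips zero-probability terms of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (base 2): symmetric, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.70, 0.20, 0.10]   # e.g. reference (fp16) next-token probs, toy values
q = [0.65, 0.25, 0.10]   # e.g. quantized (fp8) next-token probs, toy values
d = jsd(p, q)
```

In practice you'd average this over many prompts/positions, comparing the full softmax outputs of the FP16 and quantized models.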
LMK if it would be helpful to release. It's a bear to get FP8 & NVFP4 up tbh; lots of dependency hell trying to get vLLM/SGLang running with these models.
Details here:
https://gist.github.com/nikdavis/ed443d8bfce82a720a88556e11332741
Releasing this quantized model would be very helpful.
I am looking to repro.
I'll carefully read your article first:
https://gist.github.com/nikdavis/ed443d8bfce82a720a88556e11332741