question about mxfp4
If the model wasn't trained in mxfp4, are there still benefits to quantizing the model to mxfp4? What is the main reason you did that?
just curious
tl;dr: it is mostly an experiment.
If the model wasn't trained in mxfp4, are there still benefits to quantizing the model to mxfp4?
Typically I do not release any MXFP4 quants unless the original model was specifically QAT'd targeting that quantization type. I did some visualizations of various quantization types where, at least for image data that is not zero-mean, MXFP4 does not look very good: https://www.reddit.com/r/LocalLLaMA/comments/1opeu1w/visualizing_quantization_types/ (take it with a grain of salt, given actual tensor data has a different distribution).
Also, ik (who did the original implementation of most of the quantization types used in mainline and ik_llama.cpp, after the original q8_0/q4_0 legacy types) has commented on it as well here:
But don't get excited about using mxfp4 to quantize other models to fp4. The zero-bit mantissa in the block scales, along with the E2M1 choice for the 4-bit floats, results in a horrible quantization accuracy for the 4.25 bpw spent (about the same as IQ3_K), unless the model was directly trained with this specific fp4 variant (as the gpt-oss models).
https://github.com/ikawrakow/ik_llama.cpp/pull/682
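To make the "4.25 bpw" cost concrete, here is a minimal sketch of MXFP4 block quantization per the OCP Microscaling (MX) spec: blocks of 32 values share one power-of-two (E8M0) scale with no mantissa bits, and each value is stored as a signed 4-bit E2M1 float whose magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. This is illustrative only, not llama.cpp's actual kernel:

```python
# Sketch of MXFP4 block quantization (OCP MX spec: 32-element blocks,
# one shared E8M0 power-of-two scale, E2M1 4-bit elements).
# Illustrative only -- not llama.cpp's actual implementation.
import math

# The 8 non-negative magnitudes representable in E2M1 (sign bit is separate).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize and dequantize one 32-value block; returns (scale, values)."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * 32
    # E8M0 scale: a pure power of two chosen so the largest magnitude
    # lands inside E2M1's range (top value 6 = 1.5 * 2^2, so emax = 2).
    # Zero mantissa bits means the scale can only step in factors of 2.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clip to E2M1's max magnitude
        # Round to the nearest representable E2M1 magnitude.
        q = min(E2M1_LEVELS, key=lambda lv: abs(lv - mag))
        out.append(math.copysign(q, x) * scale)
    return scale, out

# Storage cost: 32 elements * 4 bits + one 8-bit shared scale = 136 bits,
# i.e. 136 / 32 = 4.25 bits per weight -- the figure quoted above.
bpw = (32 * 4 + 8) / 32
print(bpw)
```

The coarse power-of-two scale plus only eight magnitude levels is why the accuracy per bit is poor unless the model's weights were trained to sit on exactly this grid.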
The main benefit I see is that it is compatible with both ik_llama.cpp and mainline llama.cpp. It is different from NVFP4, so I don't think there is native hardware support for it on newer Blackwell CUDA GPUs either. You could benchmark your specific rig with llama-sweep-bench, but in general it seems to run at similar PP/TG speeds to, say, q4_0/q4_K/iq4_kss.
Personally, I believe that for most models you're better off with ik_llama.cpp's latest quantization types, or the usual q4_K and friends on mainline. I'm not sure why people release MXFP4 quants of so many models now; probably just hype.
what is the main reason you did that? just curious
I mention it some in the first discussion here, but for some odd reason, on this specific model the MXFP4 shows lower perplexity than other similarly sized quants. That doesn't mean it is necessarily "better" by any absolute measure. I still want to do some KLD measurements, as those will likely show a more expected curve, with smaller quants deviating more from the full bf16 baseline.
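The KLD idea is to compare the quantized model's next-token distribution against the full-precision baseline's, token by token, rather than only looking at perplexity. Here is a minimal sketch of that comparison; the logits below are made-up stand-ins, not real model outputs:

```python
# Sketch of the KLD measurement idea: compute D_KL(baseline || quant)
# over next-token probability distributions. The logits here are
# hypothetical placeholders; a real run collects them from the models.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); P = bf16 baseline, Q = quant.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical bf16 outputs
quant_logits    = [1.8, 1.1, 0.3, -0.7]  # hypothetical MXFP4 outputs

kld = kl_divergence(softmax(baseline_logits), softmax(quant_logits))
print(f"{kld:.5f}")  # 0 would mean identical output distributions
```

Averaged over many tokens, this gives a direct measure of how far the quant drifts from the baseline, which is harder for a quant to "accidentally" win than perplexity.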
Hope that helps! What do you think and what quant types do you prefer?
Cheers!
Thanks for the reply! It finally answers a question I've been wondering about for a while. It's odd, then, that I see so many models quantized to mxfp4 when they shouldn't be. I'll stick to Q4_K_M.