MXFP4 slower than Q4_K_M
#21
by ykarout - opened
Hello @danielhanchen, I have tried the MXFP4 (from Unsloth) and Q4_K_M (from the official Qwen repo) quants under the same settings in the latest llama.cpp and the latest LM Studio. Q4_K_M generation was significantly faster (48 vs. 36 tok/s) across numerous tests. I should mention that I am running on a 16GB RTX 5080, so some MoE layers were offloaded to CPU RAM. Is this a normal result, given that MXFP4 might not be as well optimized for CPU inference as Q4_K_M, and is only faster when all layers fit on the GPU where it can benefit from native tensor-core support? Or am I doing something wrong?
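One way to make the comparison reproducible is to benchmark both quants with llama.cpp's `llama-bench` tool under identical prompt/generation settings, and to pin the MoE expert tensors to CPU RAM explicitly with `--override-tensor` so both runs offload the same tensors. This is only a sketch: the GGUF file names below are placeholders, and the expert-tensor regex may need adjusting for your model's tensor naming.

```shell
# Hypothetical file names -- substitute your actual GGUF paths.
# Benchmark both quants in one run (llama-bench accepts comma-separated models):
./llama-bench -m model-mxfp4.gguf,model-q4_k_m.gguf -ngl 99 -p 512 -n 128

# To make the partial-offload case explicit, keep the MoE expert tensors
# (FFN experts) in CPU RAM while everything else stays on the GPU:
./llama-cli -m model-mxfp4.gguf -ngl 99 \
  -ot "ffn_.*_exps.*=CPU" \
  -p "Hello" -n 128
```

With the expert tensors forced to CPU on both quants, any remaining tok/s gap should reflect the CPU dequantization cost of each format rather than differences in which layers happened to spill out of VRAM.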