MXFP4 slower than Q4_K_M
#21
by ykarout - opened
Hello @danielhanchen, I have tried the MXFP4 (from Unsloth) and Q4_K_M (from the official Qwen repo) quants under the same settings in the latest llama.cpp and the latest LM Studio. Q4_K_M generation was significantly faster (48 vs. 36 tok/s) across numerous tests. I should mention that I am running on a 16GB RTX 5080, so some MoE layers were offloaded to CPU RAM. Is this a normal result, given that MXFP4 might not be as well optimized for CPU inference as Q4_K_M, and is only faster when all layers fit on the GPU where it can benefit from native tensor-core support? Or am I doing something wrong?
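One way to make the comparison reproducible is to benchmark both quants with llama.cpp's `llama-bench` tool under identical prompt/generation settings, and to pin the MoE expert tensors to CPU RAM explicitly with `--override-tensor` so both runs offload the same tensors. This is only a sketch: the GGUF file names below are placeholders, and the expert-tensor regex may need adjusting for your model's tensor naming.

```shell
# Hypothetical file names -- substitute your actual GGUF paths.
# Benchmark both quants in one run (llama-bench accepts comma-separated models):
./llama-bench -m model-mxfp4.gguf,model-q4_k_m.gguf -ngl 99 -p 512 -n 128

# To make the partial-offload case explicit, keep the MoE expert tensors
# (FFN experts) in CPU RAM while everything else stays on the GPU:
./llama-cli -m model-mxfp4.gguf -ngl 99 \
  -ot "ffn_.*_exps.*=CPU" \
  -p "Hello" -n 128
```

With the expert tensors forced to CPU on both quants, any remaining tok/s gap should reflect the CPU dequantization cost of each format rather than differences in which layers happened to spill out of VRAM.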