slower than non-LIMI

#1
by bobig - opened

Initial TPS is about 10% slower than a same-size model: unsloth-glm-4.5-air-mlx MXFP4.
LIMI's speed also seems to drop faster than that model's as the context grows.

Maybe the extra LIMI thinking is filling up the context, causing the speed drop.
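One way to check this guess would be to compare the two models at the same context length instead of over a whole run. The snippet below is only a rough sketch: it assumes you can log, for each generated token, how many tokens were already in context and how long that token took. The function name `tps_by_context` and the bucket size are made up for illustration and are not part of any MLX tooling.

```python
from collections import defaultdict


def tps_by_context(samples, bucket_size=1024):
    """Group per-token timings by context length and return tokens/second.

    samples: iterable of (context_tokens, seconds_for_token) pairs
             (hypothetical log format, assumed for this sketch).
    Returns a dict mapping bucket start (in tokens) to tokens per second.
    """
    totals = defaultdict(lambda: [0, 0.0])  # bucket -> [token count, seconds]
    for context_tokens, seconds in samples:
        bucket = (context_tokens // bucket_size) * bucket_size
        totals[bucket][0] += 1
        totals[bucket][1] += seconds
    # Skip buckets with zero elapsed time to avoid dividing by zero.
    return {b: n / t for b, (n, t) in sorted(totals.items()) if t > 0}


# Placeholder numbers, not real measurements from either model.
example_log = [(0, 0.25), (100, 0.25), (1500, 0.5)]
print(tps_by_context(example_log))
```

If both models show similar TPS within the same bucket, the slowdown would mostly come from LIMI producing more thinking tokens and reaching the slow buckets sooner. If LIMI is slower within the same bucket, something else is going on.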

Yup, I noticed the same thing, and uploaded the mxfp4 for comparison. The model is too big for my hardware, so I can’t run proper tests; a vibe check is the only way I can test it. The qx quants mitigate the issue somewhat, but my take is that mxfp4 is too “noisy” for inference, and the model struggles. The hi quants think less (more self-confidence), so you see some differences in speed.
