How to make MXFP4 Quants
Thanks for all your efforts on MXFP4 MLX quants. Do you have a script for making the quants, or commands I could use? I wouldn't mind making updated versions of other models.
Hey @DatToad, happy to help! You can do this with a one-liner, actually. This is the exact command I ran to quantize and upload this model: mlx_lm.convert --hf-path "zerofata/GLM-4.5-Iceblink-v2-106B-A12B" -q --q-mode mxfp4 --q-group-size 32 --upload-repo "beezu/zerofata_GLM-4.5-Iceblink-v2-106B-A12B-MLX-MXFP4".
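If you end up quantizing a bunch of models, it can help to build that same command programmatically instead of editing it by hand each time. A minimal sketch; the helper name and the output-repo naming convention are just my own (chosen to match the repo name above), not anything mlx_lm provides:

```python
# Build the mlx_lm.convert invocation above for any HF repo.
# mxfp4_convert_cmd and the "<user>/<repo with / -> _>-MLX-MXFP4" naming
# are my own conventions, not part of mlx_lm.
import shlex

def mxfp4_convert_cmd(hf_repo: str, upload_user: str) -> list[str]:
    out_repo = f"{upload_user}/{hf_repo.replace('/', '_')}-MLX-MXFP4"
    return [
        "mlx_lm.convert",
        "--hf-path", hf_repo,
        "-q", "--q-mode", "mxfp4",
        "--q-group-size", "32",
        "--upload-repo", out_repo,
    ]

cmd = mxfp4_convert_cmd("zerofata/GLM-4.5-Iceblink-v2-106B-A12B", "beezu")
print(shlex.join(cmd))
```

Running the printed command reproduces the exact invocation above for that model, and swapping in another hf_repo gives you the matching command for it.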
Note that this is more RAM- and GPU-intensive than normal quants, and it will take longer as well. I'm on a 128GB M3 Max with the VRAM limit set at 116GB, and I ended up filling the VRAM and using about 10GB of swap space at points during this model's quantization. I had GPU timeouts (causing it to fail mid-quantization) a couple of times, likely due to slow swap speeds. What ultimately worked for me was rebooting my Mac, launching the terminal and nothing else, and starting the quantization. Those steps are probably overkill for making MXFP4 quants of smaller models, but I figured I'd mention them to save you some headache if you run into those errors too.
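For reference, the VRAM limit I mentioned is the GPU wired-memory cap, which on Apple Silicon is commonly raised with the iogpu.wired_limit_mb sysctl (it takes a value in MiB and resets on reboot; double-check the sysctl name on your macOS version). A tiny helper that just prints the command for a given cap rather than running it:

```python
# Print (not execute) the sysctl command to raise the GPU wired-memory cap.
# iogpu.wired_limit_mb takes MiB, so 116 GiB -> 116 * 1024 MiB.
def wired_limit_cmd(gib: int) -> str:
    return f"sudo sysctl iogpu.wired_limit_mb={gib * 1024}"

print(wired_limit_cmd(116))  # sudo sysctl iogpu.wired_limit_mb=118784
```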
I believe quantizations like QX65g are generally mixed quantizations: some layers are given a higher-quality quantization than others, e.g. some layers getting 6-bit and others getting 5-bit for a QX65. Hybrid quantizations like that are a rough MLX-world equivalent of the GGUF "Q4_K_M" quantizations. I asked someone who makes these hybrid quantizations regularly about all of this over in this discussion thread: https://huggingface.co/nightmedia/unsloth-GLM-4.5-Air-qx64-mlx/discussions/2.
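To make that concrete, here's a toy predicate showing the shape of a per-layer decision for a QX65-style mix. The dict return format follows mlx_lm's mixed-quant predicates as I understand them, but the which-layers-get-6-bit policy here is entirely made up for illustration:

```python
# Toy QX65-style predicate: decide bits per layer from its parameter path.
# The selection rule (attention + embeddings + head get 6-bit) is a made-up
# example, not mlx_lm's actual policy.
def qx65_predicate(path: str, group_size: int = 64) -> dict:
    high_precision = any(k in path for k in ("self_attn", "embed_tokens", "lm_head"))
    return {"bits": 6 if high_precision else 5, "group_size": group_size}

print(qx65_predicate("model.layers.0.self_attn.q_proj"))  # {'bits': 6, 'group_size': 64}
print(qx65_predicate("model.layers.0.mlp.gate_proj"))     # {'bits': 5, 'group_size': 64}
```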
Making these in higher qualities like QX65, QX86, etc. essentially requires a modified version of mlx_lm.convert that adds new --quant-predicate options, since the defaults are what we'd call QX62, QX43, QX63, and QX64. (I'm pulling those defaults from running mlx_lm.convert --help, to be clear.) I did test things out and it worked, but I stupidly wiped out my modifications in a pip update without backing them up first, and just haven't gotten around to redoing the modifications and testing further.
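If you want to redo those modifications yourself, the pattern I'd expect is a small builder that turns a low/high bit pair into a predicate, which you could then wire up as a new --quant-predicate choice. This is a sketch only: the builder name mimics mlx_lm's mixed-quant helpers, and the boost policy (embeddings, head, every 4th block) is a placeholder, not the real default:

```python
# Hypothetical builder for new mixed recipes, e.g. QX65 = 5-bit base, 6-bit boost.
# The boost policy below (embeddings, head, every 4th block) is a placeholder,
# not mlx_lm's actual heuristic.
def mixed_quant_builder(low_bits: int, high_bits: int, group_size: int = 64):
    def predicate(path: str) -> dict:
        if "embed_tokens" in path or "lm_head" in path:
            return {"bits": high_bits, "group_size": group_size}
        parts = path.split(".")
        if "layers" in parts:
            layer_idx = int(parts[parts.index("layers") + 1])
            if layer_idx % 4 == 0:
                return {"bits": high_bits, "group_size": group_size}
        return {"bits": low_bits, "group_size": group_size}
    return predicate

qx65 = mixed_quant_builder(low_bits=5, high_bits=6)
print(qx65("model.layers.4.mlp.down_proj"))  # boosted block -> 6-bit
print(qx65("model.layers.5.mlp.down_proj"))  # base -> 5-bit
```

A QX86 would then just be mixed_quant_builder(low_bits=6, high_bits=8) with whatever boost policy you settle on.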
As for benchmarks, it's not something I've really explored unfortunately. I wish I could point you to resources for that but I've got nothing at the moment.