How to make MXFP4 Quants

#1
by DatToad - opened

Thanks for all your efforts on MXFP4 MLX quants. Do you have a script for making the quants, or commands I could use? I wouldn't mind making updated versions of other models.

Hey @DatToad, happy to help! You can actually do this in a one-liner. This is the exact command I ran to quantize and upload this model:

```shell
mlx_lm.convert --hf-path "zerofata/GLM-4.5-Iceblink-v2-106B-A12B" -q --q-mode mxfp4 --q-group-size 32 --upload-repo "beezu/zerofata_GLM-4.5-Iceblink-v2-106B-A12B-MLX-MXFP4"
```

Note that this is more RAM- and GPU-intensive than normal quants, and will take longer as well. I'm on a 128GB M3 Max with the VRAM limit set at 116GB, and I ended up filling the VRAM and using about 10GB of swap space at points during this model's quantization. I had GPU timeouts (causing it to fail mid-quantization) a couple of times, likely due to slow swap speeds. What ultimately worked for me was rebooting my Mac, launching the terminal and nothing else, and starting the quantization. Those steps are probably overkill for making mxfp4 quants of smaller models, but I figured I'd mention them to save you some headache if you run into those errors too.
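For reference, the VRAM limit mentioned above can be raised on Apple Silicon via a sysctl. A minimal sketch, assuming a recent macOS that exposes the `iogpu.wired_limit_mb` key; the value below is an illustrative example, not necessarily the exact setting used here:

```shell
# Raise the GPU wired-memory limit to roughly 116 GB (value is in MB).
# Assumes Apple Silicon macOS exposing the iogpu.wired_limit_mb sysctl key.
# This setting does not persist across reboots.
sudo sysctl iogpu.wired_limit_mb=118784
```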

Owner

@DatToad Little update here: mxfp4 (and mxfp8/nvfp4) are much quicker and less resource-intensive now. Essentially if you can make a normal MLX quant with your hardware, you can do mxfp4/mxfp8/nvfp4 without your Mac breaking a sweat now.
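Assuming the same `mlx_lm.convert` interface as the command earlier in the thread, and that `--q-mode` also accepts `mxfp8` and `nvfp4` as this update implies, the other formats would just be a flag change. Repo paths here are placeholders:

```shell
# MXFP8 quant (placeholder model path; assumes --q-mode accepts mxfp8)
mlx_lm.convert --hf-path "some-org/some-model" -q --q-mode mxfp8 --q-group-size 32

# NVFP4 quant (NVFP4 typically uses 16-element blocks; confirm the
# supported modes and group sizes with: mlx_lm.convert --help)
mlx_lm.convert --hf-path "some-org/some-model" -q --q-mode nvfp4 --q-group-size 16
```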

@beezu Thanks for that. Are there any guides or benchmarks for the various other kinds of quants out there? For instance, I saw a model with a QX65g quant that I think compared favorably to other 6-bit quants, but I haven't found much information about it.

Owner

I believe quantizations like QX65g are generally mixed quantizations: some layers are given a higher-quality quantization than others, e.g. some layers getting 6-bit and others getting 5-bit for a QX65. Hybrid quantizations like that are the rough MLX-world equivalent of GGUF's "Q4_K_M" quantizations. I asked someone who makes these hybrid quantizations regularly about all of this over in this discussion thread: https://huggingface.co/nightmedia/unsloth-GLM-4.5-Air-qx64-mlx/discussions/2.

Making these at higher qualities like QX65, QX86, etc. essentially requires a modified version of mlx_lm.convert that adds new --quant-predicate options, since the built-in options are what we'd call QX62, QX43, QX63, and QX64 (I'm pulling those defaults from running mlx_lm.convert --help, to be clear). I did test my modifications and they worked, but I stupidly wiped them out in a pip update without backing them up first, and just haven't gotten around to redoing the modifications and testing further.
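As a sketch of what such a predicate looks like: mlx_lm's Python `convert()` accepts a `quant_predicate` callable that can return per-layer quantization settings as a dict. The layer-name patterns below and the "attention and embeddings get 6-bit" split are illustrative assumptions for a QX65-style mix, not the defaults mlx_lm ships:

```python
def qx65_predicate(path: str, module=None, config=None):
    """Illustrative QX65-style mixed quantization: 6-bit for layers assumed
    to be more sensitive to quantization error, 5-bit everywhere else.

    Returning a dict of per-layer settings is how mlx_lm's quant_predicate
    signals mixed precision; the name patterns here are assumptions.
    """
    sensitive = ("embed", "lm_head", "attn")
    if any(key in path for key in sensitive):
        return {"bits": 6, "group_size": 32}
    return {"bits": 5, "group_size": 32}

# A predicate like this would then be passed to the Python API, e.g.:
#   from mlx_lm import convert
#   convert("some-org/some-model", quantize=True,
#           quant_predicate=qx65_predicate)
```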

As for benchmarks, it's not something I've really explored unfortunately. I wish I could point you to resources for that but I've got nothing at the moment.
