Any plans on creating ik quants for REAP models, please?
Like unsloth's https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF or https://huggingface.co/unsloth/Qwen3-Coder-REAP-363B-A35B-GGUF ? There is also no ik quantization of MiniMax-M2, even though there is a REAP version of it as well as https://huggingface.co/mradermacher/MiniMax-M2-THRIFT-i1-GGUF .
Thanks for the suggestions, I'm always curious what models folks are using.
I've heard mixed reports on REAP, with some perplexity testing suggesting it isn't delivering the advertised "near lossless" performance but rather a noticeably higher perplexity than the original. There is some promise though, and it's good to have another way of shrinking big quants to make them more accessible, even with some quality degradation. What have you heard about the quality of REAP quants?
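For anyone following along: perplexity here is just the exponential of the mean negative log-likelihood over the evaluation tokens, which is what tools like `llama-perplexity` report. A minimal sketch with made-up per-token log-probabilities (the numbers are illustrative, not real measurements of any quant):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the eval tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Made-up natural-log probabilities for illustration only.
original = [-1.0, -1.2, -0.9, -1.1]   # hypothetical full model
reaped   = [-1.3, -1.5, -1.2, -1.4]   # hypothetical pruned model

print(round(perplexity(original), 3))
print(round(perplexity(reaped), 3))
```

Small absolute differences in mean log-likelihood blow up exponentially, which is why even a modest perplexity gap can matter.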
Right, MiniMax-M2 has been merged into ik_llama.cpp for a couple of weeks now: https://github.com/ikawrakow/ik_llama.cpp/pull/907 . I also don't see a specific ik quantization for it or its recent REAP/THRIFT versions. I was originally holding out to do GLM-4.6-Air instead, as it is in roughly the same size class, but it still hasn't dropped yet...
As you probably already know, you can run the regular mainline quants on ik_llama.cpp and still get a potential speed boost, especially with hybrid inference: putting the routed experts on CPU and everything else on GPU.
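For reference, a typical hybrid invocation looks something like the following. The model path and context size are placeholders, and the exact tensor-override regex can vary between versions, so check `--help` on your build:

```shell
# Offload all layers to GPU, then override the routed-expert tensors
# back onto CPU so only attention/shared/dense weights occupy VRAM.
./llama-server \
  -m /models/your-model.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 32768
```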
I'm probably not going to do one of these models at the moment. But what kinda RAM+VRAM rig are you targeting to fit it into just in case?
Thanks for the response, John. I have a somewhat unusual rig: 128GB RAM + 68GB VRAM (3060 12GB + 3090 Ti FE + 5090 FE). I can likely do the conversion for MiniMax on my own, but without getting perplexity numbers etc. I wanted to check what would be better for coding: the REAPed 4-bit GLM or the original 3-bit one (the 4-bit smol one only allows a really small context). I know I could run the normal ones; I'm just trying to get the best results possible.
Oh, that is a nice setup, ~196GB combined RAM+VRAM. Without hard numbers on all of the quants you listed, it is difficult to say with evidence.
My gut would be to suggest using the full size GLM-4.6 in this repo e.g.
- smol-IQ4_KSS 169.895 GiB (4.090 BPW)
- IQ3_KS 148.390 GiB (3.573 BPW) <--- probably most quality in smallest package allowing more context
Keep in mind that some of the listed size here is for the MTP tensors, which are marked unused and not actually loaded. The upshot is that the RAM+VRAM required is actually a bit lower (maybe 4-8GB, I forget exactly) than what is printed there.
Sounds like you want longer context. You can use -ctk q8_0 -ctv q8_0 to save space, but especially with GLM-4.6 this can tend to slow down token generation at larger context sizes. If you really want to push it, you can drop down to q6_0, but I wouldn't suggest going lower than that. You could also compress K more than V (or the other way around); I forget which one tolerates compression better, but the quality impact seems asymmetric. I don't have data of my own to show that either; you'd have to search old PRs on ik_llama.cpp for ik's advice.
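To get a feel for the savings: the KV cache grows linearly with context and with bits per element, so q8_0 roughly halves it versus f16. A back-of-the-envelope sketch, using illustrative layer/head counts rather than GLM-4.6's actual architecture:

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Approximate KV cache size: one K and one V entry per layer per token.
    Ignores per-block quantization metadata overhead."""
    bytes_total = 2 * n_ctx * n_layers * n_kv_heads * head_dim * bits_per_elem / 8
    return bytes_total / 2**30

# Illustrative numbers only, not the real GLM-4.6 config.
# q8_0 and q6_0 effective bits include some block-scale overhead.
for name, bits in [("f16", 16), ("q8_0", 8.5), ("q6_0", 6.5)]:
    print(name, round(kv_cache_gib(32768, 90, 8, 128, bits), 2))
```

The ratio between the cache types is what matters here, not the absolute numbers, since those depend on the real layer count and attention layout.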
I found the reference for some random reddit vibes:
Tried glm 4.5 air, minimax m2, deepseek, all the qwens, oss 120 etc. None of them come close for my use cases [to GLM-4.6].
https://www.reddit.com/r/LocalLLaMA/comments/1p0r5ww/comment/npms00z/
And fwiw they are using a worse quality quant than the ones I suggest above.
Thanks a lot for the suggestions. I've just got a 5090, so I'm still waiting for a PCIe cable and a 90-degree power cable to add the 3060 back. That will likely do the trick for getting the full GLM working. Will update the thread with benchmarks.
with GLM-4.6 this can tend to slow down token generation at larger context sizes
I think this has been fixed in https://github.com/ikawrakow/ik_llama.cpp/pull/899
I've heard mixed reports on REAP, with some perplexity testing suggesting it isn't delivering the "near lossless" performance but quite a higher perplexity than original.
From my testing, it is decent for coding tasks, but it almost totally loses the ability to recite wikitext. That may not be a bad thing though; somehow GLM-4.6-REAP-218B-A32B was able to successfully troubleshoot a deadlock concurrency issue, the first open-weight model I've seen do it.
One thing to note: for quants of the same size in GiB, a REAP quant is less sparse (fewer total experts for the same number of active ones, hence more bits per weight) and thus has slower TG.
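A rough way to see that: at a fixed file size, pruning experts raises the bits per weight, and TG on a memory-bound MoE is roughly limited by the bytes of active parameters read per token. A sketch, where the total/active parameter counts are assumptions for illustration, not exact figures for these models:

```python
def active_gib_per_token(file_gib, total_params_b, active_params_b):
    """Approximate GiB read per generated token for a memory-bound MoE.
    BPW = file bits / total params; active bytes = active params * BPW / 8.
    This algebraically reduces to (active/total) * file size."""
    bpw = file_gib * 8 * 2**30 / (total_params_b * 1e9)
    return active_params_b * 1e9 * bpw / 8 / 2**30

# Same hypothetical 150 GiB file; illustrative parameter counts.
full = active_gib_per_token(150, 357, 32)   # full model, more total experts
reap = active_gib_per_token(150, 268, 32)   # pruned model, same active count
print(round(full, 2), round(reap, 2))
```

Since the active expert count stays the same while total params shrink, the pruned model at equal file size spends more bits on each active weight, so each token touches more bytes.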