Any plans on creating ik quants for REAP models, please?
Like unsloth's https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF or https://huggingface.co/unsloth/Qwen3-Coder-REAP-363B-A35B-GGUF ? There is also no ik quantization of MiniMax-M2, even though there is a REAP version of it as well as https://huggingface.co/mradermacher/MiniMax-M2-THRIFT-i1-GGUF .
Thanks for the suggestions, I'm always curious what models folks are using.
I've heard mixed reports on REAP, with some perplexity testing suggesting it isn't delivering the advertised "near lossless" performance but rather a noticeably higher perplexity than the original. There is some promise though, and it's good to have another way of shrinking big quants to make them more accessible, even with some quality degradation. What have you heard about the quality of REAP quants?
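For anyone following along: perplexity here is just the exponential of the mean negative log-likelihood over the evaluation tokens, which is what tools like `llama-perplexity` report. A minimal sketch with made-up per-token log-probabilities (the numbers are illustrative, not real measurements of any quant):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the eval tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Made-up natural-log probabilities for illustration only.
original = [-1.0, -1.2, -0.9, -1.1]   # hypothetical full model
reaped   = [-1.3, -1.5, -1.2, -1.4]   # hypothetical pruned model

print(round(perplexity(original), 3))
print(round(perplexity(reaped), 3))
```

Small absolute differences in mean log-likelihood blow up exponentially, which is why even a modest perplexity gap can matter.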
Right, MiniMax-M2 has been merged into ik_llama.cpp for a couple of weeks now: https://github.com/ikawrakow/ik_llama.cpp/pull/907 . I also don't see a specific ik quantization for it or its recent REAP/THRIFT versions. I was originally holding out to do GLM-4.6-Air instead, as it is in roughly the same size class, but it still hasn't dropped yet...
As you probably already know, you can run the regular mainline quants on ik_llama.cpp and still get a potential speed boost, especially with hybrid inference: putting the routed experts on CPU and everything else on GPU.
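For reference, a typical hybrid invocation looks something like the following. The model path and context size are placeholders, and the exact tensor-override regex can vary between versions, so check `--help` on your build:

```shell
# Offload all layers to GPU, then override the routed-expert tensors
# back onto CPU so only attention/shared/dense weights occupy VRAM.
./llama-server \
  -m /models/your-model.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 32768
```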
I'm probably not going to do one of these models at the moment. But what kinda RAM+VRAM rig are you targeting to fit it into just in case?
Thanks for the response, John. I have a somewhat unusual rig: 128GB RAM + 68GB VRAM (3060 12GB + 3090 Ti FE + 5090 FE). I can likely do the conversion for MiniMax on my own, but without getting perplexity numbers etc. I wanted to check what would be better for coding: the REAPed 4-bit GLM or the original 3-bit one (the 4-bit smol one only allows a really small context). I know I could run the normal ones; I'm just trying to get the best results possible.
Oh, that is a nice setup, ~196GB combined RAM+VRAM. Without hard numbers on all of the quants you listed, it is difficult to say with evidence.
My gut would be to suggest using the full size GLM-4.6 in this repo e.g.
- smol-IQ4_KSS 169.895 GiB (4.090 BPW)
- IQ3_KS 148.390 GiB (3.573 BPW) <--- probably most quality in smallest package allowing more context
Keep in mind that some of the listed size here is for the MTP tensors, which are marked unused and not actually loaded. The upshot is that the RAM+VRAM required is actually a bit lower (maybe 4-8GB, I forget exactly) than what is printed there.
Sounds like you want longer context. You can use -ctk q8_0 -ctv q8_0 to save space, but especially with GLM-4.6 this can tend to slow down token generation at larger context sizes. If you really want to push it, you can drop down to q6_0, but I wouldn't suggest going lower than that. You could also compress K more than V (or the other way around); I forget which one tolerates compression better, but the quality impact seems asymmetric. I don't have data of my own to show that either; you'd have to search old PRs on ik_llama.cpp for ik's advice.
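To get a feel for the savings: the KV cache grows linearly with context and with bits per element, so q8_0 roughly halves it versus f16. A back-of-the-envelope sketch, using illustrative layer/head counts rather than GLM-4.6's actual architecture:

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Approximate KV cache size: one K and one V entry per layer per token.
    Ignores per-block quantization metadata overhead."""
    bytes_total = 2 * n_ctx * n_layers * n_kv_heads * head_dim * bits_per_elem / 8
    return bytes_total / 2**30

# Illustrative numbers only, not the real GLM-4.6 config.
# q8_0 and q6_0 effective bits include some block-scale overhead.
for name, bits in [("f16", 16), ("q8_0", 8.5), ("q6_0", 6.5)]:
    print(name, round(kv_cache_gib(32768, 90, 8, 128, bits), 2))
```

The ratio between the cache types is what matters here, not the absolute numbers, since those depend on the real layer count and attention layout.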
I found the reference for some random reddit vibes:
Tried glm 4.5 air, minimax m2, deepseek, all the qwens, oss 120 etc. None of them come close for my use cases [to GLM-4.6].
https://www.reddit.com/r/LocalLLaMA/comments/1p0r5ww/comment/npms00z/
And fwiw they are using a worse quality quant than the ones I suggest above.
Thanks a lot for the suggestions. I've just got a 5090, so I'm still waiting for a PCIe cable and a 90-degree power cable to add the 3060 back. That will likely do the trick for getting the full GLM working. Will update the thread with benchmarks.
with GLM-4.6 this can tend to slow down token generation at larger context sizes
I think this has been fixed in https://github.com/ikawrakow/ik_llama.cpp/pull/899
I've heard mixed reports on REAP, with some perplexity testing suggesting it isn't delivering the "near lossless" performance but quite a higher perplexity than original.
From my testing, it is decent for coding tasks, but it almost totally loses the ability to recite wikitext. That may not be a bad thing though; somehow GLM-4.6-REAP-218B-A32B was able to successfully troubleshoot a deadlock concurrency issue, the first open-weight model I've seen do it.
One thing to note: for quants of the same size in GiB, a REAP quant is less sparse (fewer total experts for the same number of active ones, hence more bits per weight) and thus has slower TG.
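A rough way to see that: at a fixed file size, pruning experts raises the bits per weight, and TG on a memory-bound MoE is roughly limited by the bytes of active parameters read per token. A sketch, where the total/active parameter counts are assumptions for illustration, not exact figures for these models:

```python
def active_gib_per_token(file_gib, total_params_b, active_params_b):
    """Approximate GiB read per generated token for a memory-bound MoE.
    BPW = file bits / total params; active bytes = active params * BPW / 8.
    This algebraically reduces to (active/total) * file size."""
    bpw = file_gib * 8 * 2**30 / (total_params_b * 1e9)
    return active_params_b * 1e9 * bpw / 8 / 2**30

# Same hypothetical 150 GiB file; illustrative parameter counts.
full = active_gib_per_token(150, 357, 32)   # full model, more total experts
reap = active_gib_per_token(150, 268, 32)   # pruned model, same active count
print(round(full, 2), round(reap, 2))
```

Since the active expert count stays the same while total params shrink, the pruned model at equal file size spends more bits on each active weight, so each token touches more bytes.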