Appreciation for IQ1_KT

#5
by Radamanthys11 - opened

First, I want to thank you for the IQ1_KT option. I tried other Q1s from unsloth and DevQuasar but they fell into endless repetition. As someone who tried DeepSeek R1 and V3, I knew that it was possible to have a good Q1 to use, and you were the only one that achieved that.

That being said, this KT type is not the best for hybrid inference, as you wrote, so I want to ask: is it possible to have an option similar in size and quality but better suited to CPU + GPU? I'm getting roughly 5.5 t/s, while on the other Q1s (although useless) I could get around 7.5 t/s.

Lastly, I love your quants; I'm already waiting for more with GLM-5.

Owner
β€’
edited Feb 15

@Radamanthys11

Thanks, I appreciate the kind words!

Yes, the KT / QTIP / EXL3 quants generally work best with full GPU offload for fast TG (decode). They pack a lot of quality into such a small size too, as you observe. It also lets folks without enough RAM+VRAM for larger quants at least get a taste, even if slower hah...

What is your RAM+VRAM situation, and what kind of hardware, e.g. AMD, Intel, CUDA, or what? The existing smol-IQ2_KS is probably the next size up that runs well on CPU/RAM.


Below is for https://huggingface.co/ubergarm/GLM-5-GGUF

Oh wait, you're talking about Kimi-K2.5... I'm currently cooking GLM-5 right now. This reply below was assuming GLM-5 haha, sorry I got confused.

It is pretty wild to me that this sub-2bpw quant is working right now with opencode in my testing on CPU only haha...

is possible to have an option similar in size and quality but better for CPU + GPU?

Yup, I'm thinking maybe a smol-IQ2_KS, which is typically as low as I like to go taking advantage of ik's newer quant types. The iq2_ks routed experts run quite well on CPU/RAM.

I'll likely aim for a smol-IQ3_KS too for those with more RAM+VRAM as that is about the sweet spot historically before quality begins to drop off faster.

Haha, sorry for mixing things up! I'm currently using Kimi while I wait for the GLM PR to be merged and the new quants, though I don't doubt that GLM will perform better.

My struggle comes mostly from trying to run a 1T model like Kimi. I'm currently in a bit of a weird spot with 204 GB RAM and 48 GB VRAM. I plan to upgrade to 256 GB RAM in the future, but until then, I can't really run the available Q2 options. That is why I was wondering if something like a "smallest possible Q2" or a "high-end Q1" is achievable?

The IQ1_KT already takes up almost all my available memory, so I understand if better options aren't realistic right now.

@Radamanthys11

Yeah, it's hard to go below iq2_ks for routed exps imo. iq2_ks is 2.1875 BPW, which sounds like it's already too big for you. The iq1_kt is 1.75 BPW.

To match or go below 1.75 BPW, that leaves us with:

  • IQ1_S 1.56 bpw
  • IQ1_S_R4 1.5 bpw (it's the only non-symmetrical _r4 I think, but I haven't used it since my first DeepSeek... non-repacked quants have faster PP at large batch sizes now)
  • IQ1_M 1.75 bpw

These quant types will probably be faster TG for you running on CPU than the KT quants. But they are not as good in terms of quality as measured by perplexity, in my experience. Unfortunately I don't have a direct one-vs-one comparison to say with certainty, and I have been known to be wrong for sure lol (recently an IQ4_NL quant had the best perplexity, which shook me lol).

Unfortunately, a 1T model is just kinda big. Definitely consider the new GLM-5, as the smol-IQ2_KS will be just over 200GiB and easily fit on your rig with plenty of context, and you can probably offload a few more layers onto VRAM too. I'm dialing in the best recipe right now.
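As a rough sanity check on those sizes: on-disk size is roughly parameters × BPW / 8. A quick sketch (assuming ~1T total weights for a Kimi-K2.5-class model; real GGUFs land somewhat off this since attn/shexp/embeddings stay at higher precision):

```python
def quant_size_gib(n_params: float, bpw: float) -> float:
    """Approximate on-disk size of a model quantized at `bpw` bits per weight."""
    return n_params * bpw / 8 / 2**30  # bits -> bytes -> GiB

# ~1T-parameter model at the quant types discussed above
for name, bpw in [("IQ1_S_R4", 1.5), ("IQ1_S", 1.56),
                  ("IQ1_KT / IQ1_M", 1.75), ("IQ2_KS", 2.1875)]:
    print(f"{name:>15}: {quant_size_gib(1e12, bpw):6.1f} GiB")
```

So even the jump from 1.75 BPW to 2.1875 BPW is roughly 50 GiB at this scale, which is why the Q2 options don't fit in 204 GB RAM + 48 GB VRAM with room for context.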

I tried other Q1s from unsloth and DevQuasar but they fell into endless repetition.

You should click the .gguf file on HF to check the per-tensor quant types before downloading.

DevQuasar

Have a look at this (their Kimi-K2.5):

[screenshot: tensor listing from their Kimi-K2.5 GGUF]

They seem to quantize the attn tensors for some reason. It is immediately obvious this quant is never going to be usable.
And look how small attn is: crushing it down isn't going to save any meaningful amount of memory.
e.g. [7168, 1536] is 7168*1536 -> about 11 million parameters,
vs [2048, 7168, 384] is 2048*7168*384 -> about 5.6 billion parameters.

The other one to watch out for is token_embd.weight (in the first shared).

This should never be < q4_k, ideally higher (it's Q2_K in the DevQuasar quant)

Unsloth have a habit of crushing these down.

Really, for these types of models, you'd only want to shrink the routed up/down/gate exps.
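That shape arithmetic is just the product of the tensor dimensions. A quick sketch of the comparison above (shapes taken from the listing; variable names are my own, not actual GGUF tensor names):

```python
from math import prod

# Shapes from the tensor listing above: one small attention weight vs one
# stacked routed-expert FFN weight ([ffn_dim, hidden_dim, n_experts]).
attn_tensor = (7168, 1536)        # ~11M params: not worth crushing down
routed_exps = (2048, 7168, 384)   # ~5.6B params: where the real savings are

def n_params(shape):
    return prod(shape)

print(f"attn:  {n_params(attn_tensor) / 1e6:.1f} M params")
print(f"exps:  {n_params(routed_exps) / 1e9:.2f} B params")
print(f"ratio: {n_params(routed_exps) / n_params(attn_tensor):.0f}x")
```

With a ~500x size difference per tensor, quantizing attn hard buys almost nothing while costing a lot of quality; shrinking the routed up/down/gate exps is where essentially all the memory savings come from.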

"DevQuasar but they fell into endless repetition"
Thanks for calling this out!
When I tested the IQ1_M quant I also experienced the repetition, so that quant should have been removed, but I forgot.
Removing it now. The rest of the quants seemed to behave reasonably.

They seem to quantize the attn tensors for some reason. It is immediately obvious this quant is never going to be usable.
And look how small attn is: crushing it down isn't going to save any meaningful amount of memory.
e.g. [7168, 1536] is 7168*1536 -> about 11 million parameters,
vs [2048, 7168, 384] is 2048*7168*384 -> about 5.6 billion parameters.

The other one to watch out for is token_embd.weight (in the first shared).

This should never be < q4_k, ideally higher (it's Q2_K in the DevQuasar quant)

Thanks for the explanation! I had a rough idea of what was best to quantize versus what to leave alone, but I never took the time to look exactly at how it was done in the file. I will dedicate more time to inspecting the layers in future models to make sure I'm not downloading something just to delete it minutes later.


  • IQ1_S_R4 1.5 bpw (it's the only non-symmetrical _r4 I think, but I haven't used it since my first DeepSeek... non-repacked quants have faster PP at large batch sizes now)

I was an avid user of the R1-0528-IQ1_S_R4 you made, and it was a beast! Now that I know IQ1_KT is in a similar quantization bracket to IQ1_M, I might try to make an IQ1_M for myself later to test the speed difference. We will see in the future whether I can upgrade my memory to push directly to Q2 or if I'll stick to the sub-2bpw life.

Unfortunately, a 1T model is just kinda big. Definitely consider the new GLM-5, as the smol-IQ2_KS will be just over 200GiB and easily fit on your rig with plenty of context, and you can probably offload a few more layers onto VRAM too. I'm dialing in the best recipe right now.

Sure! I will keep testing Kimi a bit more, but I'm planning to download your IQ2_KS so I can test GLM-5 properly.

@csabakecskemeti

Thanks for calling this out!

Really appreciate you getting out quants so early so people can test! Great seeing you here DevQuasar!

@Radamanthys11

I might try to make an IQ1_M for myself later to test the speed difference.

It will definitely be faster TG given you're running on CPU/RAM for those tensors, but it will likely be worse in perplexity/KLD. As long as you keep the rest of the recipe similar without smashing down the attn/shexp/dense layers, it will likely still run well enough. Just change the "secret recipe" for the smol-IQ1_KT where it says iq1_kt twice to iq1_m (case sensitive) and you'll get a model of the exact same size, pretty sure.
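For illustration, that swap is just a case-sensitive string replacement over the recipe's tensor-override rules. This sketch assumes a regex=type rule format like ik_llama.cpp's custom-quant rules; the specific rule lines below are hypothetical, so check the real recipe text in the model card:

```python
# Hypothetical recipe excerpt: routed experts at iq1_kt (the two lines to
# change), attn/shexp/dense layers left at higher precision (leave alone).
recipe = """\
blk\\..*\\.ffn_down_exps\\.weight=iq1_kt
blk\\..*\\.ffn_(gate|up)_exps\\.weight=iq1_kt
blk\\..*\\.attn_.*\\.weight=iq4_ks
"""

# Case-sensitive swap of only the routed-expert quant type, as described above.
new_recipe = recipe.replace("=iq1_kt", "=iq1_m")
print(new_recipe)
```

Since iq1_m and iq1_kt are both 1.75 BPW, only the expert quant type changes; the non-expert rules (and thus the overall size) stay put.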

I'm planning to download your IQ2_KS so I can test GLM-5 properly.

Great, keep us posted! I'm really curious how GLM-5 stacks up against Kimi-K2.5... they both have MLA now, so kv-cache should be about the same VRAM usage.
