IQ3_K Discussion.
Xeon 8480+ QYFS, 512GB 5600 MHz, 2x 3090s: 10 tok/s. I'm only using 24 GB of VRAM, so this will fit on one 3090 with maybe a slightly smaller -c. I can start to add -ot overrides on GPU later, but it might not make much difference at this size.
```
numactl -N 0 -m 0 ../build/bin/llama-server \
    --model "/mnt/ExtraStorage/Models/Kimi-K2-Thinking-IQ3_K-00001-of-00011.gguf" \
    --alias ubergarm/KimiThink-IQ3_K \
    -mla 3 -amb 2048 -ub 4096 -b 4096 \
    -c 32768 -ctk q8_0 \
    -ngl 99 \
    -ot exps=CPU \
    --parallel 1 \
    --threads 88 \
    --host 127.0.0.1 \
    --port 8080 \
    --numa numactl \
    --split-mode layer --tensor-split 1,1 \
    --mirostat 2 --mirostat-ent 3 \
    --mirostat-lr 0.1
```
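Once the server is up, a quick sanity check against the local endpoint can confirm it's serving (this assumes the usual OpenAI-compatible `/v1/chat/completions` route that llama-server exposes; adjust host/port to match your flags):

```shell
# Minimal smoke-test request against the local llama-server instance.
# The prompt and max_tokens are just illustrative values.
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Say hello."}],
          "max_tokens": 32
        }'
```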
Seems coherent enough but is missing thinking tags. Still happy with it; will try the larger size after I download it.
Edit: I'm dumb, I didn't see the --special tag requirement.
Mirostat 2 at lower Quant Sizes:
Tried Ent 5, but its SVG code at that setting isn't very good: it comes out broken, and the model iterates on its own code way too much (like 8 times), making it worse every time. I don't recommend entropy 5 or 4 unless you only want it for chat.
What I found over the last 2 weeks:
- If things feel broken or don't work for a model with top-k/top-p, just use --mirostat 2 and lower the entropy target until output is coherent. It's my go-to for any model at a very low quant size, even Q2. Without it, output usually breaks down way too fast, after only a couple of turns.
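As a sketch, the relevant sampling flags on a llama-server launch look like this (the values are illustrative starting points from the command above, not tested recommendations):

```shell
# Mirostat 2 replaces top-k/top-p sampling with an entropy-targeting
# controller; lower --mirostat-ent step by step until the low-bit
# quant stays coherent. --mirostat-lr is the adaptation rate.
../build/bin/llama-server \
    --model your-low-bit-model.gguf \
    --mirostat 2 \
    --mirostat-ent 3 \
    --mirostat-lr 0.1
```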
SVG example at Ent3 and Ent5
Thanks for the report and glad you have found some settings that you like! Yeah I added that --special a bit late to the model card, I've never had to use it before.
My understanding is something like this will help the model out:
- `--special`
- `--jinja`
- `--chat-template-file customtemplate.jinja`
- updating to the latest version (they've made a couple changes since release)
- add something to your client to use `<|im_end|>` as a stop string, as it will be printed out now due to `--special`
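Putting those flags together, a relaunch might look like the following (the model path and template filename are placeholders, not exact paths from this thread):

```shell
# --special makes special tokens (e.g. <|im_end|>) visible in output,
# --jinja enables the chat-template engine, and --chat-template-file
# points at the custom template. The client should then treat
# <|im_end|> as a stop string, since --special now prints it.
../build/bin/llama-server \
    --model Kimi-K2-Thinking-IQ3_K-00001-of-00011.gguf \
    --special --jinja \
    --chat-template-file customtemplate.jinja
```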
I'll add this to the model card now. thanks!
Would it be possible to update IQ3_K with patched q4_0 tensors or is it better to just stick with smol-IQ4_KSS for this size range?
Right, I've made a new "Q4_X-IQ3_K" but didn't release it. It is smaller, but technically the smol-IQ4_KSS still scores "better" if you can run it. The ideal quant by far is the Q4_X itself; it is probably a bit large for some folks, but it has the lowest baseline perplexity.
Here is my most recent chart:
On this chart I have not released the:
- IQ3_KT (probably won't release it as it was not as good as I was hoping)
- Q4_X-IQ3_K (this is using the new modified q4_x patch for the q4_0 tensors) - I could release this and if so would likely delete the original IQ3_K
I plan to delete the original Q8_0-Q4_0 soon too given storage is limited on hf.
If folks would prefer the smaller Q4_X-IQ3_K 459.432GiB (3.845 BPW) which only uses imatrix on the iq3_k tensors let me know and I'll upload it.
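As a rough sanity check on those numbers, size and bits-per-weight should recover the approximate parameter count (Kimi K2 is a roughly 1T-parameter model):

```shell
# params ≈ size_in_GiB * 2^30 bytes * 8 bits/byte / BPW
# 459.432 GiB at 3.845 BPW → ~1.03 trillion parameters, consistent
# with a ~1T-parameter model.
awk 'BEGIN { printf "%.2f trillion params\n", 459.432 * 1024^3 * 8 / 3.845 / 1e12 }'
```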
Thanks for your work!
I think it would make sense to replace IQ3_K with Q4_X-IQ3_K since perplexity is almost the same, but the size is another ≈15 GB smaller. On 512 GB machines this would allow running this quant with plenty of RAM left over for other tasks.
Yeah I agree it is quite a bit smaller with negligible perplexity loss. Gimme a bit to split it and upload it!
Okay, uploading now, over-writing the previous IQ3_K. Hopefully it isn't too confusing. Check the date stamp on the files to confirm the new one has finished uploading: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/tree/main/IQ3_K (should be done within half an hour)
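For reference, one way to pull down just that folder is the huggingface_hub CLI (flag names per its documentation; this is a sketch, not the thread author's exact workflow):

```shell
# Download only the IQ3_K split files from the repo.
huggingface-cli download ubergarm/Kimi-K2-Thinking-GGUF \
    --include "IQ3_K/*" \
    --local-dir ./Kimi-K2-Thinking-GGUF
```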
Thanks once again! I'll put it on download tonight😊



