Unofficial models
Hello, I want to ask if you can make IQ4_KSS quantizations of unofficial AI models. As far as I can see, you are the only one making IQK quantizations on Hugging Face. I would like someone to make an IQ4_KSS (or IQ4_KT, but it seems you make only IQK quants) for https://huggingface.co/llmfan46/Qwen3.5-27B-heretic-v2
If you have 100GB of free disk space, you can quantize it yourself pretty quickly to any recipe you like, including copy-pasting my "secret recipe" for the 27B here and adjusting it to IQ4_KSS or IQ4_KT etc. Both are very nice for dense models with full GPU offload. I do like the KT "trellis" quants for low BPW, but I find IQ4_KSS is more generally applicable for CPU inference, and it's the same 4.0 BPW with similar PPL/KLD stats.
Basically:
- download the full bf16 safetensors from the original repo
- convert with mainline llama.cpp's `convert_hf_to_gguf.py` - you can use the imatrix from here: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF/resolve/main/mmproj-Qwen3.5-122B-A10B-BF16.gguf
- convert the gguf imatrix file to `.dat` format using mainline's `llama-imatrix` so that ik_llama.cpp can use it
- run `llama-quantize` using the `.dat` imatrix file and my 'secret recipe' adjusted to your liking
It doesn't take any VRAM to do this, and even on a modest gaming rig it won't take too long and should need less than 32GB of RAM total.
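The steps above can be sketched roughly as follows. The paths, filenames, and the IQ4_KSS target are illustrative placeholders, not my exact commands, and flag names can differ between mainline llama.cpp and ik_llama.cpp versions, so check each tool's `--help` for your build.

```shell
# 1. Convert the original bf16 safetensors to GGUF with mainline llama.cpp
python convert_hf_to_gguf.py /models/Qwen3.5-27B-heretic-v2 \
    --outfile /models/qwen-27b-bf16.gguf --outtype bf16

# 2. Convert the downloaded GGUF-format imatrix to the legacy .dat format
#    that ik_llama.cpp's llama-quantize can read (example invocation;
#    verify the conversion flags with `llama-imatrix --help`)
./llama-imatrix --in-file imatrix.gguf -o imatrix.dat

# 3. Quantize with ik_llama.cpp's llama-quantize, e.g. to IQ4_KSS
./llama-quantize --imatrix imatrix.dat \
    /models/qwen-27b-bf16.gguf /models/qwen-27b-IQ4_KSS.gguf IQ4_KSS
```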
Let me know if you need any help!
Thank you! Do you think 32GB RAM is enough? I have only an Intel Core Ultra 7 265K and 32GB of 6800 MHz RAM, no dGPU. Do you think I can try making quants for this model?
Anyways, thank you for your help. I think I will give it a try.
I make all my quants with exactly 0 vram haha...
Yes, the most demanding step (in terms of hardware and RAM) is generating the imatrix from the full-size bf16, since you must be able to run inference with the full model. But if someone else has made the imatrix for you, then the llama-quantize step itself takes very little by comparison. A fast NVMe drive is nice for the disk I/O, but if you're patient it should be fine.
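For reference, generating the imatrix yourself looks roughly like this. The model path and calibration file are placeholders, and this is a generic `llama-imatrix` invocation rather than my exact command.

```shell
# Run the full bf16 model over a calibration text to collect the
# importance matrix -- this is the step that needs real inference resources
./llama-imatrix -m /models/qwen-27b-bf16.gguf \
    -f calibration.txt \
    -o imatrix.dat
```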
If you want a very high-level overview of the process, you can check my recent talk: https://blog.aifoundry.org/p/adventures-in-model-quantization and I have a very old quant cooker's guide (out of date) here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
There are also more up-to-date commands in my recent logs/ folders.
Let me know if you get stuck on any part, good luck!