Unofficial models
Hello, I want to ask if you can make IQ4_KSS quantizations of unofficial AI models. As far as I can see, you are the only one making IQK quantizations on Hugging Face. I would like someone to make an IQ4_KSS (or IQ4_KT, but it seems you make only IQK quants) for https://huggingface.co/llmfan46/Qwen3.5-27B-heretic-v2
If you have 100GB of free disk space, you can quantize it yourself pretty quickly to any recipe you like, including copy-pasting my "secret recipe" for the 27B here and adjusting it to IQ4_KSS or IQ4_KT etc. Both are very nice for dense models with full GPU offload. I do like the KT "trellis" quants for low BPW, but I find IQ4_KSS is more generally applicable for CPU inference, and it's the same 4.0 BPW with similar PPL/KLD stats.
Basically:
- download the full bf16 safetensors from the original repo
- convert with mainline llama.cpp's `convert_hf_to_gguf.py` - you can use the imatrix from here: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF/resolve/main/mmproj-Qwen3.5-122B-A10B-BF16.gguf
- convert the gguf imatrix file to `.dat` format using mainline's `llama-imatrix` so that ik_llama.cpp can use it
- run `llama-quantize` using the `.dat` imatrix file and my 'secret recipe' adjusted to your liking
It doesn't take any VRAM to do this, and even on a modest gaming rig it won't take too long and should need less than 32GB of RAM total.
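The steps above can be sketched roughly as follows. The paths, filenames, and the IQ4_KSS target are illustrative placeholders, not my exact commands, and flag names can differ between mainline llama.cpp and ik_llama.cpp versions, so check each tool's `--help` for your build.

```shell
# 1. Convert the original bf16 safetensors to GGUF with mainline llama.cpp
python convert_hf_to_gguf.py /models/Qwen3.5-27B-heretic-v2 \
    --outfile /models/qwen-27b-bf16.gguf --outtype bf16

# 2. Convert the downloaded GGUF-format imatrix to the legacy .dat format
#    that ik_llama.cpp's llama-quantize can read (example invocation;
#    verify the conversion flags with `llama-imatrix --help`)
./llama-imatrix --in-file imatrix.gguf -o imatrix.dat

# 3. Quantize with ik_llama.cpp's llama-quantize, e.g. to IQ4_KSS
./llama-quantize --imatrix imatrix.dat \
    /models/qwen-27b-bf16.gguf /models/qwen-27b-IQ4_KSS.gguf IQ4_KSS
```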
Let me know if you need any help!
Thank you! Do you think 32GB RAM is enough? I have only an Intel Core Ultra 7 265K and 32GB of 6800 MHz RAM, no dGPU. Do you think I can try making quants for this model?
Anyways, thank you for your help. I think I will give it a try.
I make all my quants with exactly 0 vram haha...
Yes, the most demanding step (in terms of hardware and RAM) is generating the imatrix from the full-size bf16, since you must be able to run inference with the full model. But if someone else has made the imatrix for you, then the llama-quantize step itself takes very little by comparison. A fast NVMe drive is nice for the disk I/O, but if you're patient it should be fine.
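For reference, generating the imatrix yourself looks roughly like this. The model path and calibration file are placeholders, and this is a generic `llama-imatrix` invocation rather than my exact command.

```shell
# Run the full bf16 model over a calibration text to collect the
# importance matrix -- this is the step that needs real inference resources
./llama-imatrix -m /models/qwen-27b-bf16.gguf \
    -f calibration.txt \
    -o imatrix.dat
```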
If you want a very high-level overview of the process, you can check my recent talk: https://blog.aifoundry.org/p/adventures-in-model-quantization and I have a very old quant cooker's guide (out of date) here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
There are also more up-to-date commands in my recent logs/ folders.
Let me know if you get stuck on any part, good luck!