Can you provide Q6_K GGUF?

#1
by Omnico - opened

At the moment, no.
llama.cpp doesn't yet have full support for this model.
GigaChat3's attention uses a hybrid DeepSeek-style MLA layout (uncompressed Q with compressed MLA KV and a specific RoPE placement), while the current llama.cpp DeepSeek backend assumes a different compression scheme and RoPE application pattern, so it cannot correctly map or execute this architecture yet.
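To illustrate why the cache layout differs, here is a toy back-of-the-envelope comparison in Python. The dimensions are DeepSeek-V2-style placeholders, not GigaChat3's actual config:

```python
# Toy numbers in the style of DeepSeek-V2 (NOT GigaChat3's real hyperparameters),
# just to show how an MLA KV cache differs from vanilla multi-head attention.
n_head, head_dim = 16, 128      # hypothetical attention shape
kv_lora_rank = 512              # size of the compressed MLA KV latent
rope_head_dim = 64              # decoupled RoPE key, shared across heads

# Vanilla MHA caches full K and V per token:
mha_cache = 2 * n_head * head_dim          # values cached per token

# MLA caches only the compressed latent plus the small RoPE key:
mla_cache = kv_lora_rank + rope_head_dim   # far fewer values per token

# GigaChat3 additionally keeps Q *uncompressed* (hence no
# deepseek2.attention.q_lora_rank key in its GGUF metadata), while
# llama.cpp's DeepSeek path expected a low-rank Q projection as well.
print(mha_cache, mla_cache)
```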

Alright, thanks to ubergarm's open pull request on GitHub for llama.cpp, it's now possible to work with this model properly (if you compile llama.cpp with this patch yourself, of course).

I've prepared the requested Q6_K and some other quants.


@whoy

Great! The PR just got merged into main, so anyone can pull and rebuild llama.cpp. It also works on ik_llama.cpp for very fast inference, especially on CPU: https://github.com/ikawrakow/ik_llama.cpp/issues/994

Thanks for releasing further quants! Feel free to release any ik_llama.cpp SOTA quantizations as well. My model cards have plenty of example recipes; basically, keep all attn/shexp/first-dense-layer tensors at full Q8_0, and the routed experts can be smaller, e.g. IQ5_KS for ffn_(gate|up)_exps and IQ6_K for ffn_down_exps would probably be a nice mix of quality and speed.
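For illustration, that recipe could be sketched as per-tensor override rules. The regex patterns and the shared-expert/dense-layer tensor names below are assumptions about the GGUF naming, not a tested ik_llama.cpp invocation; see the linked model cards for the exact syntax:

```python
import re

# Hypothetical per-tensor recipe in the spirit of the mix described above.
# First matching pattern wins; tensor-name regexes are assumptions.
recipe = [
    (r"blk\..*\.attn_.*",                    "q8_0"),    # all attention tensors
    (r"blk\..*\.ffn_.*_shexp\.weight",       "q8_0"),    # shared expert
    (r"blk\.0\.ffn_(gate|up|down)\.weight",  "q8_0"),    # first dense layer
    (r"blk\..*\.ffn_(gate|up)_exps\.weight", "iq5_ks"),  # routed experts, smaller
    (r"blk\..*\.ffn_down_exps\.weight",      "iq6_k"),   # down-projection experts
]

def quant_for(tensor_name: str) -> str:
    """Return the quant type the recipe assigns to a tensor name."""
    for pattern, qtype in recipe:
        if re.search(pattern, tensor_name):
            return qtype
    return "iq5_ks"  # fallback for anything unmatched

# A --custom-q style string is just the comma-joined regex=type pairs:
custom_q = ",".join(f"{p}={t}" for p, t in recipe)
print(custom_q)
```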

Cheers!


LM Studio says this when I try to load your model:

error loading model: error loading model hyperparameters: key not found in model: deepseek2.attention.q_lora_rank

The llama.cpp runtime was only updated today. Any ideas?

@omnico Likely LM Studio is still stuck on the b7087 release, which lacks support for this model (support landed in b7127). We'll need to wait a bit, I guess.
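For anyone who wants to confirm which metadata keys a GGUF file actually contains (e.g. to see that q_lora_rank really is absent from the file, so it's the runtime that must handle that), here is a minimal reader sketch based on the public GGUF spec. It only walks the key/value header and skips the values:

```python
import struct

# Byte sizes of the fixed-width GGUF metadata value types.
_SIMPLE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}

def _skip_value(f, vtype: int) -> None:
    """Advance the file past one metadata value of the given type."""
    if vtype in _SIMPLE:
        f.seek(_SIMPLE[vtype], 1)
    elif vtype == 8:  # string: uint64 length + utf-8 bytes
        (n,) = struct.unpack("<Q", f.read(8))
        f.seek(n, 1)
    elif vtype == 9:  # array: uint32 element type + uint64 count + elements
        etype, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, etype)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")

def gguf_metadata_keys(path: str) -> list[str]:
    """List the metadata keys in a GGUF file without loading tensor data."""
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF", "not a GGUF file"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        keys = []
        for _ in range(n_kv):
            (klen,) = struct.unpack("<Q", f.read(8))
            key = f.read(klen).decode("utf-8")
            (vtype,) = struct.unpack("<I", f.read(4))
            _skip_value(f, vtype)
            keys.append(key)
        return keys

# Usage: "deepseek2.attention.q_lora_rank" in gguf_metadata_keys("model.gguf")
```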

My smaller quantizations are available for a variety of backends; see here for details:

https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B/discussions/1#692328e2159c0902bf860119

Jan supports arbitrary backends, which might let you update faster. Hopefully LM Studio gets this useful feature in the future too; downstream projects always have some delay and don't get day-0 support for new models and patches.
