Can you provide Q6_K GGUF?

#1
by Omnico - opened

At the moment, no.
llama.cpp doesn't yet have full support for this model.
GigaChat3's attention uses a hybrid DeepSeek-style MLA layout (uncompressed Q with compressed MLA KV and a specific RoPE placement), while the current llama.cpp DeepSeek backend assumes a different compression scheme and RoPE application pattern, so it cannot correctly map or execute this architecture yet.
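To illustrate why the cache layout differs, here is a toy back-of-the-envelope comparison in Python. The dimensions are DeepSeek-V2-style placeholders, not GigaChat3's actual config:

```python
# Toy numbers in the style of DeepSeek-V2 (NOT GigaChat3's real hyperparameters),
# just to show how an MLA KV cache differs from vanilla multi-head attention.
n_head, head_dim = 16, 128      # hypothetical attention shape
kv_lora_rank = 512              # size of the compressed MLA KV latent
rope_head_dim = 64              # decoupled RoPE key, shared across heads

# Vanilla MHA caches full K and V per token:
mha_cache = 2 * n_head * head_dim          # values cached per token

# MLA caches only the compressed latent plus the small RoPE key:
mla_cache = kv_lora_rank + rope_head_dim   # far fewer values per token

# GigaChat3 additionally keeps Q *uncompressed* (hence no
# deepseek2.attention.q_lora_rank key in its GGUF metadata), while
# llama.cpp's DeepSeek path expected a low-rank Q projection as well.
print(mha_cache, mla_cache)
```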

Alright, thanks to ubergarm's open pull request on GitHub for llama.cpp, it's now possible to work with this model properly (if you compile llama.cpp with this patch yourself, of course).

I've prepared the requested Q6_K and some other quants.


@whoy

Great! The PR just got merged into main, so anyone can pull and rebuild llama.cpp. It also works on ik_llama.cpp for very fast inference, especially on CPU: https://github.com/ikawrakow/ik_llama.cpp/issues/994

Thanks for releasing further quants! Feel free to release any ik_llama.cpp SOTA quantizations as well. My model cards have plenty of example recipes; basically, keep all attn/shexp/first-dense-layer tensors at full Q8_0, and the routed experts can be smaller, e.g. IQ5_KS for ffn_(gate|up)_exps and IQ6_K for ffn_down_exps would probably be a nice mix of quality and speed.
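For illustration, that recipe could be sketched as per-tensor override rules. The regex patterns and the shared-expert/dense-layer tensor names below are assumptions about the GGUF naming, not a tested ik_llama.cpp invocation; see the linked model cards for the exact syntax:

```python
import re

# Hypothetical per-tensor recipe in the spirit of the mix described above.
# First matching pattern wins; tensor-name regexes are assumptions.
recipe = [
    (r"blk\..*\.attn_.*",                    "q8_0"),    # all attention tensors
    (r"blk\..*\.ffn_.*_shexp\.weight",       "q8_0"),    # shared expert
    (r"blk\.0\.ffn_(gate|up|down)\.weight",  "q8_0"),    # first dense layer
    (r"blk\..*\.ffn_(gate|up)_exps\.weight", "iq5_ks"),  # routed experts, smaller
    (r"blk\..*\.ffn_down_exps\.weight",      "iq6_k"),   # down-projection experts
]

def quant_for(tensor_name: str) -> str:
    """Return the quant type the recipe assigns to a tensor name."""
    for pattern, qtype in recipe:
        if re.search(pattern, tensor_name):
            return qtype
    return "iq5_ks"  # fallback for anything unmatched

# A --custom-q style string is just the comma-joined regex=type pairs:
custom_q = ",".join(f"{p}={t}" for p, t in recipe)
print(custom_q)
```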

Cheers!


LM Studio says this when I try to load your model:

error loading model: error loading model hyperparameters: key not found in model: deepseek2.attention.q_lora_rank

The llama.cpp runtime was only updated today. Any ideas?

@omnico Likely LM Studio is still stuck on the b7087 release, which lacks support for this model (support landed in b7127). We'll need to wait a bit, I guess.
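For anyone who wants to confirm which metadata keys a GGUF file actually contains (e.g. to see that q_lora_rank really is absent from the file, so it's the runtime that must handle that), here is a minimal reader sketch based on the public GGUF spec. It only walks the key/value header and skips the values:

```python
import struct

# Byte sizes of the fixed-width GGUF metadata value types.
_SIMPLE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}

def _skip_value(f, vtype: int) -> None:
    """Advance the file past one metadata value of the given type."""
    if vtype in _SIMPLE:
        f.seek(_SIMPLE[vtype], 1)
    elif vtype == 8:  # string: uint64 length + utf-8 bytes
        (n,) = struct.unpack("<Q", f.read(8))
        f.seek(n, 1)
    elif vtype == 9:  # array: uint32 element type + uint64 count + elements
        etype, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, etype)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")

def gguf_metadata_keys(path: str) -> list[str]:
    """List the metadata keys in a GGUF file without loading tensor data."""
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF", "not a GGUF file"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        keys = []
        for _ in range(n_kv):
            (klen,) = struct.unpack("<Q", f.read(8))
            key = f.read(klen).decode("utf-8")
            (vtype,) = struct.unpack("<I", f.read(4))
            _skip_value(f, vtype)
            keys.append(key)
        return keys

# Usage: "deepseek2.attention.q_lora_rank" in gguf_metadata_keys("model.gguf")
```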

My smaller quantizations are available for a variety of backends; see here for details:

https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B/discussions/1#692328e2159c0902bf860119

Jan supports arbitrary backends, which might let you update faster. Hopefully LM Studio gets this useful feature in the future too; downstream projects always have some delay and don't get day-0 support for new models and patches.
