Any chance to re-do the quants for MLA?

#1
by Panchovix - opened

Hi there, thanks for your work quantizing. I was wondering if it would be possible to re-do these quants with the latest mainline llama.cpp, as it uses MLA, which reduces the KV cache's VRAM usage by a lot: at 16K context, for example, from 80GB down to about 4GB.
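As a rough sanity check on those numbers, here is a back-of-envelope estimate of the KV cache size with and without MLA. The model dimensions are my assumption (taken from the published DeepSeek-V3 architecture; the thread does not state them), and llama.cpp's actual allocation can differ from this idealized fp16 figure:

```python
# Back-of-envelope KV-cache estimate for a DeepSeek-V3-class model at
# 16K context. Dimensions below are assumptions from the DeepSeek-V3
# paper, not from this thread; llama.cpp's real allocation may differ.

N_LAYERS = 61       # transformer layers
N_HEADS = 128       # attention heads
K_HEAD_DIM = 192    # per-head key dim (128 "nope" + 64 rope)
V_HEAD_DIM = 128    # per-head value dim
KV_LORA_RANK = 512  # MLA compressed latent dim
ROPE_DIM = 64       # decoupled rope key dim
BYTES = 2           # fp16 cache entries

def mha_cache_gib(n_ctx: int) -> float:
    """Full per-head K and V cached (pre-MLA behaviour)."""
    per_tok = N_HEADS * (K_HEAD_DIM + V_HEAD_DIM) * BYTES * N_LAYERS
    return n_ctx * per_tok / 2**30

def mla_cache_gib(n_ctx: int) -> float:
    """Only the compressed latent plus rope key cached (MLA behaviour)."""
    per_tok = (KV_LORA_RANK + ROPE_DIM) * BYTES * N_LAYERS
    return n_ctx * per_tok / 2**30

print(f"16K without MLA: {mha_cache_gib(16384):.1f} GiB")  # ~76 GiB
print(f"16K with MLA:    {mla_cache_gib(16384):.2f} GiB")  # ~1.1 GiB
```

The fp16 estimate lands in the same ballpark as the 80GB-to-4GB reduction quoted above; the exact figures depend on cache precision and implementation overhead.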

Thanks in advance!

DevQuasar-4 org

Thanks for the suggestion.
Yes it's possible.
Just for my education, can you please point me to the release that introduced the feature?
https://github.com/ggml-org/llama.cpp/releases

Thanks! It's from this PR, which was merged three weeks ago.

https://github.com/ggml-org/llama.cpp/pull/12801

And specific commit was https://github.com/ggml-org/llama.cpp/commit/daa422881a0ec7944771bcc8ff8de34d11f5bd3b

Most issues reported after the merge have been fixed by now.
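For reference, the re-quantization workflow on a post-merge mainline build would look roughly like this. This is a sketch under my assumptions: paths, the bf16 intermediate, and the Q4_K_M target are placeholders, not anything stated in this thread:

```shell
# Build mainline llama.cpp at or after the MLA merge (commit daa42288).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Re-convert from the original HF weights so the GGUF carries the MLA
# attention tensors (the premise of this thread is that existing GGUFs
# need re-conversion). Paths and --outtype are placeholders.
python convert_hf_to_gguf.py /path/to/original-hf-model \
    --outtype bf16 --outfile model-bf16.gguf

# Quantize as usual.
./build/bin/llama-quantize model-bf16.gguf model-Q4_K_M.gguf Q4_K_M
```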

DevQuasar-4 org

@Panchovix It's in the making (will take a while)

@csabakecskemeti amazing, many thanks! I can run Q3_K_M and Q4_K_M, so will wait patiently but expectantly for those!

DevQuasar-4 org

@Panchovix the updated Q2_K quant has been uploaded. Could you please double-check that it supports all the features you mentioned? Thanks.
The rest of the quants are being uploaded.

DevQuasar-4 org

Q3 and Q4 have been re-uploaded too

Many thanks! I will download Q4_K_M and let you know how it goes!
EDIT: It works fine with MLA + FA, many thanks!
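For anyone else trying this, "MLA + FA" corresponds roughly to launch flags like the following; the model path, context size, and layer count are placeholders I've filled in for illustration:

```shell
# -fa enables flash attention, -c sets the context size, and -ngl
# offloads layers to the GPU. MLA itself needs no flag on mainline;
# it is used automatically for a re-converted GGUF. Values are placeholders.
./build/bin/llama-server -m model-Q4_K_M.gguf -c 16384 -fa -ngl 62
```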

Just finished the q5 and q6 quants; those should be re-uploaded by tomorrow as well!
Any need for q8?

Many thanks!!!

Personally I can't run Q8, but I'm not sure how many people want it either D:

@Panchovix Is the Q4_K any good?

I ran the Q2_K and found the code output worse than a UD-1s of the full DeepSeek-V3.5 model.

P.S. Thanks for the MLA quants @csabakecskemeti !

@gghfez I feel it is better than the 1-bit quants, but not above the q2_k_xl quant.

DevQuasar-4 org

Q6 will be updated by tomorrow
