QuIP - 2 bit quantised as good as 16 bit

#5
by infinityai - opened

Hi Ubergarm, I don't know if you've seen this yet,

It's supposedly a way to compress the weights without any quality loss

Can you have a look at it and let me know what you think?

Official repos
QuIP (original):
https://github.com/Cornell-RelaxML/QuIP

QuIP# (improved + CUDA kernels):
https://github.com/Cornell-RelaxML/quip-sharp
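
For anyone curious what QuIP's "incoherence processing" actually does, here's a toy numpy sketch of the rotate-round-unrotate idea. This is my own illustration, not the repos' actual algorithm (the real thing uses adaptive/LDLQ rounding and structured rotations, not naive rounding and dense QR matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Weight matrix with very uneven column scales -> "outlier" columns
W = rng.standard_normal((64, 64)) * np.linspace(0.1, 5.0, 64)

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix gives a Haar-random orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def round_to_grid(x, bits=2):
    # Naive symmetric uniform rounding to 2**bits half-integer levels
    half = 2 ** bits / 2 - 0.5          # 1.5 for 2 bits
    scale = np.max(np.abs(x)) / half
    q = np.clip(np.round(x / scale - 0.5) + 0.5, -half, half)
    return q * scale

U = random_orthogonal(64, rng)
V = random_orthogonal(64, rng)

# Plain 2-bit rounding vs. rotate -> round -> rotate back
W_naive = round_to_grid(W)
W_rot = U @ round_to_grid(U.T @ W @ V) @ V.T

err_naive = np.linalg.norm(W - W_naive)
err_rot = np.linalg.norm(W - W_rot)
```

The random rotations spread the outlier columns across all entries, so one global 2-bit grid fits the rotated matrix much better, and `err_rot` comes out well below `err_naive`.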

It would be really cool if we could compress some of the top models using this QuIP quantisation technique

I think QuIP is a couple of years old? Do you have links to any comparisons between it and other things like QTIP, the various ik_llama.cpp/llama.cpp quantization types, etc.?

Also, this library is already used quite a bit, including with Kimi-K2-Thinking: https://github.com/vllm-project/llm-compressor . Intel has their AutoRound stuff too, but the ~2bpw quants in ik_llama.cpp were already better than that in my own testing (you can search the old PRs and discussions on that GitHub for the exchange with the Intel dev).
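
For scale, the size math behind "compressing top models" at low bpw is simple arithmetic (ignoring the scales/metadata overhead that real quant formats add; the 70B parameter count below is just an illustrative round number):

```python
def quant_size_gb(n_params_billion: float, bpw: float) -> float:
    """Approximate weight-file size in GB at a given bits-per-weight."""
    return n_params_billion * 1e9 * bpw / 8 / 1e9

# e.g. a 70B-parameter model:
for bpw in (16, 4, 2):
    print(f"70B @ {bpw:>2} bpw ≈ {quant_size_gb(70, bpw):.1f} GB")
# 16 bpw -> 140.0 GB, 4 bpw -> 35.0 GB, 2 bpw -> 17.5 GB
```

That 8x shrink from 16 to 2 bpw is why the quality-at-2-bits question matters so much.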

There is also QAT stuff, including fine-tuning as part of the process before doing the actual weight quantization...

> It would be really cool if we could compress some of the top models using this QuIP compression quantisation technique

I'm guessing ik's SOTA quants are already better, and they also have kernels for hybrid CPU+CUDA inferencing, so I'm not sure what QuIP could offer here?

Feel free to convince me!

Cheers!

OK, yes, that code is a bit old. What about this from Intel?

https://github.com/intel/auto-round

https://www.youtube.com/watch?v=LszyOPcajEQ

Also, have you seen this REAP?

https://github.com/CerebrasResearch/reap

ExllamaV3 uses QTIP as its backing algorithm if you’re interested. ik_llama.cpp uses an int-based variant for its own trellis quants too.

@infinityai

> what about this from intel?

You can read the AutoRound discussion between ik and the Intel dev here: https://github.com/ikawrakow/ik_llama.cpp/discussions/657#discussioncomment-13900044

> also have you seen this REAP

Yes, I'm familiar with REAP: it prunes some of the sparse experts to shrink the model size. I've heard anecdotally that the resulting model can sometimes run inference slower than one would expect. Also, I've not seen convincing evidence (e.g. perplexity/KLD graphs) showing that a REAP version of a model is better than a non-REAP version with heavier quantization. It's interesting and still an open area for research if you'd like to dig into it.
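
For reference, the two metrics I mean are easy to state. Here's a minimal numpy sketch, assuming you've dumped per-token probabilities from both the baseline and the quantized/pruned model (tooling like llama.cpp's perplexity utility can produce this kind of data):

```python
import numpy as np

def perplexity(token_probs):
    # PPL = exp(mean negative log-likelihood of the observed tokens)
    return float(np.exp(-np.mean(np.log(token_probs))))

def kld(p_ref, p_q, eps=1e-12):
    # Mean KL(ref || quant) over positions; rows are full vocab distributions
    p_ref = np.clip(p_ref, eps, None)
    p_q = np.clip(p_q, eps, None)
    return float(np.mean(np.sum(p_ref * np.log(p_ref / p_q), axis=-1)))

# Sanity check: a model that always gives the right token p=0.25 has PPL 4,
# and a model identical to the reference has KLD 0.
print(perplexity(np.full(50, 0.25)))                    # 4.0
print(kld(np.array([[0.7, 0.2, 0.1]]),
          np.array([[0.7, 0.2, 0.1]])))                 # ~0.0
```

A convincing REAP-vs-quantization comparison would plot these at equal file sizes; I just haven't seen that done.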

As @dinerburger points out, EXL3 (exllamav3 by turboderp) and ik both have their own implementations of trellis quants (based on the QTIP paper), which are among the best-quality options currently available, but they work best with GPU offload due to the computational requirements during token generation (decode). They do work on CPU though, and might be worth exploring on some hardware configurations, especially with very limited RAM.
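
If anyone wants intuition for what a "trellis quant" is doing, here's a toy 1-bit-per-weight trellis-coded quantizer with a Viterbi search. Purely illustrative: the state counts, codebooks, and compute-based decoding in QTIP/EXL3/ik are very different, this just shows the state-constrained-codeword idea:

```python
import numpy as np

rng = np.random.default_rng(0)
NSTATES = 4
# codebook[state, bit] = reproduction value emitted on edge `bit` out of `state`
codebook = rng.normal(size=(NSTATES, 2))
# Shift-register style transitions: next state depends on current state and bit
next_state = np.array([[(2 * s + b) % NSTATES for b in (0, 1)]
                       for s in range(NSTATES)])

def trellis_quantize(x):
    """Find the bit sequence whose trellis path minimizes squared error to x."""
    T = len(x)
    cost = np.full(NSTATES, np.inf)
    cost[0] = 0.0
    back = np.zeros((T, NSTATES, 2), dtype=int)  # stores (prev_state, bit)
    for t in range(T):
        new_cost = np.full(NSTATES, np.inf)
        for s in range(NSTATES):
            if np.isinf(cost[s]):
                continue
            for b in (0, 1):
                ns = next_state[s, b]
                c = cost[s] + (x[t] - codebook[s, b]) ** 2
                if c < new_cost[ns]:
                    new_cost[ns] = c
                    back[t, ns] = (s, b)
        cost = new_cost
    # Backtrack from the cheapest final state
    s = int(np.argmin(cost))
    bits, recon = [], np.zeros(T)
    for t in range(T - 1, -1, -1):
        ps, b = back[t, s]
        recon[t] = codebook[ps, b]
        bits.append(int(b))
        s = int(ps)
    return np.array(bits[::-1]), recon

x = rng.normal(size=64)
bits, xq = trellis_quantize(x)
mse = float(np.mean((x - xq) ** 2))
```

Decode is just walking the trellis with the stored bits, which is why generation is compute-heavy relative to simple scale-and-lookup formats.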

Cheers!
