IQ4_XS fits a 64GB Mac when max VRAM is set to 61440 MB
Hello. Thank you for the quants.
Just to inform those choosing the right quantization: on a 64GB M1 Max MacBook, I was able to run IQ4_XS after setting `sudo sysctl iogpu.wired_limit_mb=61440` and closing all other high-RAM apps. Context size was set to 65536. llama.cpp crashed when I closed it with Ctrl+C, but up to that moment it worked just fine.
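For reference, the steps above look roughly like this (the model filename is a placeholder; adjust it to the actual GGUF you downloaded):

```shell
# Raise the GPU wired-memory limit to 61440 MB (~61 GB of the 64 GB unified memory).
# Note: this setting resets on reboot, so re-run it after restarting.
sudo sysctl iogpu.wired_limit_mb=61440

# Run llama.cpp with a 65536-token context (-c) on the IQ4_XS quant (-m).
./llama-cli -m model-IQ4_XS.gguf -c 65536
```

Closing other memory-hungry apps first matters here, since the wired limit leaves only a few GB of headroom for the rest of the system.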
Thank you, and thanks to Cerebras, z.ai, Gerganov, and everyone who makes this possible.
Great model.
It would have been really awesome if the Q4 quants other than IQ were under 50GB (for Vulkan inference), but yes, thank you Cerebras, z.ai, Gerganov, and bartowski.
Thanks also to the people who suggested that ggerganov bundle model metadata (the Jinja chat template and sampling parameters such as temp, top_k, etc.) into his revised file format!
(that's me! :)
We might also expect a higher context size to fit once TurboQuant lands in llama.cpp.