3-bit quant for 128 GiB Macs?

#3
by m-i - opened

Or maybe some info about what to use to create quants for ds4.c?
Some of us are trying to squeeze every byte of RAM available.
Running headless with SSH login only, macOS itself uses only ~4 GiB.
You can then allocate about 123 GiB (132 GB) to model + context with `sudo sysctl iogpu.wired_limit_mb=132000`.
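The arithmetic behind that sysctl value can be sketched as follows (the ~5 GiB headroom for headless macOS is an assumption based on the ~4 GiB figure above, and I'm assuming the sysctl takes decimal megabytes):

```python
# Sketch of the wired-limit budget arithmetic; headroom value is an assumption.
GIB = 1024**3

total_ram = 128 * GIB       # a 128 GiB Mac
macos_headroom = 5 * GIB    # leave ~4-5 GiB for a headless macOS

budget = total_ram - macos_headroom
# assuming iogpu.wired_limit_mb takes a value in decimal MB (10^6 bytes)
wired_limit_mb = budget // 10**6

print(wired_limit_mb)  # -> 132070, close to the 132000 used above
```

Rounding down to 132000 leaves a small extra margin for the OS.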

Those quants are for the DS4 inference engine only, and 3-bit quants without the specialized kernels would be useless. I want to stick to more practical size classes :) Thanks.

I thought it would just be a matter of quantizing and running, but after reading parts of ds4/moe.metal I see that additional kernel code would be needed to support other quant types.
If the gguf is under 125 GB on disk, IQ3-XXS + MTP + 50~200k ctx + compressed indexer + headless macOS may fit in 128 GiB of RAM.
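A rough fit check for that setup (all sizes here are assumptions taken from the numbers in this thread, not measurements; actual KV-cache and indexer footprints depend on the model and context length):

```python
# Rough fit check for the proposed setup; all sizes are assumed, not measured.
GIB = 1024**3
GB = 10**9

wired_limit = 123 * GIB     # GPU-wired budget after the sysctl tweak
model_on_disk = 125 * GB    # assumed IQ3-XXS gguf size, mmap'd roughly 1:1 into RAM

ctx_budget = wired_limit - model_on_disk
print(round(ctx_budget / GIB, 1))  # -> 6.6 GiB left for KV cache, MTP, indexer, etc.
```

So the headroom for context is thin; whether 50~200k ctx fits hinges on how much the compressed indexer and KV cache actually take.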

m-i changed discussion status to closed
