any love for 16gb?

#3
by iucpxleps - opened

Any smaller versions for those of us in limbo with 16GB VRAM? :)

This is a MoE model with only ~4B active parameters per token. Instead of offloading layers, offload some experts to the CPU with llama.cpp's `--cpu-moe` or `--n-cpu-moe` flags. I can easily get 40 t/s with the normal gemma4 at Q6 with just 16GB VRAM. (I had other issues slowing me down, so I dropped to IQ4_XS and now get around 80 t/s. It was literally internal VRAM bus bound, not regular VRAM capacity bound.) Still, an IQ4_XS quant would be nice. Going to give this a shot soon with my 5070 Ti.

https://huggingface.co/blog/Doctor-Shotgun/llamacpp-moe-offload-guide
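The approach from the guide above boils down to one flag on the server command line. A minimal sketch, assuming a llama.cpp build recent enough to have `--n-cpu-moe`; the model path, layer count, and context size here are placeholders you'd tune for your own setup:

```shell
# Offload everything to the GPU, but keep the MoE expert tensors of the
# first N layers in system RAM -- experts are the bulk of the weights,
# so this is what lets a big MoE fit in 16GB of VRAM.
llama-server \
  -m ./model-IQ4_XS.gguf \
  -ngl 99 \
  --n-cpu-moe 12 \
  -c 8192
```

Start with a high `--n-cpu-moe` value and lower it until VRAM is nearly full; alternatively, `--cpu-moe` keeps all expert tensors on the CPU.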

Thanks, I will try it that way.
