Just squeeze 4-5 GB more.

#2
by emircanerkul - opened

Why not squeeze a bit more so it fits on 16 GB VRAM GPUs? I tested gemma-4-26B-A4B (the lowest i4 quant) on a 6800 XT and got 70 tps. I'm wondering whether this full model or the MoE one is the better choice.

This quant targets Blackwell FP4 tensor cores (RTX 5090, PRO 6000, etc.), so it wouldn't benefit AMD GPUs anyway. I've already quantized the layers that tolerate it with the least loss. Pushing further might get the weights closer to 16 GB, but you wouldn't have enough VRAM left for the KV cache.
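To see why the KV cache eats the remaining headroom, here is a minimal back-of-the-envelope sketch. The model dimensions below are hypothetical placeholders for illustration, not the actual config of this checkpoint:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: 2x (keys + values) per layer, per token.

    Assumes FP16/BF16 cache (2 bytes/element) and no cache quantization.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical example config (NOT this model's real dimensions):
size = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128,
                      context_len=32768)
print(f"{size / 1024**3:.1f} GiB")  # 6.0 GiB for a single 32k-token sequence
```

Even with grouped-query attention keeping the KV-head count low, a long context can claim several GiB on top of the weights, which is why shaving the quant down to exactly 16 GB wouldn't leave usable room.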

I see, thank you.

emircanerkul changed discussion status to closed
