Gemma 4 31B's Context VRAM is insane. Seems like an unusable model to me.

#2
by watchingyousleep - opened

I did the number crunching with Claude, and here's how much VRAM you'd need to run this model at full context.
Model used: UD-Q4_K_XL (18.8 GB)
Model Size in VRAM without Cache: 21.07 GB
— No KV Quantization (0.85 MB/token) —
8K: ~7 GB
32K: ~27 GB (48GB of VRAM required)
64K: ~54 GB
128K: ~109 GB
256K: ~218 GB

— Q8_0 (0.25 MB/token) —
8K: ~2 GB (Might fit in 24GB of VRAM)
32K: ~8 GB
64K: ~16 GB (37GB of VRAM required)
128K: ~32 GB
256K: ~64 GB

— Q4_0 (0.038 MB/token) —
8K: ~0.3 GB
32K: ~1.2 GB
64K: ~2.4 GB (Might fit in 24GB of VRAM)
128K: ~4.9 GB
256K: ~9.7 GB

These numbers are rough; they were taken with empty contexts of 100, 2048, and 8192 tokens at each KV cache setting. Tests were done in LM Studio on Windows 11. They aren't perfect, but they're plenty to get the point across. Q4_0 is the only one that looks remotely usable, if you ask me. For reference, Qwen3.5 27B is somewhere around 0.003 MB per token at Q8_0.
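If you want to sanity-check the tables above, the math is just model size plus (per-token KV cost × context length). Here's a minimal sketch using the per-token figures measured above; the numbers are taken straight from this post and are rough estimates, not exact allocator behavior:

```python
# Rough VRAM estimator using the per-token KV cache costs measured above.
MODEL_GB = 21.07  # UD-Q4_K_XL resident size without cache (from this post)

MB_PER_TOKEN = {   # approximate KV cache cost per token, by cache quantization
    "f16": 0.85,
    "q8_0": 0.25,
    "q4_0": 0.038,
}

def kv_cache_gb(context: int, kv_quant: str = "f16") -> float:
    """KV cache size in GB for a given context length."""
    return context * MB_PER_TOKEN[kv_quant] / 1024

def total_vram_gb(context: int, kv_quant: str = "f16") -> float:
    """Model weights plus KV cache (ignores framework overhead)."""
    return MODEL_GB + kv_cache_gb(context, kv_quant)

for ctx in (8_192, 32_768, 65_536, 131_072, 262_144):
    print(f"{ctx // 1024}K f16: ~{kv_cache_gb(ctx):.0f} GB cache, "
          f"~{total_vram_gb(ctx):.0f} GB total")
```

At f16 this reproduces the table above: 32K context costs about 27 GB of cache on top of the ~21 GB of weights, which is why 48 GB of VRAM is needed.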

watchingyousleep changed discussion title from Should I just ignore this model if I don't have at least 64GB of VRAM? Context RAM usage is absurd to Gemma 4 31B's Context VRAM is insane. Seems like an unusable model to me.

The model has a fixed 3.6GB SWA KV cache that you need to account for.

With --fit and 32GB of VRAM the UD-Q4_K_XL model with f16/f16 (no KV quantization) leaves me with over 100,000 context.

I switched to llama.cpp to test it further when I had more time on my hands, and it appears my post above was just an LM Studio bug. In llama.cpp I'm seeing 22.3 GB of VRAM usage at 16384 context at FP16.

watchingyousleep changed discussion status to closed

If you are the only user, add the
-np 1
option to your llama.cpp config. It will reserve space for a single SWA KV cache instead of the default 4.
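For anyone unfamiliar with the flag: `-np` sets the number of parallel sequence slots in llama.cpp's server, and each slot gets its own KV cache allocation. A hedged example invocation (the model filename and context size here are placeholders, adjust to your setup):

```shell
# -np 1: allocate KV cache for a single sequence instead of the default
# parallel slots; -c sets the context length to size the cache for.
llama-server -m gemma-4-31b-UD-Q4_K_XL.gguf -np 1 -c 32768
```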
