Hardware requirement
What GPU VRAM do I need to run this?
I'm loading gemma-4-31b-it-AWQ with the vllm/vllm-openai-cu130 image on my RTX 3090.
My docker command is:
```
docker run --gpus all \
  --runtime nvidia \
  --ipc=host \
  -v "$MODEL_PATH:/model" \
  -p 8000:8000 \
  vllm/vllm-openai:gemma4-cu130 \
  --model /model \
  --served-model-name gemma-4-31b \
  --dtype bfloat16 \
  --quantization compressed-tensors \
  --max-model-len 1536 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4
```
Note: max-model-len can NOT exceed 1536, otherwise I hit OOM.
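For a rough feel for where that 1536-token ceiling comes from on a 24GB card, you can do the budget arithmetic by hand. The layer/head counts below are placeholder assumptions for illustration (check the model's actual config.json), not the real Gemma 4 31B shapes:

```python
# Back-of-envelope VRAM budget for a 4-bit 31B model on a 24 GB card.
# n_layers / n_kv_heads / head_dim are ASSUMED values for illustration.

GIB = 1024**3

def kv_cache_bytes(tokens, n_layers=48, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache holds one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

weights = 31e9 * 4 / 8              # ~15.5 GB of 4-bit AWQ weights
budget  = 24 * GIB * 0.95           # --gpu-memory-utilization 0.95
kv      = kv_cache_bytes(1536 * 4)  # --max-model-len 1536 x --max-num-seqs 4

print(f"weights  ~{weights / GIB:.1f} GiB")
print(f"kv cache ~{kv / GIB:.2f} GiB")
print(f"headroom ~{(budget - weights - kv) / GIB:.1f} GiB for activations/overhead")
```

With these assumed numbers the weights alone eat most of the card, and the remaining few GiB have to cover KV cache, activations, and vLLM's own overhead, which is why the context ceiling is so low.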
> what GPU VRAM do I need to run this?
The technical answer is ZERO, of course (read up on Turing completeness).
Probably the "practical" answer is 24GB, using an IQ3 GGUF quant with llama-server's --fit option. IQ2 quants might run as well, maybe even in 16GB, but tool calling, accuracy, and context length would all suffer a lot; it might barely function as an agent at all, or it might make REALLY bad mistakes, like picking - when it's meant to be + on that payment into your accounting software. You can mitigate with tests, of course, but the point is: it's probably pretty bad at IQ2. And with such limited context length, it's not going to be practical for agentic dev, or even long conversations, anyway.
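The arithmetic behind those quant-size guesses is quick to sketch. The bits-per-weight figures below are approximate averages for the named llama.cpp quant types:

```python
# Approximate GGUF file sizes for a 31B-parameter model at various quants.
# Bits-per-weight values are rough averages, not exact per-file numbers.

PARAMS = 31e9
GIB = 1024**3

quants = {"IQ2_XXS": 2.06, "IQ3_XXS": 3.06, "IQ4_XS": 4.25,
          "Q4_K_M": 4.85, "Q8_0": 8.5}

for name, bpw in quants.items():
    gib = PARAMS * bpw / 8 / GIB  # bits -> bytes -> GiB
    print(f"{name:8s} ~{gib:5.1f} GiB")
```

IQ3 weights land around 11 GiB, leaving real room for context on a 24GB card; IQ2 is around 7-8 GiB, which is why 16GB is conceivable but very tight once you add the KV cache.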
If you're tight on VRAM, consider the 27B A4B variant instead. It will require less VRAM and be much faster, but it will be more of a convincing parrot OF an intelligent beast than an actual intelligent beast.
Also, bear in mind that the other, smaller Gemma 4 models have audio support, which this lacks. Just trade-offs to consider, especially if you're struggling to fit this variant in.
Even Q4_K_L from bartowski fits into 24GB with at least 16k of unquantized context (maybe a bit more, I did not try) while using the same card as the display device. No need to go below 4bpw with 24GB. For more context you can use Q4_K_S/IQ4_XS and/or quantize the KV cache to Q8.
You will need a bit more if you want to use vision (loading the mmproj file), but IQ4_XS, possibly with the KV cache at Q8, should still allow decent context, I think.
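A rough sketch of what quantizing the KV cache to Q8 buys at different context lengths. Layer/head counts here are assumed for illustration, not taken from the real model config:

```python
# KV cache footprint: full-precision (2 bytes/value) vs q8_0 (~1 byte/value).
# n_layers / n_kv_heads / head_dim are ASSUMED values for illustration.

GIB = 1024**3

def kv_gib(ctx, n_layers=46, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * ctx / GIB

for ctx in (16_384, 65_536):
    f16 = kv_gib(ctx)                   # unquantized cache
    q8  = kv_gib(ctx, bytes_per_val=1)  # q8_0: roughly half the footprint
    print(f"ctx {ctx:6d}: f16 ~{f16:.2f} GiB, q8 ~{q8:.2f} GiB")
```

So Q8 cache roughly halves the context cost, which is the headroom that lets IQ4_XS keep decent context even with the mmproj loaded.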
16k context is junk for agentic use though; it barely even fits the prompt with a few basic tools. You need around 80k as a practical starting point; 65k will work, but will churn between making some progress and compacting the context to make room for more progress. Every compaction loses fidelity, like Chinese whispers.
80k is not going to be good with these small models anyway, especially if you can't use full-precision or at least Q8 weights (and then definitely a full-precision KV cache). For me, 16k is more than enough for almost everything I need. That said, IQ4_XS will likely allow quite a lot of context. If someone really needs more, then Qwen 3.5 27B may be a better choice (it has similar performance to Gemma 4, losing mostly in languages/creative writing), as it is a bit smaller and also uses long context quite efficiently.