Memory Leak?
Edit: I just noticed that this post may be similar to this one. It's hard for me to make sense of exactly what people there are saying, but it seems related to a ballooning cache taking up too much system memory.
Edit 2: It seems this is a known issue and a fix is unlikely in the near term. There are several workarounds in the linked threads, though.
I am running your gemma-4-31B-it-UD-Q8_K_XL.gguf with llama.cpp server and noticing behavior that I have not seen with any other models. I am not sure what could be causing it. I figure it's one of:
a) Normal behavior, and I just don't know it
b) A bug in llama.cpp (llama-server version 8766, released Apr 12)
c) A bug in your quants
d) Something wrong with the hardware or overall system setup
I run your models on a DGX Spark. For gemma-4-31B I give it 128k context. At startup, the system is using about 57 GB of its RAM. When I use this model with my assistant agent, RAM usage keeps growing and growing as the session goes on and more tool calls and context accumulate. A couple of times I have had to shut down the agent and unload the model in the middle of a turn, because once you overflow the RAM on a DGX Spark it locks up, and the only way to recover is basically to unplug it.
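To make the growth easier to show in a bug report, here is a minimal monitoring sketch (assumes a Linux box with `free` and `awk`; the function name, log path, and 10-second interval are just examples I made up):

```shell
#!/usr/bin/env sh
# Print one timestamped line of used system RAM (in MiB).
# "used" is the 3rd field of the "Mem:" line in `free -m` output.
log_mem_once() {
  printf '%s %s MiB used\n' "$(date +%T)" "$(free -m | awk '/^Mem:/ {print $3}')"
}
# Example: sample every 10 s into a log while the agent session runs:
#   while sleep 10; do log_mem_once >> llama_mem.log; done
```

Running that alongside llama-server over a 15-minute heavy session would give a concrete growth curve to attach to an issue.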
I have used a ton of the Unsloth quants, and GGUFs from other places, and I have never seen this behavior before. Once a model is loaded, RAM usage stays pretty constant. Maybe it varies by a few GB here and there, especially after hours of use. But this one more than doubles its RAM usage within 15 minutes of heavy use.
I've used both your initial quants and the updated ones you released on Apr 11, and I have the same problem with both.
I am not sure if this is an Unsloth problem, but maybe someone knows what could be happening. If it's a problem with llama.cpp or the DGX Spark, I will go there and report an issue if one isn't filed already.
Here are the run flags I use when launching the server:
-fa on --n-gpu-layers 999 --no-mmap --direct-io --jinja \
--model "${models_dir}/gemma-4-31B-it-UD-Q8_K_XL/gemma-4-31B-it-UD-Q8_K_XL.gguf" \
--ctx-size 131072 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
--mmproj "${models_dir}/gemma-4-31B-it-UD-Q8_K_XL/mmproj-BF16.gguf"
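For what it's worth, while I wait on a real fix, the only generic llama-server knobs I know of that lower the KV-cache memory floor are quantizing the KV cache and shrinking the context (I'm not claiming these are the workarounds from the linked threads, and they won't cure an actual leak, just buy headroom):

```shell
# Generic memory reducers, not a leak fix. Assumes a llama.cpp build where
# the quantized KV cache works with flash attention enabled (-fa on, as above).
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--ctx-size 65536
```

The q8_0 cache types roughly halve KV-cache size versus the f16 default, and halving --ctx-size halves it again, at the cost of shorter sessions.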
Oh...!