Hmm. Super slow performance on newest llama.cpp
#15
by jeffwadsworth - opened
You should check your llama-server logs to see how many layers were offloaded to the GPU. Since you didn't set --n-gpu-layers, llama-server picked a value automatically, but perhaps not all of your VRAM was available at load time?
Also, I don't know how much VRAM you have, but if it's 24 GB, setting --ctx-size to 50000 for Q8_0 is really optimistic! Start with 4096 or even 2048, for example, and ramp up from there.
Finally, I don't know what your CPU is either, but 36 threads seems like a lot. What if you leave it unset and let llama-server figure it out?
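To make the suggestions above concrete, here is a sketch of a conservative starting invocation. The model path is a placeholder, and the values are just a starting point, not a recommendation for this specific machine; --ctx-size is the context flag discussed above:

```shell
# Conservative first run: small context, and let llama-server
# choose GPU offload and thread count automatically.
# NOTE: the model path below is a placeholder.
./llama-server \
  -m ./models/your-model-Q8_0.gguf \
  --ctx-size 4096
# Check the startup log for how many layers were offloaded
# to the GPU before raising --ctx-size or pinning --threads.
```

If the logs show all layers offloaded and VRAM headroom remains, ramp the context size up gradually from there.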
I don't use a GPU. Dual Xeon Gold, 18 cores each at 3.2 GHz, with DDR4 RAM. The point is to see how it compares to running the massive GLM 5 model.
