Hmm. Super slow performance on newest llama.cpp

#15
by jeffwadsworth - opened

Using this to start:
.\llama-server --model Qwen3.5-27B-Q8_0.gguf --temp 0.7 --threads 36 --ctx-size 50000 --flash-attn on --top-p 1.0 --min-p 0.01
I get around 3 t/s; for comparison, I'm running the massive 4-bit Unsloth quant of GLM 5. Using llama.cpp build b8192.

You should check your llama-server logs to see how many layers were offloaded to your GPU: since you didn't set --n-gpu-layers, llama-server decided automatically, and maybe not all of your VRAM was available at load time?
Also, I don't know how much VRAM you have, but if it's 24GB, setting --ctx-size to 50000 for a Q8_0 model is really optimistic! Start with 4096 or even 2048, for example, and ramp up from there.
Finally, I don't know your CPU either, but 36 threads seems like a lot. What if you leave --threads unset and let llama-server figure it out?
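As a sketch, a more conservative starting invocation along those lines might look like this (the model path is taken from the original command; the layer count of 99 is just a conventional "offload everything" placeholder, and exact flag names can vary between llama.cpp builds, so check --help on yours):

```shell
# Hypothetical conservative starting point: small context, explicit
# full GPU offload, threads left for llama-server to auto-detect.
./llama-server \
  --model Qwen3.5-27B-Q8_0.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --temp 0.7 --top-p 1.0 --min-p 0.01
```

Then watch the startup log for the "offloaded X/Y layers" line and raise --ctx-size step by step while it still fits.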

I don’t use a GPU. Dual Xeon Gold with 18 cores each at 3.2 GHz, DDR4 RAM. The point is how it compares to running the massive GLM 5 model.
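For CPU-only decoding, token generation is roughly memory-bandwidth-bound: each token has to stream the full set of weights from RAM. A back-of-envelope sketch, where the bandwidth figure and model size are illustrative assumptions rather than measurements of this machine:

```python
# Rough estimate of CPU-only decode speed for a dense model.
# All numbers below are assumptions for illustration, not measurements.

def tokens_per_second(model_bytes: float, mem_bandwidth_gbs: float) -> float:
    """Dense decode is ~bandwidth-bound: each token reads all weights once."""
    return mem_bandwidth_gbs * 1e9 / model_bytes

# Q8_0 stores ~8.5 bits per weight, so a 27B dense model is roughly 28-29 GB.
q8_27b_bytes = 27e9 * (8.5 / 8)

# Assumed sustained bandwidth for a dual-socket DDR4 Xeon; peak is higher,
# but NUMA effects and real access patterns usually land well below peak.
effective_bw_gbs = 100.0

print(f"~{tokens_per_second(q8_27b_bytes, effective_bw_gbs):.1f} t/s")
```

Under those assumptions the estimate lands in the same ballpark as the ~3 t/s reported above, which suggests the setup is behaving about as fast as DDR4 bandwidth allows rather than being misconfigured.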
