Hmm. Super slow performance on newest llama.cpp

#15
by jeffwadsworth - opened

Using this to start:
.\llama-server --model Qwen3.5-27B-Q8_0.gguf --temp 0.7 --threads 36 --ctx-size 50000 --flash-attn on --top-p 1.0 --min-p 0.01
I get around 3 t/s; for comparison, I'm running the massive 4-bit Unsloth quant of GLM 5. Using llama.cpp build b8192.

You should check your llama-server logs to see how many layers were offloaded to your GPU: since you didn't set --n-gpu-layers, llama-server decided automatically, and maybe not all of your VRAM was available at load time?
Also, I don't know how much VRAM you have, but if it's 24GB, setting --ctx-size to 50000 for a Q8_0 model is really optimistic! Start with 4096 or even 2048, for example, and ramp up from there.
Finally, I don't know your CPU either, but 36 threads seems like a lot. What if you leave --threads unset and let llama-server figure it out?
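As a sketch, a more conservative starting invocation along those lines might look like this (the model path is taken from the original command; the layer count of 99 is just a conventional "offload everything" placeholder, and exact flag names can vary between llama.cpp builds, so check --help on yours):

```shell
# Hypothetical conservative starting point: small context, explicit
# full GPU offload, threads left for llama-server to auto-detect.
./llama-server \
  --model Qwen3.5-27B-Q8_0.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --temp 0.7 --top-p 1.0 --min-p 0.01
```

Then watch the startup log for the "offloaded X/Y layers" line and raise --ctx-size step by step while it still fits.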

I don’t use a GPU. Dual Xeon Gold with 18 cores each at 3.2 GHz, DDR4 RAM. The point is how it compares to running the massive GLM 5 model.
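For CPU-only decoding, token generation is roughly memory-bandwidth-bound: each token has to stream the full set of weights from RAM. A back-of-envelope sketch, where the bandwidth figure and model size are illustrative assumptions rather than measurements of this machine:

```python
# Rough estimate of CPU-only decode speed for a dense model.
# All numbers below are assumptions for illustration, not measurements.

def tokens_per_second(model_bytes: float, mem_bandwidth_gbs: float) -> float:
    """Dense decode is ~bandwidth-bound: each token reads all weights once."""
    return mem_bandwidth_gbs * 1e9 / model_bytes

# Q8_0 stores ~8.5 bits per weight, so a 27B dense model is roughly 28-29 GB.
q8_27b_bytes = 27e9 * (8.5 / 8)

# Assumed sustained bandwidth for a dual-socket DDR4 Xeon; peak is higher,
# but NUMA effects and real access patterns usually land well below peak.
effective_bw_gbs = 100.0

print(f"~{tokens_per_second(q8_27b_bytes, effective_bw_gbs):.1f} t/s")
```

Under those assumptions the estimate lands in the same ballpark as the ~3 t/s reported above, which suggests the setup is behaving about as fast as DDR4 bandwidth allows rather than being misconfigured.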
