Speed comparison with bartowski's IQ4_NL

#3
by kulminaator - opened

I was thinking that maybe I should use mudler/gemma-4-26B-A4B-it-APEX-GGUF:I-Compact instead of bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ4_NL , since they are in roughly the same size ballpark and should perform close to each other.

Well, currently llama.cpp has an issue with these models on my Vulkan backend, so very large prompts cause a crash with both.

But another note is that bartowski's IQ4_NL is much faster at token generation on my AMD Radeon + Vulkan setup than the I-Compact: roughly 14 t/s vs 6 t/s. This is a vast difference.

All settings other than the model name were identical (96k context, KV cache at q4 quant, flash attention on).

Any ideas what's wrong in my setup? Or is the I-Compact supposed to be slower?

Ran it like this:

llama-b8679$ ./llama-cli -hf mudler/gemma-4-26B-A4B-it-APEX-GGUF:I-Compact  -fa 1    -ngl 99   -c 96000   --temp 0.5 --jinja -ctk q4_0 -ctv q4_0   --reasoning-budget 1
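To check whether the gap comes from the quant files themselves rather than the serving setup, llama-bench (shipped alongside llama-cli in the llama.cpp build) can measure both models under identical conditions. A sketch, assuming the two GGUF files have already been downloaded locally (the filenames below are placeholders, not the real downloaded names):

```shell
# Benchmark both quants back to back with the same offload and
# flash-attention settings; -m takes a comma-separated list of models.
# -p 512 measures prompt processing, -n 128 measures token generation.
./llama-bench \
  -m I-Compact.gguf,IQ4_NL.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 128
```

If the t/s gap shows up here too, the slowdown is in the quant format itself (or its Vulkan kernels) rather than in the context/cache settings.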

Update on my instability: it seemed to be related to the Ubuntu 22.04 Vulkan libs. Switching over to a dockerized Vulkan llama.cpp fixed that crash, but the speed difference between APEX and IQ4_NL is still very considerable.
