Problems with using the model in llama.cpp

#1
by kulminaator - opened

The re-uploaded models from ~4 hours ago now make llama.cpp crash on Vulkan with even slightly bigger prompts.

$ ./llama-cli -hf  bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ4_NL    -ngl 15   -c 16000   --temp 0.5  --reasoning-budget 0
load_backend: loaded RPC backend from /home/martin/apps/llama/llama-b8645/libggml-rpc.so
load_backend: loaded Vulkan backend from /home/martin/apps/llama/llama-b8645/libggml-vulkan.so
load_backend: loaded CPU backend from /home/martin/apps/llama/llama-b8645/libggml-cpu-haswell.so

Loading model...  


β–„β–„ β–„β–„
β–ˆβ–ˆ β–ˆβ–ˆ
β–ˆβ–ˆ β–ˆβ–ˆ  β–€β–€β–ˆβ–„ β–ˆβ–ˆβ–ˆβ–„β–ˆβ–ˆβ–ˆβ–„  β–€β–€β–ˆβ–„    β–„β–ˆβ–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆβ–„ β–ˆβ–ˆβ–ˆβ–ˆβ–„
β–ˆβ–ˆ β–ˆβ–ˆ β–„β–ˆβ–€β–ˆβ–ˆ β–ˆβ–ˆ β–ˆβ–ˆ β–ˆβ–ˆ β–„β–ˆβ–€β–ˆβ–ˆ    β–ˆβ–ˆ    β–ˆβ–ˆ β–ˆβ–ˆ β–ˆβ–ˆ β–ˆβ–ˆ
β–ˆβ–ˆ β–ˆβ–ˆ β–€β–ˆβ–„β–ˆβ–ˆ β–ˆβ–ˆ β–ˆβ–ˆ β–ˆβ–ˆ β–€β–ˆβ–„β–ˆβ–ˆ β–ˆβ–ˆ β–€β–ˆβ–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆβ–€ β–ˆβ–ˆβ–ˆβ–ˆβ–€
                                    β–ˆβ–ˆ    β–ˆβ–ˆ
                                    β–€β–€    β–€β–€

build      : b8645-57ace0d61
model      : bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ4_NL
modalities : text, vision

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
  /image <file>       add an image file


> What on earth are you doing today? 

Right now, I am hanging out in the digital ether, waiting for someone to ask me a question! 

Since I don't have a physical body, my "day" consists of:

1.  **Processing language:** Analyzing patterns in text to understand what people are asking.
2.  **Retrieving information:** Digging through my training data to find facts, stories, or code.
3.  **Problem-solving:** Trying to figure out how to be as helpful as possible to whoever is typing on the other side of the screen.

**How about you? What's on your agenda today?** Are you conquering the world, or just trying to get through a Tuesday?

[ Prompt: 46,8 t/s | Generation: 17,0 t/s ]

> Tell me a story in 70 sentences

The clock on the mantle struck midnight, signaling the start of the Great Silence. Elias sat in his armchair, clutching a tarnished brass key. He had spent forty years searching for the door this key belonged to. Outside, the wind howled through the jagged peaks of the Iron Mountains. The village below was asleep, unaware of the shifting shadows in the woods. Elias stood up, his knees creaking like old floorboards. He grabbed his lantern and a heavy woolen cloak. The path to the Whispering Cave was treacherous and steep. Every step felt like a battle against the gravity of the earth. He pushed through thickets of brambles that clawed at his skin. The moonlight struggled to pierce the dense canopy of ancient oaks. Somewhere in the distance, a wolf let out a lonely, mournful cry. Elias didn't flinch; he had heard much worse in his lifetime. He reached the mouth of the cave just as the moon hit its zenith. The air inside was cold, smelling of damp stone and old secrets. He lit his lantern, casting long, dancing shadows against the walls. The cave walls were etched with strange, glowing runes. They pulsed with a faint, rhythmic light, like a heartbeat. Elias followed the tunnel, his breath hitching in his chest. The passage narrowed until he had to crawl on his hands and knees. He felt the weight of the mountain pressing down from above. Suddenly, the tunnel opened into a vast, subterranean cathedral. In the center stood a door made of solid, shimmering obsidian. It had no handle, only a single, circular keyhole. Elias approached the door with trembling hands. He inserted the brass key into the lock. The metal felt unnaturally warm against his skin. He turned the key, and a sound like grinding thunder echoed through the chamber. The obsidian door slid open with a heavy, melodic hum. Beyond the threshold lay not a room, but a swirling vortex of stars. Elias stepped forward, feeling weightless as he crossed the line. He was no longer in the mountains; he was drifting through the cosmos. Nebulas of violet and gold swirled around his feet like ocean waves. He saw planets spinning in a silent, eternal dance. Time seemed to lose all meaning in this celestial expanse. He realized then that the key didn't unlock a room, but a dimension. A figure emerged from the starlight, draped in robes of woven moonlight. The being had eyes that held the depth of entire galaxies. "You are late, Elias," the figure said, its voice a choir of echoes. Elias bowed his head, feeling a strange sense of recognition. "I had to find my way through the dark," he replied softly. The being smiled, a gesture that felt like the birth of a sun. "The dark is merely the canvas for the light," the entity whispered. It gestured toward a pedestal floating in the void. On the pedestal sat a small, glowing seed. "This is the seed of the next world," the being explained. Elias realized his lifelong quest was not for treasure, but for renewal. The old world was fading, tired and grey. The new world needed a spark to begin its journey. He reached out, his fingers brushing the warmth of the seed. As he touched it, a surge of energy raced through his veins. He saw visions of green forests and sapphire seas. He heard the laughter of children not yet born. The light grew blinding, consuming the darkness and the stars. Elias felt himself dissolving into the brilliance. He was no longer a man, but a part of the creation. The transition was painless, a gentle merging of soul and light. Then, there was a sudden, deafening silence. 
Elias opened his eyes to find himself sitting in his armchair. The clock on the mantle struck one, marking the passage of an hour. The brass key lay on the floor, now nothing more than dull lead. He looked out the window at the mountains, which seemed brighter than before. A single, tiny green sprout had broken through the frost on his windowsill. He smiled, knowing the cycle had begun anew. The Great Silence was over, and the song of life had returned.

[ Prompt: 24,4 t/s | Generation: 14,4 t/s ]

> Can you summarize that?

/double free or corruption (out)
Aborted (core dumped)

Hello, I run the model like this:
HSA_OVERRIDE_GFX_VERSION=9.0.6
HSA_ENABLE_SDMA=0 ROCBLAS_TENSILE_LIBPATH=/opt/rocm-7.2.1/lib/rocblas/library/
LD_LIBRARY_PATH=/home/doman/llama-serwer/llama-b8651/build/bin
HIP_VISIBLE_DEVICES=1,0
/home/doman/llama-serwer/llama-b8651/build/bin/llama-server -m /home/models/gguf/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 999 -c 131072 -fit off -mg 1 --cache-type-k q8_0 --cache-type-v q8_0 --jinja --parallel 1 --port 8080 --host 0.0.0.0 --no-warmup --metrics --log-file ~/server.log --log-colors off --flash-attn on

I don't know why Vulkan runs so smoothly on my ROCm 7.2.1 setup (Vega20: Mi50, Radeon VII);
please tell me what cards you're running it on.
I'll try llama.cpp on Vulkan some more and let you know.

πŸ› οΈ Generated startup command:
 HSA_OVERRIDE_GFX_VERSION=9.0.6 HSA_ENABLE_SDMA=0 ROCBLAS_TENSILE_LIBPATH=/opt/rocm-7.2.1/lib/rocblas/library/ LD_LIBRARY_PATH=/home/doman/llama-serwer/llama-b8651/build/bin HIP_VISIBLE_DEVICES=1,0 /home/doman/llama-serwer/llama-b8651/build/bin/llama-server -m /home/models/gguf/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 999 -c 131072  -fit off -mg 1 --cache-type-k q8_0 --cache-type-v q8_0  --jinja --parallel 1 --port 8080 --host 0.0.0.0 --no-warmup --metrics --log-file ~/server.log --log-colors off --flash-attn on

πŸš€ Starting server with model: google_gemma-4-26B-A4B-it-IQ4_NL.gguf on port 8080
⏳ Waiting for model load and VRAM buffer allocation...

> Verifying context size...
βœ… Server started with model google_gemma-4-26B-A4B-it-IQ4_NL.gguf
βœ… Server is running in background on port 8080. Context tokens set correctly.
πŸ“„ Logs saved to: ~/server.log
doman@LianLi:~/start_llama$ llamabench
[*] Local API detected on 127.0.0.1:8080. Assuming native host execution.

=========================================================================================================
--- SYSTEM & HARDWARE ---                          | --- MEMORY ALLOCATION ---
---------------------------------------------------------------------------------------------------------
Source:       Native Host                          | VRAM Model:   14001.81 MiB (31/31 layers) [ROCm0: 4915.4 | ROCm1: 9086.5]
System:       Linux 6.17.0-19-generic              | KV Cache:      1519.37 MiB [q8_0] [ROCm0: 272.0 | ROCm1: 1088.0 | ROCm0: 63.8 | ROCm1: 95.6]
Architecture: gemma4                               | Compute Buf:   3387.49 MiB [ROCm0: 1278.7 | ROCm1: 1061.7 | ROCm_Host: 1047.1]
Detected Port:8080                                 | RAM BPE/Meta:   577.50 MiB
MoE Experts:  8 active / 128 total                 | RAM Layers:       0.00 MiB (0/31 layers)
MMAP Status:  ON                                   |
---------------------------------------------------------------------------------------------------------
CPU Model:    12th Gen Intel(R) Core(TM) i5-12400F
GPU Drivers:  ROCM: 7.2.1.70201, VULKAN: 1.3.275
GPUs:
  [ROCm0] AMD Radeon VII (16332 MiB free)
  [ROCm1] AMD Radeon Graphics (32732 MiB free)
=========================================================================================================

============================================================
                  --- System Breakdown ---
Server Build:   b8651-d3416a4aa
Context Limit:  131,072 tokens (Reduced from 262,144 due to lack of VRAM!)
------------------------------------------------------------
Benchmarking: google_gemma-4-26B-A4B-it-IQ4_NL.gguf (5 rounds)
------------------------------------------------------------
Model response: 'Write exactly: Boss, I'm so ready I feel like I'm not ready.
Wait, I'm not sure if I'm doing it right.'
Stability: OK.
------------------------------------------------------------
Round 01: PP =   813.59 t/s | TG =  74.61 t/s | TTFT =  592.43 ms | Gen Time =   80.42 ms (0.08 s) | Tokens = 6
Round 02: PP =   795.43 t/s | TG =  74.17 t/s | TTFT =  604.70 ms | Gen Time =   80.90 ms (0.08 s) | Tokens = 6
Round 03: PP =   802.54 t/s | TG =  65.60 t/s | TTFT =  600.59 ms | Gen Time =   91.47 ms (0.09 s) | Tokens = 6
Round 04: PP =   815.70 t/s | TG =  62.13 t/s | TTFT =  592.13 ms | Gen Time =   96.57 ms (0.10 s) | Tokens = 6
Round 05: PP =   795.22 t/s | TG =  70.85 t/s | TTFT =  606.12 ms | Gen Time =   84.68 ms (0.08 s) | Tokens = 6

============================================================
FINAL AVERAGES - google_gemma-4-26B-A4B-it-IQ4_NL.gguf
------------------------------------------------------------
Configured Token Limit (TG): 128
Average Tokens Generated:    6.0 tokens
Average Latency (TTFT):      599.20 ms
Average Gen Time (TG):       86.81 ms (0.09 s)

 πŸ“ˆ Token Generation summary (tok/s)
 benchmark(average): β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 69.5
 server data:        β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 64.7 (-6.9%)

 πŸ“ˆ Prompt Processing summary (tok/s)
 benchmark(average): β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 804.5
 server data:        β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 637.5 (-20.8%)
============================================================

Report generated successfully in '/home/samba/llama_bench/llamabench_logs':
   - Main Results: result_google_gemma-4-26B-A4B-it-IQ4_NL_ROCm0-ROCm1_ctx131072_linux.json
   - Server Logs:  serverlog_google_gemma-4-26B-A4B-it-IQ4_NL_ROCm0-ROCm1_ctx131072_linux.json
doman@LianLi:~/start_llama$

My card is a Radeon 6600M with 8 GB of VRAM; the rest is offloaded to regular RAM (24 GB of that).
Tiny prompts are OK for me too; it's the big prompts that make it fall over here.

From what I can see, your test used a super tiny prompt. Test with a big prompt, thousands of tokens, and see if it's still stable then.
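Something like this is enough to trigger it for me (a rough sketch, not my exact command; bigprompt.txt is just any file with a few thousand tokens of text, the content doesn't matter):

$ # build a throwaway long prompt, roughly 4-5k tokens
$ python3 -c "print('The quick brown fox jumps over the lazy dog. ' * 500)" > bigprompt.txt
$ # feed it to the same invocation that crashes for me, via llama-cli's -f/--file option
$ ./llama-cli -hf bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ4_NL -ngl 15 -c 16000 -f bigprompt.txt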

The crash also happens when I start llama.cpp with "-ngl 0", so nothing from the model or context gets loaded into Vulkan at all.

P.S. All other models (Qwen3.5 etc.) are stable for me; it seems only Gemma 4 is acting this way.

Aha, a little step closer. With llama-bench I spotted that it only crashes when flash attention is turned on with "-fa 1". When I turn flash attention off, llama-cli does not crash that easily.
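Roughly what I compared (a sketch from memory, not my exact runs; -p sets the prompt length llama-bench tests with):

$ # crashes on the Vulkan build for me:
$ ./llama-bench -m google_gemma-4-26B-A4B-it-IQ4_NL.gguf -fa 1 -p 4096 -n 128
$ # survives:
$ ./llama-bench -m google_gemma-4-26B-A4B-it-IQ4_NL.gguf -fa 0 -p 4096 -n 128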

Edit:
With llama-server it seems to be the same story. If I disable context quantization and turn off flash attention, it suddenly stops crashing on big prompts. It's just dead slow for me this way, but at least it's not crashing.
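So the difference boils down to these flags; a minimal sketch of the two configurations (paths shortened, everything else left at defaults):

$ # crashes on big prompts for me: flash attention on plus q8_0 KV cache quantization
$ ./llama-server -m google_gemma-4-26B-A4B-it-IQ4_NL.gguf -c 16000 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
$ # stable but dead slow: flash attention off, default f16 KV cache
$ ./llama-server -m google_gemma-4-26B-A4B-it-IQ4_NL.gguf -c 16000 --flash-attn off

And a quick way to throw a long prompt at the server, assuming the standard OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes:

$ # build a ~3000-word chat request as JSON, then POST it
$ python3 -c 'import json; print(json.dumps({"messages":[{"role":"user","content":"word " * 3000}]}))' > payload.json
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d @payload.json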

Hmm, that might actually be a big hint in the right direction. I mentioned your comment here, where people are debugging the issue:

https://github.com/ggml-org/llama.cpp/issues/21321#issuecomment-4184299386

I'm using a Vega20 Radeon VII and an Mi50 with ROCm 7.2.1 compiled for gfx906.
At first I hit this error:

error: GGML_ASSERT(n_inputs < GGML_SCHED_MAX_SPLIT_INPUTS) failed

Adding "-fit off" helped.

I run with "--flash-attn on", so in my opinion flash attention is not the problem.
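For reference, the failing and working invocations differed only in that one flag (a sketch with the model path shortened, not my full command from above):

$ # asserts with GGML_ASSERT(n_inputs < GGML_SCHED_MAX_SPLIT_INPUTS):
$ ./llama-server -m google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 999 -c 131072
$ # loads fine:
$ ./llama-server -m google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 999 -c 131072 -fit off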

[*] Local API detected on 127.0.0.1:8080. Assuming native host execution.

=========================================================================================================
--- SYSTEM & HARDWARE ---                          | --- MEMORY ALLOCATION ---
---------------------------------------------------------------------------------------------------------
Source:       Native Host                          | VRAM Model:   14001.74 MiB (31/31 layers) [Vulkan0: 4915.3 | Vulkan1: 9086.4]
System:       Linux 6.17.0-19-generic              | KV Cache:      1519.37 MiB [q8_0] [Vulkan0: 272.0 | Vulkan1: 1088.0 | Vulkan0: 63.8 | Vulkan1: 95.6]
Architecture: gemma4                               | Compute Buf:   3487.67 MiB [Vulkan0: 1290.0 | Vulkan1: 1150.6 | Vulkan_Host: 1047.1]
Detected Port:8080                                 | RAM BPE/Meta:   577.50 MiB
MoE Experts:  8 active / 128 total                 | RAM Layers:       0.00 MiB (0/31 layers)
MMAP Status:  ON                                   | 
---------------------------------------------------------------------------------------------------------
CPU Model:    12th Gen Intel(R) Core(TM) i5-12400F
GPU Drivers:  ROCM: 7.2.1.70201, VULKAN: 1.3.275
GPUs:
  [Vulkan0] AMD Radeon VII (RADV VEGA20) (15612 MiB free)
  [Vulkan1] AMD Radeon Graphics (RADV VEGA20) (32751 MiB free)
=========================================================================================================

============================================================
                  --- System Breakdown ---                  
Server Build:   b8651-d3416a4aa
Context Limit:  131,072 tokens (Reduced from 262,144 due to lack of VRAM!)
------------------------------------------------------------
Benchmarking: google_gemma-4-26B-A4B-it-IQ4_NL.gguf (5 rounds)
------------------------------------------------------------
Model response: 'Write exactly: Boss, I'm so ready I feel like I'm not ready.
I'm sorry, I can't help with that.'
Stability: OK.
------------------------------------------------------------
Round 01: PP =   494.22 t/s | TG =  37.86 t/s | TTFT =  950.99 ms | Gen Time = 132067.79 ms (132.07 s) | Tokens = 5000
Round 02: PP =   490.53 t/s | TG =  37.96 t/s | TTFT =  964.27 ms | Gen Time = 131714.89 ms (131.71 s) | Tokens = 5000
Round 03: PP =   490.34 t/s | TG =  37.95 t/s | TTFT =  962.60 ms | Gen Time = 131748.30 ms (131.75 s) | Tokens = 5000
Round 04: PP =   500.21 t/s | TG =  37.91 t/s | TTFT =  941.61 ms | Gen Time = 131894.38 ms (131.89 s) | Tokens = 5000
Round 05: PP =   490.26 t/s | TG =  37.92 t/s | TTFT =  962.76 ms | Gen Time = 131869.05 ms (131.87 s) | Tokens = 5000

============================================================
FINAL AVERAGES - google_gemma-4-26B-A4B-it-IQ4_NL.gguf
------------------------------------------------------------
Configured Token Limit (TG): 5000
Average Tokens Generated:    5000.0 tokens
Average Latency (TTFT):      956.44 ms
Average Gen Time (TG):       131858.88 ms (131.86 s)

 πŸ“ˆ Token Generation summary (tok/s)
 benchmark(average): β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 37.9
 server data:        β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 37.9 (+0.0%)

 πŸ“ˆ Prompt Processing summary (tok/s)
 benchmark(average): β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 493.1
 server data:        β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“β–“ 450.4 (-8.7%)
============================================================

Mi50 only, Vulkan:

LD_LIBRARY_PATH=/home/doman/llama-serwer/llama-b8651/build-vulkan/bin GGML_VK_VISIBLE_DEVICES=1 /home/doman/llama-serwer/llama-b8651/build-vulkan/bin/llama-server -m /home/models/gguf/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 999 -c 131072 -fit off --cache-type-k q8_0 --cache-type-v q8_0 --jinja --parallel 1 --port 8080 --host 0.0.0.0 --no-warmup --metrics --log-file ~/server.log --log-colors off --flash-attn on

Round 01: PP =   476.54 t/s | TG =  74.66 t/s | TTFT =  988.38 ms | Gen Time = 13394.56 ms (13.39 s) | Tokens = 1000
Round 02: PP =   478.02 t/s | TG =  74.75 t/s | TTFT =  989.50 ms | Gen Time = 13378.70 ms (13.38 s) | Tokens = 1000
Round 03: PP =   471.68 t/s | TG =  74.58 t/s | TTFT =  996.44 ms | Gen Time = 13408.74 ms (13.41 s) | Tokens = 1000
Round 04: PP =   472.85 t/s | TG =  74.67 t/s | TTFT =  991.86 ms | Gen Time = 13391.74 ms (13.39 s) | Tokens = 1000
Round 05: PP =   471.98 t/s | TG =  74.56 t/s | TTFT = 1000.04 ms | Gen Time = 13412.55 ms (13.41 s) | Tokens = 1000

Mi50 only, ROCm:

HSA_OVERRIDE_GFX_VERSION=9.0.6 HSA_ENABLE_SDMA=0 ROCBLAS_TENSILE_LIBPATH=/opt/rocm-7.2.1/lib/rocblas/library/ LD_LIBRARY_PATH=/home/doman/llama-serwer/llama-b8651/build/bin HIP_VISIBLE_DEVICES=0 /home/doman/llama-serwer/llama-b8651/build-vulkan/bin/llama-server -m /home/models/gguf/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 999 -c 131072 -fit off -mg 1 --cache-type-k q8_0 --cache-type-v q8_0 --jinja --parallel 1 --port 8080 --host 0.0.0.0 --no-warmup --metrics --log-file ~/server.log --log-colors off --flash-attn on

Round 01: PP =   833.56 t/s | TG =  75.87 t/s | TTFT =  565.04 ms | Gen Time = 13180.90 ms (13.18 s) | Tokens = 1000
Round 02: PP =   831.11 t/s | TG =  75.67 t/s | TTFT =  565.51 ms | Gen Time = 13215.48 ms (13.22 s) | Tokens = 1000
Round 03: PP =   834.80 t/s | TG =  75.85 t/s | TTFT =  564.21 ms | Gen Time = 13184.73 ms (13.18 s) | Tokens = 1000
Round 04: PP =   833.02 t/s | TG =  75.73 t/s | TTFT =  567.81 ms | Gen Time = 13204.82 ms (13.20 s) | Tokens = 1000
Round 05: PP =   831.29 t/s | TG =  75.83 t/s | TTFT =  565.39 ms | Gen Time = 13187.12 ms (13.19 s) | Tokens = 1000

I have my own website where I publish reports, and it's significantly slower on Vulkan.
https://x.doman.ovh/index.php (look for yourself if you want, under > a4b).
All of the runs with the -fit off option work fine.
ROCm is slightly faster on this model.

Found a fix for my problems: I switched from my Ubuntu 22.04 Vulkan setup to llama.cpp's released Docker Vulkan images instead. These appear to run faster (huh?) and don't crash.
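For anyone else hitting this, roughly what I'm running now (a sketch; I'm assuming the server-vulkan image tag here and /dev/dri passthrough for the GPU, so check the llama.cpp Docker docs for the exact current tags):

$ docker run --rm -it --device /dev/dri -p 8080:8080 -v /path/to/models:/models \
    ghcr.io/ggml-org/llama.cpp:server-vulkan \
    -m /models/google_gemma-4-26B-A4B-it-IQ4_NL.gguf -ngl 15 -c 16000 --host 0.0.0.0 --port 8080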

kulminaator changed discussion status to closed
