Why does llama.cpp crash on the user's first input prompt?
13.11.053.395 I main: model loaded
13.11.053.398 I main: server is listening on http://0.0.0.0:8080
13.11.053.398 I main: starting the main loop...
13.11.053.400 I srv update_slots: all slots are idle
15.47.073.839 I srv log_server_r: done request: GET / 192.168.1.209 200
15.50.741.961 I srv params_from_: Chat format: peg-constructed
15.50.742.196 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
15.50.742.263 I slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
15.50.742.270 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
15.50.742.284 I slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 140032, n_keep = 0, task.n_tokens = 17
15.50.742.295 I slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
15.50.742.716 I slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 17, batch.n_tokens = 17, progress = 1.000000
15.50.742.721 I slot update_slots: id 0 | task 0 | prompt done, n_tokens = 17, batch.n_tokens = 17
15.50.742.739 I slot init_sampler: id 0 | task 0 | init sampler, took 0.01 ms, tokens: text = 17, total = 17
/home/yl/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2354: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
[New LWP 4999]
[New LWP 4998]
.....
[New LWP 4907]
This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.ubuntu.com
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fa224510813 in __GI___wait4 (pid=5069, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007fa224510813 in __GI___wait4 (pid=5069, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007fa224f6e703 in ggml_print_backtrace () from /home/yl/llama.cpp/build/bin/libggml-base.so.0
#2 0x00007fa224f6e8ab in ggml_abort () from /home/yl/llama.cpp/build/bin/libggml-base.so.0
#3 0x00007fa21eb96da7 in ggml_cuda_mul_mat_id(ggml_backend_cuda_context&, ggml_tensor*) () from /home/yl/llama.cpp/build/bin/libggml-cuda.so.0
#4 0x00007fa21eb97466 in ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) () from /home/yl/llama.cpp/build/bin/libggml-cuda.so.0
#5 0x00007fa21eb9bc47 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/yl/llama.cpp/build/bin/libggml-cuda.so.0
#6 0x00007fa21eb9e45e in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/yl/llama.cpp/build/bin/libggml-cuda.so.0
#7 0x00007fa224f8ae47 in ggml_backend_sched_graph_compute_async () from /home/yl/llama.cpp/build/bin/libggml-base.so.0
#8 0x00007fa224cc15b1 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/yl/llama.cpp/build/bin/libllama.so.0
#9 0x00007fa224cc36c4 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/yl/llama.cpp/build/bin/libllama.so.0
#10 0x00007fa224cca286 in llama_context::decode(llama_batch const&) () from /home/yl/llama.cpp/build/bin/libllama.so.0
#11 0x00007fa224ccbd1f in llama_decode () from /home/yl/llama.cpp/build/bin/libllama.so.0
#12 0x000056c01404d5f8 in server_context_impl::update_slots() ()
#13 0x000056c01409517e in server_queue::start_loop(long) ()
#14 0x000056c013facaf0 in main ()
[Inferior 1 (process 4906) detached]
Is that using ROCm? If so, you need to set the env var GGML_CUDA_DISABLE_FUSION=1 for now, as suggested by https://github.com/ggml-org/llama.cpp/issues/19659#issuecomment-3925786260
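If it helps, env vars like this go on the launch line itself so they apply only to that one invocation; the model path and flags below are placeholders for illustration, not your actual command:

```shell
# Inline env var: applies only to this llama-server invocation,
# without leaking into the parent shell.
# Path and flags are placeholders.
GGML_CUDA_DISABLE_FUSION=1 ./llama-server \
  -m /path/to/model.gguf \
  --host 0.0.0.0 --port 8080
```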
Can you give some background, e.g.:
- what quant are you using?
- what version of llama.cpp are you using? (I assume this is mainline and not ik_llama.cpp, since only certain quants run on that)
- what is the full command you are using to run llama-server?
- and yes, as sokann mentions, what CPU/RAM and GPU/VRAM configuration is your rig, e.g. Vulkan backend etc.?
what quant are you using?
Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf
what version of llama.cpp are you using? (I assume this is mainline and not ik_llama.cpp, since only certain quants run on that)
build: 8144 (c830f99cf) with GNU 13.3.0 for Linux x86_64
what is the full command you are using to run llama-server?
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-server -m /media/DATA/downloads/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf --host 0.0.0.0 --port 8080 -ngl 99 -t 11 --no-mmap -np 1 --flash-attn on --ctx-size 40000 --jinja --repeat-penalty 1.0 --log-prefix --log-timestamps --temp 0.6 --top-p 0.95 --min-p 0 --top-k 20 -s 3456575470987
and yes, as sokann mentions, what CPU/RAM and GPU/VRAM configuration is your rig, e.g. Vulkan backend etc.?
The GGML_CUDA_FORCE_CUBLAS build crashed every time; then I re-compiled with GGML_CUDA_FORCE_MMQ and it can run, but it is very slow considering it's a 2-bit quant.
Okay, so you are using the mainline-compatible smol-IQ2_XS with mainline llama.cpp; that is good and fine.
Do you have a DGX Spark, or what are you running with a CUDA backend that needs unified memory?
The GGML_CUDA_FORCE_CUBLAS build crashed every time; then I re-compiled with GGML_CUDA_FORCE_MMQ and it can run, but it is very slow considering it's a 2-bit quant.
How are you compiling this? For a CUDA backend I generally keep it simple like this and always avoid the BLAS stuff:
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_VULKAN=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES="75;86"
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON -DCMAKE_CUDA_ARCHITECTURES="75;86"
It's not a DGX Spark; it's multiple GPUs across "75;86". I can run qwen3-coder-next very well on this machine.
The sm75 arch is kind of old; what are the GPUs, and what is your actual rig, e.g. Intel or AMD 9950 or something, with DDR4 or DDR5? (11 threads??)
If you're offloading onto 2x GPUs, I'm not sure why you're trying to use the UNIFIED_MEMORY stuff? Also, your command is attempting to offload the entire model onto VRAM.
For hybrid CPU+GPU(s) on mainline llama.cpp you can use -fit on or try placing tensors manually e.g.
# ez way on mainline
-fit on
# advanced way adjust manually based on size of VRAM per CUDA device
-fit off \
-ngl 999 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*=CUDA0,blk\.(47|48|49|50|51|52|53|54|55|56|57|58|59|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
--cpu-moe \
You could also try just the sm86-arch GPU on its own, with -fit on or however you decide to offload it, to take that out of the equation.
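Side note: the left half of each -ot rule is just a regular expression matched against tensor names, so the patterns can be sanity-checked offline before loading a ~200 GiB model. A quick sketch (the tensor names here are illustrative of the usual llama.cpp naming):

```python
import re

# Each -ot rule is "<regex>=<device>"; the regex is matched against tensor
# names such as "blk.12.ffn_up_exps.weight". The rule below mirrors the
# CUDA0 half of the example command above.
rule = re.compile(r"blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn_(gate|up|down)_exps.*")

assert rule.match("blk.12.ffn_up_exps.weight") is not None   # goes to CUDA0
assert rule.match("blk.13.ffn_up_exps.weight") is None       # layer not in the list
assert rule.match("blk.5.attn_q.weight") is None             # attention tensors stay put
```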
In my previous experience, UNIFIED_MEMORY is very necessary. From the "Unified Memory" section of the docs:
"The environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as System Memory Fallback."
This is the official recommendation from:
https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
But I finally found out that the crash only happens with the 2-bit quant. Now I am using the IQ4_XS of the 122B version of Qwen3.5, and with FORCE_CUBLAS=ON everything is fine and very fast!
So, in my experience it happens only with the combination of a 2-bit quant and FORCE_CUBLAS=ON.
This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted
Oh, I still don't know your rig specs; I would generally avoid OOMing on VRAM by setting up tensor overrides up front, and stay away from the UNIFIED_MEMORY feature.
But if you got it sorted out, thanks for letting me know! Glad you found something that suits your rig better!
Thanks for your GGUFs! Good man. I use many, many RTX 2xxx and RTX 3xxx cards together (that's why -DCMAKE_CUDA_ARCHITECTURES="75;86"); it's a bit beyond the usual setup.
Ahh, yeah use all the VRAM we can get our hands on! What a wild time it is with so many models coming out!
I have the same issue with ik_llama.cpp. The model loads, but it crashes when prompt processing starts.
ik_llama.cpp build: version: 4228 (68431b04)
quant: your (ubergarm) Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf
full command (I shortened it for testing, so offloading is just minimal in this command): /media/ai/ik_llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 1235 -m /media/ai/models/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf --alias Qwen3.5-397B-A17B-IQ4_KSS -c 41920 --threads 48 --threads-batch 48 -ctk q8_0 -ctv q8_0 -ngl 99 --n-cpu-moe 64 -sm graph -ts 1,1
CPU/RAM: 56 x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (2 Sockets) with 512 GB RAM via Proxmox on Ubuntu.
error: /media/ai/ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
longer output (shortened):
/media/ai/ik_llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 1235 -m /media/ai/models/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf --alias Qwen3.5-397B-A17B-IQ4_KSS -c 41920 --threads 48 --threads-batch 48 -ctk q8_0 -ctv q8_0 -ngl 99 --n-cpu-moe 64 -sm graph -ts 1,1
INFO [ main] build info | tid="140429121015808" timestamp=1771964207 build=4228 commit="68431b04"
INFO [ main] system info | tid="140429121015808" timestamp=1771964207 n_threads=48 n_threads_batch=48 total_threads=50 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
INFO [ launch_slot_with_task] slot is processing task | tid="140429121015808" timestamp=1771964255 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140429121015808" timestamp=1771964255 id_slot=0 id_task=0 p0=0
/media/ai/ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
[New LWP 1614526]
[New LWP 1614525]
[New LWP 1614524]
.....
[New LWP 1609444]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb826110813 in __GI___wait4 (pid=1616466, stat_loc=0x7ffd6df9a454, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007fb826110813 in __GI___wait4 (pid=1616466, stat_loc=0x7ffd6df9a454, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007fb8268d2b92 in ggml_abort () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#2 0x00007fb833f0694d in llama_sample_token_with_rng_impl(llama_sampling*, llama_token_data_array*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) () from /media/ai/ik_llama.cpp/build/src/libllama.so
#3 0x00005d904488903f in llama_sampling_sample_impl(common_sampler*, llama_context*, llama_context*, int, bool) ()
#4 0x00005d90447846e7 in server_context::process_batch_tokens(int&) ()
#5 0x00005d9044785d18 in server_context::update_slots() ()
#6 0x00005d904471f892 in server_queue::start_loop() ()
#7 0x00005d90446adc72 in main ()
[Inferior 1 (process 1609424) detached]
I have the same issue
Your error looks different though. The original error from @rosspanda0 looks like:
ggml-cuda.cu:2354: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
Your error looks like:
llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
Have you had luck using ik_llama.cpp with other quants? I don't think the issue here is with the quant; it is maybe something to do with your hardware setups, ik_llama.cpp, and your choice of arguments...
Let's look at your command quickly:
/media/ai/ik_llama.cpp/build/bin/llama-server \
--host 0.0.0.0 \
--port 1235 \
-m /media/ai/models/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf \
--alias Qwen3.5-397B-A17B-IQ4_KSS \
-c 41920 \
--threads 48 \
--threads-batch 48 \
-ctk q8_0 -ctv q8_0 \
-ngl 99 \
--n-cpu-moe 64 \
-sm graph \
-ts 1,1
Well, the first thing I see is that Qwen3.5 MoE does not support -sm graph yet, so it will fall back to the default -sm layer, likely printing that out in the logs.
Hrmm.. How are you using the server? Are you using the chat completions endpoint? Or SillyTavern with text completions? (Basically, what client? Or are you using the built-in webui tool?)
You mentioned leaving off part of the command; are you messing with the samplers? Perhaps try the default sampling settings, given that your error looks like it is happening in the sampler.
Sorry for the late reply; Huggingface is rate limiting my new account.
You are right, the error is not exactly the same, but it also happens on the first request.
I am using OpenWebUI as the frontend and have not seen this error with any other model. (I also tried the webui from ik_llama.cpp; still the same.)
I also tested the unsloth-quantized model: Qwen3.5-397B-A17B-UD-Q4_K_XL. Got the same error.
I am using the following compilation command:
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_FLAGS="-gencode arch=compute_86,code=sm_86" -DGGML_CCACHE=OFF -DGGML_CUDA_FORCE_MMQ=1
Normally, I use the following launch configuration, which does not appear to have any effect in my tests:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /media/ai/ik_llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 1235 -m /media/ai/models/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf --alias Qwen3.5-397B-A17B-IQ4_KSS -c 81920 --threads 48 --numa distribute -khad -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 -ngl 99 -amb 1024 --threads-batch 48 -ub 4096 -b 4096 --jinja --merge-qkv --spec-type ngram-mod --spec-ngram-size-n 24 --spec-ngram-min-hits 48 --spec-ngram-size-m 64 --mlock --slot-save-path /var/cache/ik_llama.cpp/Qwen3.5-397B-A17B --n-cpu-moe 52 -ts 1,1
(I also use UNIFIED_MEMORY=1 so I can host more models in parallel: when switching models in OpenWebUI, the actively inferencing model gets priority and moves its files back onto the GPUs, while the unused one swaps its model files out to RAM.)
I use the chat completions endpoint. What command would you suggest to test? I am running out of ideas for getting this model running.
Thanks for any help.
I tested the following command:
With small GPU offloading:
/media/ai/ik_llama.cpp/build/bin/llama-cli -m /media/ai/models/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf -p "Hello" -c 41920 --threads 48 --threads-batch 48 -ctk q8_0 -ctv q8_0 -ngl 99 --n-cpu-moe 64 -ts 1,1
system_info: n_threads = 48 (n_threads_batch = 48) / 50 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
adaptive_target = -1.00, adaptive_decay = 0.90
sampling order:
CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature -> adaptive_p
generate: n_ctx = 41984, n_batch = 2048, n_predict = -1, n_keep = 0
Hellorather有益的itionalolong才能够/@大本much[jsmuch��总量的最后一页性价Accaboutsolong两大类 бель全部的heetsheets оба全部的mogcurrentinfra глаза全部的much��/@在哪儿ieux性价两大类
CPU only:
/media/ai/ik_llama.cpp/build/bin/llama-cli -m /media/ai/models/Qwen3.5-397B-A17B-IQ4_KSS-00001-of-00006.gguf -p "Hello" -c 41920 --threads 48 --threads-batch 48 -ctk q8_0 -ctv q8_0 -ngl 0 --n-cpu-moe 64 -ts 1,1
system_info: n_threads = 48 (n_threads_batch = 48) / 50 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
adaptive_target = -1.00, adaptive_decay = 0.90
sampling order:
CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature -> adaptive_p
generate: n_ctx = 41984, n_batch = 2048, n_predict = -1, n_keep = 0
Hello, "Hypothesis
/**
Directory
20/ / ...
154560/-
/**
/**
/**
// Copyright (195960/ /external surface is an overview of
I'd recommend you both re-compile, drop the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 stuff, and try a simpler command just to see if you can get it working.
Given the similar issues across multiple quants, it is possibly something specific to your hardware and argument choices, but it is somewhat difficult to see from here.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is an env setting; just removing it from the command will do. However, like I said, I need GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in my case.
I found this one: https://github.com/ikawrakow/ik_llama.cpp/issues/1282
I have no solution for that right now; I think it must be a bug. I recompiled everything. Mainline llama.cpp also gives me the random tokens. But I will wait; maybe it will be fixed. Thanks for your help @ubergarm, and keep up the good work. Your quants are my preferred ones.
I believe ik updated the chunked delta net implementation over the weekend due to a bug discovered. Not sure if it is working for y'all yet.
Thanks for the hint. Unfortunately it does not solve my problem. This time with quants from bartowski, but as you said, it's not because of the quants.
Log:
/media/ai/ik_llama.cpp# /media/ai/ik_llama.cpp/build/bin/llama-cli -m /media/ai/models/Qwen_Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf -c 1920 --threads 24 -ngl 99 --n-cpu-moe 64 -ts 1,1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --seed 3407 -p "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n" --verbose-prompt -np 1
Log start
main: build = 4252 (505e2c57)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 for x86_64-linux-gnu
main: seed = 3407
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
CUDA0: using device CUDA0 - 8500 MiB free
CUDA1: using device CUDA1 - 7606 MiB free
llama_model_loader: additional 5 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 50 key-value pairs and 1098 tensors from /media/ai/models/Qwen_Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.600000
llama_model_loader: - kv 5: general.name str = Qwen3.5 397B A17B
llama_model_loader: - kv 6: general.basename str = Qwen3.5
llama_model_loader: - kv 7: general.size_label str = 397B-A17B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 10: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 11: qwen35moe.block_count u32 = 60
llama_model_loader: - kv 12: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 13: qwen35moe.embedding_length u32 = 4096
llama_model_loader: - kv 14: qwen35moe.attention.head_count u32 = 32
llama_model_loader: - kv 15: qwen35moe.attention.head_count_kv u32 = 2
llama_model_loader: - kv 16: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 17: qwen35moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: qwen35moe.expert_count u32 = 512
llama_model_loader: - kv 20: qwen35moe.expert_used_count u32 = 10
llama_model_loader: - kv 21: qwen35moe.attention.key_length u32 = 256
llama_model_loader: - kv 22: qwen35moe.attention.value_length u32 = 256
llama_model_loader: - kv 23: qwen35moe.expert_feed_forward_length u32 = 1024
llama_model_loader: - kv 24: qwen35moe.expert_shared_feed_forward_length u32 = 1024
llama_model_loader: - kv 25: qwen35moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 26: qwen35moe.ssm.state_size u32 = 128
llama_model_loader: - kv 27: qwen35moe.ssm.group_count u32 = 16
llama_model_loader: - kv 28: qwen35moe.ssm.time_step_rank u32 = 64
llama_model_loader: - kv 29: qwen35moe.ssm.inner_size u32 = 8192
llama_model_loader: - kv 30: qwen35moe.full_attention_interval u32 = 4
llama_model_loader: - kv 31: qwen35moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 248044
llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 40: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 41: general.quantization_version u32 = 2
llama_model_loader: - kv 42: general.file_type u32 = 30
llama_model_loader: - kv 43: quantize.imatrix.file str = /models_out/Qwen3.5-397B-A17B-GGUF/Qw...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = /training_data/calibration_datav5.txt
llama_model_loader: - kv 45: quantize.imatrix.entries_count u32 = 765
llama_model_loader: - kv 46: quantize.imatrix.chunks_count u32 = 802
llama_model_loader: - kv 47: split.no u16 = 0
llama_model_loader: - kv 48: split.tensors.count i32 = 1098
llama_model_loader: - kv 49: split.count u16 = 6
llama_model_loader: - type f32: 451 tensors
llama_model_loader: - type q8_0: 105 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_xs: 511 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35moe
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 16
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 512
llm_load_print_meta: n_expert_used = 10
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 8192
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 64
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = 397B.A17B
llm_load_print_meta: model ftype = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params = 396.346 B
llm_load_print_meta: model size = 197.068 GiB (4.271 BPW)
llm_load_print_meta: repeating layers = 195.788 GiB (4.265 BPW, 394.312 B parameters)
llm_load_print_meta: general.name = Qwen3.5 397B A17B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248044 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 4.80 MiB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 36710.12 MiB
llm_load_tensors: CPU buffer size = 37817.28 MiB
llm_load_tensors: CPU buffer size = 37830.03 MiB
llm_load_tensors: CPU buffer size = 37830.03 MiB
llm_load_tensors: CPU buffer size = 37853.40 MiB
llm_load_tensors: CPU buffer size = 12239.52 MiB
llm_load_tensors: CPU buffer size = 515.31 MiB
llm_load_tensors: CUDA0 buffer size = 2398.12 MiB
llm_load_tensors: CUDA1 buffer size = 3044.63 MiB
....................................................................................................
llama_init_from_model: n_ctx = 2048
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 127.38 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 118.95 MiB
llama_init_from_model: KV self size = 60.00 MiB, K (f16): 30.00 MiB, V (f16): 30.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 130.03 MiB
llama_init_from_model: CUDA1 compute buffer size = 493.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 111.01 MiB
llama_init_from_model: graph nodes = 4235
llama_init_from_model: graph splits = 123
llama_init_from_model: enabling only_active_experts scheduling
system_info: n_threads = 24 / 50 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: prompt: '<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
'
main: number of tokens in prompt = 9
248045 -> '<|im_start|>'
846 -> 'user'
198 -> '
'
9419 -> 'Hello'
248046 -> '<|im_end|>'
198 -> '
'
248045 -> '<|im_start|>'
74455 -> 'assistant'
198 -> '
'
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 20, tfs_z = 1.000, top_p = 0.950, min_p = 0.000, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
adaptive_target = -1.00, adaptive_decay = 0.90
sampling order:
CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature -> adaptive_p
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 0
user
Hello
assistant
/media/ai/ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
[New LWP 2934153]
[New LWP 2934151]
.....
[New LWP 2606814]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000774cd9310813 in __GI___wait4 (pid=2935117, stat_loc=0x7ffcf80cbda4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x0000774cd9310813 in __GI___wait4 (pid=2935117, stat_loc=0x7ffcf80cbda4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000774cd9afa962 in ggml_abort () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#2 0x0000774ce83085bd in llama_sample_token_with_rng_impl(llama_sampling*, llama_token_data_array*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) () from /media/ai/ik_llama.cpp/build/src/libllama.so
#3 0x000062cd213c9a4f in llama_sampling_sample_impl(common_sampler*, llama_context*, llama_context*, int, bool) ()
#4 0x000062cd212bf554 in main ()
[Inferior 1 (process 2606798) detached]
/media/ai/ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
Yeah, it looks like a sampler configuration issue. Remove everything in your command line related to sampling, especially --min-p 0.00, since a zero min-p could plausibly cause a divide-by-zero or a similar edge case.
Always go back to the simplest default command to test before adding back all the extra complexity that makes debugging difficult, e.g. try:
/media/ai/ik_llama.cpp# /media/ai/ik_llama.cpp/build/bin/llama-cli -m /media/ai/models/Qwen_Qwen3.5-397B-A17B-IQ4_XS-00001-of-00006.gguf -c 1024 --threads 24 -ngl 99 --n-cpu-moe 64 -ts 1,1 --jinja -p "Hello\n" --verbose-prompt
@ubergarm
Thanks for your hint. I waited, testing every fix and update iteration, and also switched to the new unsloth models before answering here again.
The current state is sadly nearly the same. I tested so many flags, removed them all again, and so on.
Here are my findings and the new output, plus a question: should I file a bug report myself on ik_llama.cpp?
- When I start a chat with a very short prompt, like "test", the model answers normally and it works. But when I then send another message in the same chat (Open WebUI), it fails instantly. A longer prompt also fails right away.
- When I offload to GPU (-ngl 99) I get: /media/ai/ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
- When I don't offload (-ngl 0) I get a different error: /media/ai/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1157: GGML_ASSERT(S > 0) failed
- When I use the newest llama.cpp it doesn't crash, but the responses are gibberish:
  4.1 When I offload to GPU (-ngl 99) I get responses like this: "??????????????????????????????????????????????"
  4.2 When I don't offload (-ngl 0) I get responses like this: "B:D'0$3F3:H(C"
Log 2 and 3:
Without GPU offload:
/media/ai/ik_llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 1236 -m /media/ai/models/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf --alias Qwen3.5-397B-A17B-IQ4_KSS -c 11920 --threads 48 -ngl 0 --threads-batch 48 --jinja --n-cpu-moe 58 -ts 9,1 -grt f32
/media/ai/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1157: GGML_ASSERT(S > 0) failed
(the same assertion repeated ~45 more times, once per worker thread, with some lines interleaved)
[New LWP 1402044]
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
warning: process 1398189 is already traced by process 1405280
ptrace: Operation not permitted.
No stack.
The program is not being run.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
futex_wait (private=0, expected=2, futex_word=0x7694c1a0a5a0 ) at ../sysdeps/nptl/futex-internal.h:146
warning: 146 ../sysdeps/nptl/futex-internal.h: No such file or directory
#0 futex_wait (private=0, expected=2, futex_word=0x7694c1a0a5a0 ) at ../sysdeps/nptl/futex-internal.h:146
146 in ../sysdeps/nptl/futex-internal.h
#1 __GI___lll_lock_wait_private (futex=futex@entry=0x7694c1a0a5a0 ) at ./nptl/lowlevellock.c:34
warning: 34 ./nptl/lowlevellock.c: No such file or directory
#2 0x00007694c190e289 in __run_prefork_handlers (do_locking=do_locking@entry=true) at ./posix/register-atfork.c:118
warning: 118 ./posix/register-atfork.c: No such file or directory
#3 0x00007694c18f3fa1 in __libc_fork () at ./posix/fork.c:51
warning: 51 ./posix/fork.c: No such file or directory
#4 0x00007694c20fa950 in ggml_abort () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#5 0x00007694c3024117 in void (anonymous namespace)::FlashQKV<256, 8, 64>::normalize_and_store_1row<(anonymous namespace)::FlashMS<8, 64> >((anonymous namespace)::FlashMS<8, 64> const&, int, float*, float*, float const*) const [clone .part.0] () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#6 0x00007694c302a645 in void (anonymous namespace)::FlashQKV<256, 2, 128>::normalize_and_store<(anonymous namespace)::FlashMS<2, 128> >((anonymous namespace)::FlashMS<2, 128> const&, int, float*, float const*, float*, float*) () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#7 0x00007694c305fa17 in void (anonymous namespace)::iqk_flash_helper<256, 256, 128, (anonymous namespace)::HelperF16, (anonymous namespace)::HelperF16>((anonymous namespace)::HelperF16&, (anonymous namespace)::HelperF16&, int, int, int, int, int, float const*, char const*, float, float, float*, float const*, float*, float*) () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#8 0x00007694c311a6d4 in iqk_fa_256_256(int, int, int, int, int, int, int, int, int, float const*, void const*, void const*, void const*, float, float, float*, float const*, float*, float*) () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#9 0x00007694c2d86d1a in iqk_flash_attn_noalibi () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#10 0x00007694c2107e76 in ggml_compute_forward_flash_attn_ext_f16 () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#11 0x00007694c2145a63 in ggml_compute_forward () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#12 0x00007694c214993f in ggml_graph_compute_thread.constprop.0.isra () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#13 0x00007694c2149b19 in ggml_graph_compute._omp_fn () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#14 0x00007694d0a9c977 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
#15 0x00007694c214d708 in ggml_graph_compute () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#16 0x00007694c215ab45 in ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#17 0x00007694c2162edf in ggml_backend_sched_graph_compute_async () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#18 0x00007694d08b66d3 in llama_decode_internal(llama_context&, llama_batch) [clone .isra.0] () from /media/ai/ik_llama.cpp/build/src/libllama.so
#19 0x00007694d08b88ad in llama_decode () from /media/ai/ik_llama.cpp/build/src/libllama.so
#20 0x000057bb9e1467fb in server_context::process_batch_tokens(int&) ()
#21 0x000057bb9e148528 in server_context::update_slots() ()
#22 0x000057bb9e0e1d22 in server_queue::start_loop() ()
#23 0x000057bb9e06fea0 in main ()
Aborted (core dumped)
[Inferior 1 (process 1398189) detached]
With GPU offload:
/media/ai/ik_llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 1236 -m /media/ai/models/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf --alias Qwen3.5-397B-A17B-IQ4_KSS -c 8192 --threads 48 -ngl 99 --threads-batch 48 --jinja --n-cpu-moe 58 -ts 9,1 -grt f32 -fa off -b 256 -ub 256
slot apply_checkp: id 0 | task 159 | n_past = 9, slot.prompt.tokens.size() = 167, seq_id = 0, pos_min = 166
slot apply_checkp: id 0 | task 159 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot apply_checkp: id 0 | task 159 | erased invalidated context checkpoint (pos_min = 166, pos_max = 166, size = 186.332 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="137044508942336" timestamp=1773070028 id_slot=0 id_task=159 p0=0
slot create_check: id 0 | task 159 | created context checkpoint 1 of 8 (pos_min = 70, pos_max = 70, size = 186.331 MiB, took 414.63 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="137044508942336" timestamp=1773070032 id_slot=0 id_task=159 p0=71
/media/ai/ik_llama.cpp/src/llama-sampling.cpp:733: GGML_ASSERT(iter != probs.end()) failed
[New LWP 1375793]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ca41a310813 in __GI___wait4 (pid=1381539, stat_loc=0x7ffc333a0b14, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007ca41a310813 in __GI___wait4 (pid=1381539, stat_loc=0x7ffc333a0b14, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007ca41aafa962 in ggml_abort () from /media/ai/ik_llama.cpp/build/ggml/src/libggml.so
#2 0x00007ca42930848d in llama_sample_token_with_rng_impl(llama_sampling*, llama_token_data_array*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) () from /media/ai/ik_llama.cpp/build/src/libllama.so
#3 0x000057541a22b28f in llama_sampling_sample_impl(common_sampler*, llama_context*, llama_context*, int, bool) ()
#4 0x000057541a125e47 in server_context::process_batch_tokens(int&) ()
#5 0x000057541a127528 in server_context::update_slots() ()
#6 0x000057541a0c0d22 in server_queue::start_loop() ()
#7 0x000057541a04eea0 in main ()
[Inferior 1 (process 1366871) detached]
You're having a hard time of it eh? Let's see, my thoughts this time around:
-ngl 0 --threads-batch 48 --jinja --n-cpu-moe 58 -ts 9,1 -grt f32
I don't think you can use -ngl 0 and then also specify --n-cpu-moe 58, because that says contradictory things: "don't offload anything to the GPU, but then keep only some routed experts on the CPU", which is not what you want.
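To illustrate, here are two self-consistent offload configurations (command sketches only: the model path is copied from the logs above, and the exact flags should be double-checked against your build's --help output):

```shell
# Fully CPU: nothing is offloaded, so --n-cpu-moe has no role here
./build/bin/llama-server \
    -m /media/ai/models/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf \
    -c 8192 -ngl 0

# Hybrid: offload all layers, then pin 58 routed-expert layers back to CPU
./build/bin/llama-server \
    -m /media/ai/models/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf \
    -c 8192 -ngl 99 --n-cpu-moe 58
```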
When I use the newest llama.cpp it doesn't crash, but the responses are gibberish:
That isn't good. Typically "????" or "...." output means there are NaN issues: some numbers are blowing up due to numerical instability. This can indicate the quant has problems (e.g. a bad download or something). You can check that by switching quants, which you've done, or to be sure you can add --validate-quants to your ik_llama.cpp command and it will check for problems before starting. (I do this before releasing my quants to ensure they are good.)
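For example, the check can be bolted onto the existing server command from the logs above (a sketch: --validate-quants is the flag named in this thread, so confirm the spelling against your build's help output):

```shell
# Validate the quantized tensors before serving; a damaged download
# should be reported here instead of producing NaN gibberish later.
./build/bin/llama-server --validate-quants \
    -m /media/ai/models/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf \
    -c 8192 -ngl 0
```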
It's hard to keep track; is this correct for your system:
- CPU: Intel Xeon Gold 6132 @ 2.60GHz (dual socket), 512GB RAM via Proxmox on Ubuntu (14 physical cores each)
- GPU: 2× NVIDIA GeForce RTX 3090 (24GB VRAM each, compute capability 8.6)
- Total system RAM: 512GB (available within the Proxmox VM, or on the host?)
Here is what I suggest:
- Run ik_llama.cpp's llama-server --validate-quants ... to make sure your quant is good.
- Compile CPU-only, e.g.:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_CCACHE=OFF
cmake --build build --config Release -j$(nproc)
- Do you have issues with the smaller Qwen3.5 MoE models too, like the 35B or 9B? It might make testing easier if you can reproduce the problem with those.
Then try to run it without -ngl at all and without any --n-cpu-moe flags, just CPU only. See if that works to start off.
Also, how many physical cores are passed through to your Proxmox VM? To keep it simple, start with just -t 14 for now, so you don't use more threads than you have physical cores available.
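Putting the suggestions above together, a minimal CPU-only sanity check might look like this (a sketch: the model path comes from the logs above, and -t 14 follows the physical-core suggestion):

```shell
# Minimal test run: no -ngl, no --n-cpu-moe, short context,
# threads capped at the physical core count of one socket
./build/bin/llama-cli \
    -m /media/ai/models/Qwen3.5-397B-A17B-UD-IQ4_XS-00001-of-00005.gguf \
    -c 1024 -t 14 -p "Hello\n" --verbose-prompt
```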
Keep us posted!
Hi @ubergarm ,
I want to thank you for your very helpful support.
I tried the --validate-quants parameter -> everything is fine with the quants.
I tried to compile with -DGGML_CUDA=OFF -> it failed with a different GGML_ASSERT.
Then I made a new, fresh git clone of the ik_llama.cpp repo -> that was the solution.
To my surprise, it works out of the box.
I don't know what the culprit was, but I have been using ik_llama for many years (early bird) and time flies.
You helped me a lot with this, so many thanks. Thanks to the parameter tests I also found out that my default parameters should be different now.
I now get much more performance with small parameter tweaks (e.g. lowering the thread count).