llama.cpp-tq3 crashes when loading a model in "model router" mode, but works fine when run with llama-server directly.
#3
by gdevenyi - opened
Sorry for posting this here; there are no Issues enabled on the GitHub repo.
I can't tell whether this is an upstream bug or a bug in llama.cpp-tq3, because I can't test these GGUFs with upstream, but the bug is this:
This launch command works fine:
llama-server \
-m YTan2000/Qwen3.6-27B-TQ3_4S/Qwen3.6-27B-TQ3_4S.gguf \
--mmproj YTan2000/Qwen3.6-27B-TQ3_4S/mmproj.gguf \
--host 0.0.0.0 --port 8080 \
--fit on --fit-ctx 16768 -np 1 \
-ctk q8_0 -ctv tq3_0 -fa on \
--jinja \
--chat-template-file /home/gdevenyi/models/chat_template.jinja \
--chat-template-kwargs '{"preserve_thinking": true}' \
--no-mmproj-offload \
--fit-target 512 \
--cache-ram 32767 \
--no-mmap --mlock \
--props \
--min-p 0.01 --temperature 0.6 --top-k 20 --top-p 0.95 \
--reasoning-budget 8192 \
--reasoning-budget-message "My reasoning budget is exhausted, but I have enough information to answer directly now."
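For reference, when launched this way the server loads and answers requests normally; a basic request against llama-server's OpenAI-compatible /v1/chat/completions endpoint is enough to check (the model name and prompt below are just placeholders):

# minimal sanity-check request against the directly-launched server
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.6-27B", "messages": [{"role": "user", "content": "hello"}]}'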
However, if I dump this configuration into my multi-model router INI file:
[*]
chat-template-kwargs = {"preserve_thinking": true}
models-dir = /home/gdevenyi/models
parallel = 1
models-max = 1
; memory
no-mmap = true
mlock = true
fit = on
fit-target = 512
cache-ram = 32768
; compute
flash-attn = on
cache-type-k = tq3_0
cache-type-v = q8_0
jinja = true
chat-template-file = /home/gdevenyi/models/chat_template.jinja
no-mmproj-offload = true
; sampling
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.01
; allow remote control
props = true
; https://whamp.github.io/blog/qwen-3-6-27b-livecodebench-reasoning-budget/
reasoning-budget-message = "My reasoning budget is exhausted, but I have enough information to answer directly now."
reasoning-budget = 8192
[qwen3.6-27B]
model = /home/gdevenyi/models/YTan2000/Qwen3.6-27B-TQ3_4S/Qwen3.6-27B-TQ3_4S.gguf
mmproj = /home/gdevenyi/models/YTan2000/Qwen3.6-27B-TQ3_4S/mmproj.gguf
fit-ctx = 16768
cache-type-k = tq3_0
cache-type-v = q4_0
It fails with this error:
[39955] /home/gdevenyi/projects/llama/llama.cpp-tq3/ggml/src/ggml-cuda/fattn.cu:304: fatal error
[39955] [New LWP 276111]
[39955] [New LWP 276110]
[39955] [New LWP 276109]
[39955] [New LWP 276108]
[39955] [New LWP 276107]
[39955] [New LWP 276106]
[39955] [New LWP 276105]
[39955] [New LWP 276104]
[39955] [New LWP 276102]
[39955] [New LWP 276101]
[39955] [New LWP 276100]
[39955] [New LWP 276099]
[39955] [New LWP 276098]
[39955] [New LWP 276097]
[39955] [New LWP 276096]
[39955] [New LWP 275984]
[39955] [New LWP 275983]
[39955] [New LWP 275982]
[39955] [New LWP 275981]
[39955] [New LWP 275980]
[39955] [New LWP 275979]
[39955] [New LWP 275978]
[39955] [New LWP 275977]
[39955] [New LWP 275976]
[39955] [New LWP 275975]
[39955] [New LWP 275974]
[39955] [New LWP 275973]
[39955] [New LWP 275972]
[39955] [New LWP 275971]
[39955] [New LWP 275970]
[39955] [New LWP 275969]
[39955] [New LWP 275968]
[39955] [New LWP 275967]
[39955] [New LWP 275966]
[39955] [New LWP 275965]
[39955] [New LWP 275964]
[39955] [New LWP 275963]
[39955] [New LWP 275962]
[39955] [New LWP 275961]
[39955] [New LWP 275960]
[39955] [New LWP 275959]
[39955] [New LWP 275958]
[39955] [New LWP 275957]
[39955] [New LWP 275956]
[39955] [New LWP 275955]
[39955] [New LWP 275954]
[39955] [New LWP 275953]
[39955] [New LWP 275952]
[39955] [New LWP 275951]
[39955] [New LWP 275950]
[39955] [New LWP 275949]
[39955] [New LWP 275948]
[39955] [New LWP 275947]
[39955] [New LWP 275945]
[39955] [New LWP 275936]
[39955]
[39955] This GDB supports auto-downloading debuginfo from the following URLs:
[39955] <https://debuginfod.neon.kde.org/>
[39955] <https://debuginfod.ubuntu.com/>
[39955] Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
[39955] Debuginfod has been disabled.
[39955] To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[39955] [Thread debugging using libthread_db enabled]
[39955] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[39955] 0x00007faf1e110813 in __GI___wait4 (pid=276318, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
[39955] warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
[39955] #0 0x00007faf1e110813 in __GI___wait4 (pid=276318, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
[39955] 30 in ../sysdeps/unix/sysv/linux/wait4.c
[39955] #1 0x00007faf1ec4d213 in ggml_print_backtrace () from /opt/llama.cpp/lib/libggml-base.so.0
[39955] #2 0x00007faf1ec4d3bb in ggml_abort () from /opt/llama.cpp/lib/libggml-base.so.0
[39955] #3 0x00007faf1ac1940b in ggml_cuda_flash_attn_ext(ggml_backend_cuda_context&, ggml_tensor*) () from /opt/llama.cpp/lib/libggml-cuda.so.0
[39955] #4 0x00007faf1ac5dbc5 in ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) () from /opt/llama.cpp/lib/libggml-cuda.so.0
[39955] #5 0x00007faf1ac63051 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /opt/llama.cpp/lib/libggml-cuda.so.0
[39955] #6 0x00007faf1ac64e1e in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /opt/llama.cpp/lib/libggml-cuda.so.0
[39955] #7 0x00007faf1ec6aef7 in ggml_backend_sched_graph_compute_async () from /opt/llama.cpp/lib/libggml-base.so.0
[39955] #8 0x00007faf1e8c2511 in llama_context::graph_compute(ggml_cgraph*, bool) () from /opt/llama.cpp/lib/libllama.so.0
[39955] #9 0x00007faf1e8c4ee2 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /opt/llama.cpp/lib/libllama.so.0
[39955] #10 0x00007faf1e8cbb5f in llama_context::decode(llama_batch const&) () from /opt/llama.cpp/lib/libllama.so.0
[39955] #11 0x00007faf1e8cd76f in llama_decode () from /opt/llama.cpp/lib/libllama.so.0
[39955] #12 0x000055be36967ce2 in common_init_from_params(common_params&) ()
[39955] #13 0x000055be3686784c in server_context_impl::load_model(common_params&) ()
[39955] #14 0x000055be367aa706 in main ()
[39955] [Inferior 1 (process 275932) detached]
^Csrv operator(): operator(): cleaning up before exit...
srv unload_all: stopping model instance name=qwen3.6-27B
I have confirmed that router mode invokes llama-server with the same command options, but for some reason it doesn't produce the same result.
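For anyone who wants to verify that themselves, one way (just a rough sketch; the process pattern and file names are only illustrative) is to dump the actual argv of the llama-server instance the router spawns and compare it against the manual command:

# list the full command line of each running llama-server process,
# one argument per line, so it can be diffed against the manual invocation
for pid in $(pgrep -f llama-server); do
  echo "=== PID $pid ==="
  tr '\0' '\n' < /proc/$pid/cmdline
  echo
done > router-argv.txt

# then, e.g.: diff <(sort router-argv.txt) <(sort manual-argv.txt)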