Has anyone tried VLMs in the ik_llama.cpp fork?
I'm stuck trying to load an IQ4_NL quant in ik_llama.cpp with graph split mode (-sm graph), but with no luck.
It loads with the mmproj perfectly (I tried all three mmproj versions), but then it doesn't generate any tokens when I type anything or upload a picture; it just gets stuck loading in ik_llama.cpp's integrated webui. Text-only works fine when I load the model without the mmproj. Has anyone had this working, or is it a general problem? Any ideas on how to fix it would be appreciated.
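For what it's worth, the hang can be checked outside the webui by hitting the server's API directly (a minimal sketch; the port matches my server flags, and the timeout value is arbitrary):

```shell
# Bypass the webui: send a chat completion straight to llama-server.
# --max-time bounds the wait, so a hung generation gives up after 30 s
# instead of spinning forever.
curl --max-time 30 http://127.0.0.1:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
        "stream": false
      }'
```

If this also returns nothing (or times out) with the mmproj loaded, the problem is in the server rather than the webui.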
My log:
~/ik_llama.cpp$ export CUDA_VISIBLE_DEVICES=1,2,3
cd ~/ik_llama.cpp
./build/bin/llama-server \
  --model "/mnt/d/models/qweqn35/Qwen3.5-122B-A10B-IQ4_NL-00001-of-00003.gguf" \
  --mmproj "/mnt/d/models/qweqn35/mmproj-F16.gguf" \
  --ctx-size 10000 \
  -fa on \
  -sm graph -ngl 99 \
  -ts 0.9,1,1 \
  -b 128 -ub 128 \
  --host 127.0.0.1 --port 8085 \
  --no-mmap
INFO [ main] build info | tid="134889156063232" timestamp=1772681529 build=4252 commit="505e2c57"
INFO [ main] system info | tid="134889156063232" timestamp=1772681529 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
=============================== NCCL main communicator initialized
=============================== NCCL pair communicators for 3 GPUs initialized
CUDA0: using device CUDA0 - 23168 MiB free
CUDA1: using device CUDA1 - 23184 MiB free
CUDA2: using device CUDA2 - 23184 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 879 tensors from /mnt/d/models/qweqn35/Qwen3.5-122B-A10B-IQ4_NL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
.
Adjust batch size for mtmd: u_batch = 128, batch = 128
llama_init_from_model: n_ctx = 10240
llama_init_from_model: n_batch = 128
llama_init_from_model: n_ubatch = 128
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA_Split KV buffer size = 240.01 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 49.69 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 53.83 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 45.55 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 0 MiB
Device 1: 120 MiB
Device 2: 120 MiB
llama_init_from_model: KV self size = 240.00 MiB, K (f16): 120.00 MiB, V (f16): 120.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 44.22 MiB
llama_init_from_model: CUDA1 compute buffer size = 36.97 MiB
llama_init_from_model: CUDA2 compute buffer size = 123.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 4.00 MiB
llama_init_from_model: graph nodes = 5934
llama_init_from_model: graph splits = 230
llama_init_from_model: enabling only_active_experts scheduling
clip_model_loader: model name: Qwen3.5-122B-A10B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 334
clip_model_loader: n_kv: 35
clip_model_loader: has vision encoder
clip_ctx: have 4 back-ends:
0: CPU
1: CUDA0
2: CUDA1
3: CUDA2
ggml_backend_cuda_context: a context for device 0 already exists?
clip_ctx: CLIP using CUDA0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 3072
--- vision hparams ---
load_hparams: image_size: 768
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 8192
load_hparams: image_max_pixels: 4194304
load_hparams: model size: 866.61 MiB
load_hparams: metadata size: 0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta: CUDA0 compute buffer size = 558.58 MiB
alloc_compute_meta: CPU compute buffer size = 24.93 MiB
alloc_compute_meta: graph splits = 1, nodes = 3739
warmup: flash attention is disabled
INFO [ load_model] loaded multimodal model, '%s'
| ="/mnt/d/models/qweqn35/mmproj-F16.gguf"
WARN [ load_model] %s
| ="ctx_shift is not supported by multimodal, it will be disabled"
INFO [ init] initializing slots | tid="134889156063232" timestamp=1772681940 n_slots=1
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: , end:
use --reasoning-tokens none to disable.
INFO [ init] new slot | tid="134889156063232" timestamp=1772681940 id_slot=0 n_ctx_slot=10240
prompt cache is enabled, size limit: 8192 MiB
use --cache-ram 0 to disable the prompt cache
no implementations specified for speculative decoding
INFO [ main] model loaded | tid="134889156063232" timestamp=1772681941
slot init: id 0 | task -1 | speculative decoding context not initialized
INFO [ main] chat template | tid="134889156063232" timestamp=1772681941 chat_template="
INFO [ main] chat template | tid="134889156063232" timestamp=1772681941 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="134889156063232" timestamp=1772681941 n_threads_http="23" port="8085" hostname="127.0.0.1"
INFO [ slots_idle] all slots are idle | tid="134889156063232" timestamp=1772681941
INFO [ log_server_request] request | tid="134886190997504" timestamp=1772681968 remote_addr="127.0.0.1" remote_port=37502 status=200 method="GET" path="/v1/props" params={}
INFO [ log_server_request] request | tid="134886190997504" timestamp=1772681968 remote_addr="127.0.0.1" remote_port=37502 status=200 method="GET" path="/v1/props" params={}
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="134889156063232" timestamp=1772681968 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="134889156063232" timestamp=1772681968 id_slot=0 id_task=0 p0=0
srv stop: cancel task, id_task = 0
INFO [ log_server_request] request | tid="134886190997504" timestamp=1772682164 remote_addr="127.0.0.1" remote_port=37502 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ log_server_request] request | tid="134885124169728" timestamp=1772682199 remote_addr="127.0.0.1" remote_port=33846 status=200 method="POST" path="/v1/chat/completions" params={}
srv stop: cancel task, id_task = 3
INFO [ log_server_request] request | tid="134885115777024" timestamp=1772682308 remote_addr="127.0.0.1" remote_port=57618 status=200 method="GET" path="/v1/props" params={}