Anyone tried the VLM in the ik_llama.cpp fork?

#10
by theracn - opened

I'm somehow stuck trying to load the IQ4_NL quant in ik_llama.cpp to use the graph split function, but with no luck.
It loads together with the mmproj perfectly (I actually tried all three mmproj versions), but then it doesn't generate any tokens when I write anything or upload a picture, and just gets stuck loading in ik_llama.cpp's integrated webui. Text-only (without the mmproj) works normally. Has anyone had it working, or is it a general problem? I'd be grateful for any ideas on how to fix it.
My log:
~/ik_llama.cpp$ export CUDA_VISIBLE_DEVICES=1,2,3

cd ~/ik_llama.cpp

./build/bin/llama-server \
  --model "/mnt/d/models/qweqn35/Qwen3.5-122B-A10B-IQ4_NL-00001-of-00003.gguf" \
  --mmproj "/mnt/d/models/qweqn35/mmproj-F16.gguf" \
  --ctx-size 10000 \
  -fa on \
  -sm graph -ngl 99 \
  -ts 0.9,1,1 \
  -b 128 -ub 128 \
  --host 127.0.0.1 --port 8085 \
  --no-mmap
INFO [ main] build info | tid="134889156063232" timestamp=1772681529 build=4252 commit="505e2c57"
INFO [ main] system info | tid="134889156063232" timestamp=1772681529 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
=============================== NCCL main communicator initialized
=============================== NCCL pair communicators for 3 GPUs initialized
CUDA0: using device CUDA0 - 23168 MiB free
CUDA1: using device CUDA1 - 23184 MiB free
CUDA2: using device CUDA2 - 23184 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 879 tensors from /mnt/d/models/qweqn35/Qwen3.5-122B-A10B-IQ4_NL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[...]
Adjust batch size for mtmd: u_batch = 128, batch = 128
llama_init_from_model: n_ctx = 10240
llama_init_from_model: n_batch = 128
llama_init_from_model: n_ubatch = 128
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA_Split KV buffer size = 240.01 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 49.69 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 53.83 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 45.55 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 0 MiB
Device 1: 120 MiB
Device 2: 120 MiB
llama_init_from_model: KV self size = 240.00 MiB, K (f16): 120.00 MiB, V (f16): 120.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 44.22 MiB
llama_init_from_model: CUDA1 compute buffer size = 36.97 MiB
llama_init_from_model: CUDA2 compute buffer size = 123.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 4.00 MiB
llama_init_from_model: graph nodes = 5934
llama_init_from_model: graph splits = 230
llama_init_from_model: enabling only_active_experts scheduling
clip_model_loader: model name: Qwen3.5-122B-A10B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 334
clip_model_loader: n_kv: 35
clip_model_loader: has vision encoder
clip_ctx: have 4 back-ends:
0: CPU
1: CUDA0
2: CUDA1
3: CUDA2
ggml_backend_cuda_context: a context for device 0 already exists?
clip_ctx: CLIP using CUDA0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 3072
--- vision hparams ---
load_hparams: image_size: 768
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 8192
load_hparams: image_max_pixels: 4194304

load_hparams: model size: 866.61 MiB
load_hparams: metadata size: 0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta: CUDA0 compute buffer size = 558.58 MiB
alloc_compute_meta: CPU compute buffer size = 24.93 MiB
alloc_compute_meta: graph splits = 1, nodes = 3739
warmup: flash attention is disabled
INFO [ load_model] loaded multimodal model, '%s'
| ="/mnt/d/models/qweqn35/mmproj-F16.gguf"
WARN [ load_model] %s
| ="ctx_shift is not supported by multimodal, it will be disabled"
INFO [ init] initializing slots | tid="134889156063232" timestamp=1772681940 n_slots=1
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: , end:
use --reasoning-tokens none to disable.
INFO [ init] new slot | tid="134889156063232" timestamp=1772681940 id_slot=0 n_ctx_slot=10240
prompt cache is enabled, size limit: 8192 MiB
use --cache-ram 0 to disable the prompt cache
no implementations specified for speculative decoding
INFO [ main] model loaded | tid="134889156063232" timestamp=1772681941
slot init: id 0 | task -1 | speculative decoding context not initialized
INFO [ main] chat template | tid="134889156063232" timestamp=1772681941 chat_template="
INFO [ main] chat template | tid="134889156063232" timestamp=1772681941 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="134889156063232" timestamp=1772681941 n_threads_http="23" port="8085" hostname="127.0.0.1"
INFO [ slots_idle] all slots are idle | tid="134889156063232" timestamp=1772681941
INFO [ log_server_request] request | tid="134886190997504" timestamp=1772681968 remote_addr="127.0.0.1" remote_port=37502 status=200 method="GET" path="/v1/props" params={}
INFO [ log_server_request] request | tid="134886190997504" timestamp=1772681968 remote_addr="127.0.0.1" remote_port=37502 status=200 method="GET" path="/v1/props" params={}
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="134889156063232" timestamp=1772681968 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="134889156063232" timestamp=1772681968 id_slot=0 id_task=0 p0=0
srv stop: cancel task, id_task = 0
INFO [ log_server_request] request | tid="134886190997504" timestamp=1772682164 remote_addr="127.0.0.1" remote_port=37502 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ log_server_request] request | tid="134885124169728" timestamp=1772682199 remote_addr="127.0.0.1" remote_port=33846 status=200 method="POST" path="/v1/chat/completions" params={}
srv stop: cancel task, id_task = 3
INFO [ log_server_request] request | tid="134885115777024" timestamp=1772682308 remote_addr="127.0.0.1" remote_port=57618 status=200 method="GET" path="/v1/props" params={}
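One way to narrow this down might be to bypass the integrated webui and hit the server's OpenAI-compatible `/v1/chat/completions` endpoint directly, first text-only and then with an image, to see whether the hang is in the webui or in the mmproj path. A minimal sketch, assuming the server is reachable at the host/port from the command above and that images can be inlined as base64 data URLs (the function and file names here are just examples):

```python
import base64
import json
import urllib.request

def build_payload(prompt, image_path=None):
    """Build an OpenAI-style chat payload; optionally attach an image
    as a base64 data URL inside the user message content."""
    content = [{"type": "text", "text": prompt}]
    if image_path:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"messages": [{"role": "user", "content": content}],
            "max_tokens": 64}

def send(payload, url="http://127.0.0.1:8085/v1/chat/completions"):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example usage (with the server running):
#   print(send(build_payload("Hello")))                          # text-only
#   print(send(build_payload("Describe this image", "test.png")))  # with image
```

If the text-only request answers but the image request never returns tokens, that would point at the vision/mmproj path rather than the webui.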
