Model won't load with vision in KCPP

#5 by ReXommendation - opened

Welcome to KoboldCpp - Version 1.102
For command line arguments, please refer to --help


Auto Selected CUDA Backend (flag=0)

Loading Chat Completions Adapter: /run/media/rexommendation/Mass Archive/Programs/koboldcpp/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
System: Linux #1 SMP PREEMPT_DYNAMIC Fri, 19 Sep 2025 16:11:04 +0000 x86_64
Detected Available GPU Memory: 24576 MB
Unable to determine available RAM
Initializing dynamic library: koboldcpp_cublas.so

Namespace(model=[], model_param='/home/rexommendation/Programs/koboldcpp/model/GGUF/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf', port=5001, port_param=5001, host='', launch=True, config=None, threads=1, usecuda=['normal', '0', 'mmq'], usevulkan=None, useclblast=None, usecpu=False, contextsize=8192, gpulayers=99, tensor_split=None, version=False, analyze='', maingpu=-1, batchsize=512, blasthreads=None, lora=None, loramult=1.0, noshift=False, nofastforward=False, useswa=False, ropeconfig=[0.0, 10000.0], overridenativecontext=0, usemmap=False, usemlock=False, noavx2=False, failsafe=False, debugmode=0, onready='', benchmark=None, prompt='', cli=False, genlimit=0, multiuser=1, multiplayer=False, websearch=False, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, savedatafile=None, quiet=False, ssl=None, nocertify=False, mmproj='/run/media/rexommendation/Mass Archive/Programs/koboldcpp/model/mmproj/mmproj-Qwen3-VL-30B-A3B-Instruct-BF16.gguf', mmprojcpu=False, visionmaxres=1024, draftmodel=None, draftamount=8, draftgpulayers=999, draftgpusplit=None, password=None, ratelimit=0, ignoremissing=False, chatcompletionsadapter='AutoGuess', jinja=False, flashattention=True, lowvram=False, quantkv=1, forceversion=0, smartcontext=False, unpack='', exportconfig='', exporttemplate='', nomodel=False, moeexperts=-1, moecpu=0, defaultgenamt=768, nobostoken=False, enableguidance=False, maxrequestsize=32, overridekv=None, overridetensors=None, showgui=False, skiplauncher=False, singleinstance=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=7, sdclamped=0, sdclampedsoft=0, sdt5xxl='', sdclip1='', sdclip2='', sdphotomaker='', sdflashattention=False, sdoffloadcpu=False, sdvaecpu=False, sdclipgpu=False, sdconvdirect='off', sdvae='', sdvaeauto=False, sdquant=0, sdlora='', sdloramult=1.0, sdtiledvae=768, sdgendefaults='', whispermodel='', ttsmodel='', ttswavtokenizer='', ttsgpu=False, ttsmaxlen=4096, 
ttsthreads=0, embeddingsmodel='', embeddingsmaxctx=0, embeddingsgpu=False, admin=False, adminpassword='', admindir='', hordeconfig=None, sdconfig=None, noblas=False, nommap=False, sdnotile=False)

Loading Text Model: /home/rexommendation/Programs/koboldcpp/model/GGUF/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf

The reported GGUF Arch is: qwen3vlmoe
Arch Category: 0


Identified as GGUF model.
Attempting to Load...

Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True

Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (Tesla P40) (0000:03:00.0) - 24284 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 579 tensors from /home/rexommendation/Programs/koboldcpp/model/GGUF/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size = 16.49 GiB (4.64 BPW)
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vlmoe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 8192
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: mrope sections = [24, 20, 20, 0]
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3-Vl-30B-A3B-Instruct
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 579
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 166.92 MiB
load_tensors: CUDA0 model buffer size = 16722.37 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
...................................................................................................

MRope is used, context shift will be disabled!
Automatic RoPE Scaling: Using model internal value.
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8448
llama_context: n_ctx_seq = 8448
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8448) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 420.75 MiB
llama_kv_cache: size = 420.75 MiB ( 8448 cells, 48 layers, 1/1 seqs), K (q8_0): 210.38 MiB, V (q8_0): 210.38 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 4632
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
llama_context: CUDA0 compute buffer size = 300.75 MiB
llama_context: CUDA_Host compute buffer size = 20.52 MiB
llama_context: graph nodes = 3031
llama_context: graph splits = 2
Threadpool set to 1 threads and 1 blasthreads...
attach_threadpool: call

Attempting to apply Multimodal Projector: /run/media/rexommendation/Mass Archive/Programs/koboldcpp/model/mmproj/mmproj-Qwen3-VL-30B-A3B-Instruct-BF16.gguf
clip_model_loader: model name: Qwen3-Vl-30B-A3B-Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 352
clip_model_loader: n_kv: 31

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048

--- vision hparams ---
load_hparams: image_size: 768
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 8192
load_hparams: image_max_pixels: 2097152

load_hparams: model size: 1036.66 MiB
load_hparams: metadata size: 0.15 MiB
load_tensors: loaded 352 tensors from /run/media/rexommendation/Mass Archive/Programs/koboldcpp/model/mmproj/mmproj-Qwen3-VL-30B-A3B-Instruct-BF16.gguf
alloc_compute_meta: warmup with image size = 512 x 512
alloc_compute_meta: CUDA0 compute buffer size = 43.52 MiB
alloc_compute_meta: CPU compute buffer size = 3.02 MiB
alloc_compute_meta: graph splits = 1, nodes = 853
warmup: flash attention is enabled
gpttype_load_model: mmproj vision embedding mismatch (8192 and 2048)! Make sure you use the correct mmproj file!
Load Text Model OK: False

Error: Could not load text model: /home/rexommendation/Programs/koboldcpp/model/GGUF/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf
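For anyone hitting the same failure: going by the numbers in this log, the text model reports `n_embd = 2048` but `n_embd_inp = 8192` (4x the base width, consistent with Qwen3-VL's deepstack feature, which concatenates extra vision feature levels onto the embedding), while the mmproj reports `projection_dim: 2048`. A loader that compares only those two top-level dimensions would reject the pair exactly as shown. The sketch below is a hypothetical reconstruction of that check using the values from this log; the function names and the `DEEPSTACK_LAYERS` count are assumptions, not KoboldCpp's actual source.

```python
# Hypothetical sketch of the mmproj compatibility check that fails above,
# built from the dimensions printed in this log (NOT KoboldCpp's real code).

DEEPSTACK_LAYERS = 3  # assumption: Qwen3-VL concatenates 3 extra vision feature levels


def text_input_embd(n_embd: int, deepstack_layers: int) -> int:
    """Width the text model expects for vision-token embeddings
    when deepstack features are concatenated onto the base embedding."""
    return n_embd * (1 + deepstack_layers)


def mmproj_compatible(n_embd_inp: int, projection_dim: int) -> bool:
    """Naive check comparing only the two top-level dims,
    which is what the 'vision embedding mismatch (8192 and 2048)' error suggests."""
    return n_embd_inp == projection_dim


n_embd = 2048          # print_info: n_embd        (text model log)
projection_dim = 2048  # load_hparams: projection_dim (mmproj log)

n_embd_inp = text_input_embd(n_embd, DEEPSTACK_LAYERS)
print(n_embd_inp)                                   # 8192, matching print_info: n_embd_inp
print(mmproj_compatible(n_embd_inp, projection_dim))  # False -> load aborts
```

If this reading is right, the mmproj file itself is fine and the fix is a KoboldCpp/llama.cpp build whose mmproj loader understands qwen3vlmoe's deepstack layout, rather than a different mmproj.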
