What version of llama.cpp are you using? I'm getting an error when I try to run it with version 12.4.
I'm still working out the kinks. I've gotten the text models to perform inference no problem, but I'm still figuring out the mmproj. This ended up being a PoC for a separate script; I had no idea it would get this much attention! I'll update with my fork once it's working on both text and image.
Hey, any news on this? This seems to be a pretty good small model. It would be awesome to have proper support in LM Studio (which is llama.cpp-based).
clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
clip_init: failed to load model '.\LLM\mtmd\mmproj-Step3-VL-10b-F16.gguf': load_hparams: unknown projector type: step3vl
mtmd_init_from_file: error: Failed to load CLIP model from .\LLM\mtmd\mmproj-Step3-VL-10b-F16.gguf
Without -mm .\LLM\mtmd\mmproj-Step3-VL-10b-F16.gguf it works fine. Tested on llama-b7954-bin-win-cuda-12.4-x64 and llama-b8194-bin-win-cuda-12.4-x64.
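As a quick sanity check, one could confirm which projector type the mmproj file actually declares in its metadata before blaming the binary. A minimal sketch, assuming the `gguf` Python package (pip install gguf) and that the key is something like clip.projector_type (the helper just matches the suffix rather than hardcoding the key):

```python
# Sketch: print the projector type an mmproj GGUF declares, to confirm the
# loader error matches the file's metadata. Assumes the `gguf` package;
# its field-decoding layout may differ across versions.
import sys

def find_projector_key(field_names):
    """Return the first metadata key that names a projector type, if any."""
    for name in field_names:
        if name.endswith("projector_type"):
            return name
    return None

if __name__ == "__main__":
    from gguf import GGUFReader
    reader = GGUFReader(sys.argv[1])  # path to the mmproj .gguf
    key = find_projector_key(reader.fields)
    if key is None:
        print("no projector_type key found")
    else:
        field = reader.fields[key]
        # For string fields the value bytes sit in parts[data[0]]
        # (per gguf-py's GGUFReader layout; adjust for your gguf version).
        print(key, "=", bytes(field.parts[field.data[0]]).decode("utf-8"))
```

If this prints step3vl, the file is fine and the failure is purely the build not knowing that projector type yet.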
Hey, could you try my version? I converted it successfully; it requires changes to both llama.cpp and the Python conversion script. I also uploaded the binary (p13 on my repo).
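For anyone wondering why both sides need patching: the converter writes the projector-type string into the GGUF, and llama.cpp's CLIP loader checks that string against a table compiled into the binary. A schematic sketch of that check, with hypothetical names (not llama.cpp's real identifiers):

```python
# Illustrative sketch of the check behind "load_hparams: unknown projector
# type": the projector string read from GGUF metadata must be in a table
# baked into the build, so binaries predating step3vl support reject the
# mmproj file. Names and table contents here are hypothetical.
KNOWN_PROJECTORS = {"mlp", "ldp"}  # a patched build would also add "step3vl"

def load_hparams(projector_type: str) -> str:
    if projector_type not in KNOWN_PROJECTORS:
        raise ValueError(f"unknown projector type: {projector_type}")
    return projector_type
```

This is why a newer prebuilt release alone doesn't help until step3vl support lands upstream: both the writer (conversion script) and the reader (llama.cpp) have to agree on the new type.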
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU model buffer size = 486.86 MiB
load_tensors: CUDA0 model buffer size = 5921.78 MiB
.......................................................................................
common_init_result: added </s> logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 1152.00 MiB
llama_kv_cache: size = 1152.00 MiB ( 8192 cells, 36 layers, 1/1 seqs), K (f16): 576.00 MiB, V (f16): 576.00 MiB
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 1219.00 MiB
sched_reserve: CUDA_Host compute buffer size = 128.05 MiB
sched_reserve: graph nodes = 1267
sched_reserve: graph splits = 2
sched_reserve: reserve took 24.19 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
D:\Apps\git_cloned_repos\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
I also tried with the original ggml-cuda.dll, with almost the same result:
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU model buffer size = 486.86 MiB
load_tensors: CUDA0 model buffer size = 5921.78 MiB
.......................................................................................
common_init_result: added </s> logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 1152.00 MiB
llama_kv_cache: size = 1152.00 MiB ( 8192 cells, 36 layers, 1/1 seqs), K (f16): 576.00 MiB, V (f16): 576.00 MiB
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 1219.00 MiB
sched_reserve: CUDA_Host compute buffer size = 128.05 MiB
sched_reserve: graph nodes = 1267
sched_reserve: graph splits = 2
sched_reserve: reserve took 68.77 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)