What version of llama.cpp are you using? I'm getting an error when I try to run it with version 12.4.
I'm still working out the kinks. I've gotten the text models to perform inference no problem, but I'm still figuring out the mmproj. This ended up being a PoC for a separate script; I had no idea it would get this much attention! I'll update with my fork once it's working on both text and image.
Hey, any news on this? This seems to be a pretty good small model. It would be awesome to have proper support in LM Studio (which is llama.cpp-based).
clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
clip_init: failed to load model '.\LLM\mtmd\mmproj-Step3-VL-10b-F16.gguf': load_hparams: unknown projector type: step3vl
mtmd_init_from_file: error: Failed to load CLIP model from .\LLM\mtmd\mmproj-Step3-VL-10b-F16.gguf
Without -mm .\LLM\mtmd\mmproj-Step3-VL-10b-F16.gguf it works fine. Tested on llama-b7954-bin-win-cuda-12.4-x64 and llama-b8194-bin-win-cuda-12.4-x64.
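As a quick sanity check, one could confirm which projector type the mmproj file actually declares in its metadata before blaming the binary. A minimal sketch, assuming the `gguf` Python package (pip install gguf) and that the key is something like clip.projector_type (the helper just matches the suffix rather than hardcoding the key):

```python
# Sketch: print the projector type an mmproj GGUF declares, to confirm the
# loader error matches the file's metadata. Assumes the `gguf` package;
# its field-decoding layout may differ across versions.
import sys

def find_projector_key(field_names):
    """Return the first metadata key that names a projector type, if any."""
    for name in field_names:
        if name.endswith("projector_type"):
            return name
    return None

if __name__ == "__main__":
    from gguf import GGUFReader
    reader = GGUFReader(sys.argv[1])  # path to the mmproj .gguf
    key = find_projector_key(reader.fields)
    if key is None:
        print("no projector_type key found")
    else:
        field = reader.fields[key]
        # For string fields the value bytes sit in parts[data[0]]
        # (per gguf-py's GGUFReader layout; adjust for your gguf version).
        print(key, "=", bytes(field.parts[field.data[0]]).decode("utf-8"))
```

If this prints step3vl, the file is fine and the failure is purely the build not knowing that projector type yet.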
Hey, could you try my version? I converted it successfully; it requires changes to both llama.cpp and the Python conversion script. I also uploaded the binary (p13 on my repo).
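For anyone wondering why both sides need patching: the converter writes the projector-type string into the GGUF, and llama.cpp's CLIP loader checks that string against a table compiled into the binary. A schematic sketch of that check, with hypothetical names (not llama.cpp's real identifiers):

```python
# Illustrative sketch of the check behind "load_hparams: unknown projector
# type": the projector string read from GGUF metadata must be in a table
# baked into the build, so binaries predating step3vl support reject the
# mmproj file. Names and table contents here are hypothetical.
KNOWN_PROJECTORS = {"mlp", "ldp"}  # a patched build would also add "step3vl"

def load_hparams(projector_type: str) -> str:
    if projector_type not in KNOWN_PROJECTORS:
        raise ValueError(f"unknown projector type: {projector_type}")
    return projector_type
```

This is why a newer prebuilt release alone doesn't help until step3vl support lands upstream: both the writer (conversion script) and the reader (llama.cpp) have to agree on the new type.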
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU model buffer size = 486.86 MiB
load_tensors: CUDA0 model buffer size = 5921.78 MiB
.......................................................................................
common_init_result: added </s> logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 1152.00 MiB
llama_kv_cache: size = 1152.00 MiB ( 8192 cells, 36 layers, 1/1 seqs), K (f16): 576.00 MiB, V (f16): 576.00 MiB
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 1219.00 MiB
sched_reserve: CUDA_Host compute buffer size = 128.05 MiB
sched_reserve: graph nodes = 1267
sched_reserve: graph splits = 2
sched_reserve: reserve took 24.19 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
D:\Apps\git_cloned_repos\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
I also tried with the original ggml-cuda.dll, with almost the same result:
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 35 repeating layers to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU model buffer size = 486.86 MiB
load_tensors: CUDA0 model buffer size = 5921.78 MiB
.......................................................................................
common_init_result: added </s> logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 1152.00 MiB
llama_kv_cache: size = 1152.00 MiB ( 8192 cells, 36 layers, 1/1 seqs), K (f16): 576.00 MiB, V (f16): 576.00 MiB
sched_reserve: reserving ...
sched_reserve: CUDA0 compute buffer size = 1219.00 MiB
sched_reserve: CUDA_Host compute buffer size = 128.05 MiB
sched_reserve: graph nodes = 1267
sched_reserve: graph splits = 2
sched_reserve: reserve took 68.77 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)