koboldcpp won't load MS3.2-24B-Magnum-Diamond.i1-Q6_K

#1
by plowthat1998 - opened

I'm trying to use koboldcpp to load MS3.2-24B-Magnum-Diamond.i1-Q6_K, but every time I try, whether from my original download of the model or after having koboldcpp download it itself, I'm told the model couldn't be loaded, and once I close the notification koboldcpp shuts down.

Have you tried loading it in the latest llama.cpp? llama.cpp is the GGUF reference implementation and the tool most likely to load our models; it is also what we used to quantize this model and compute its imatrix. Given that we were able to compute an imatrix for it, it is likely to work, but I will give it a try.

I'm used to koboldcpp, and llama.cpp definitely seems different. For one, when I ran the server start command in the Windows cmd terminal, I was told my argument for the model's file path was wrong.

I got llama.cpp to find the file by changing a space in the file path to an underscore, but now I'm getting this error when trying to load the model there too:
ggml_vulkan: Device memory allocation of size 9433417728 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 9433417728
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model 'B:\Local_ai_chat\kobold_GGUFs\MS3.2-24B-Magnum-Diamond.i1-Q6_K.gguf'
srv load_model: failed to load model, 'B:\Local_ai_chat\kobold_GGUFs\MS3.2-24B-Magnum-Diamond.i1-Q6_K.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

The model works perfectly for me:

root@test:/pool16_2# CUDA_VISIBLE_DEVICES=1 /root/llama.cpp/build/bin/llama-cli -m MS3.2-24B-Magnum-Diamond.i1-Q6_K.gguf -c 2048 -ngl 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
build: 5930 (322338bf) with cc (Ubuntu 12.3.0-17ubuntu1) 12.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA A100-SXM4-40GB) - 3796 MiB free
llama_model_loader: loaded meta data with 53 key-value pairs and 363 tensors from MS3.2-24B-Magnum-Diamond.i1-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = MS3.2 24B Magnum Diamond
llama_model_loader: - kv   3:                           general.finetune str              = Magnum-Diamond
llama_model_loader: - kv   4:                           general.basename str              = MS3.2
llama_model_loader: - kv   5:                         general.size_label str              = 24B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Mistral Small 3.2 24B Instruct 2506
llama_model_loader: - kv   9:               general.base_model.0.version str              = 2506
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Mist...
llama_model_loader: - kv  12:                               general.tags arr[str,3]       = ["axolotl", "chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          llama.block_count u32              = 40
llama_model_loader: - kv  15:                       llama.context_length u32              = 131072
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 32768
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  19:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       llama.rope.freq_base f32              = 1000000000.000000
llama_model_loader: - kv  21:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  23:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  24:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  25:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,269443]  = ["_ _", "_ t", "e r", "i n", "_ ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  37:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- set today = strftime_now("%Y-%m-%...
llama_model_loader: - kv  39:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  40:               general.quantization_version u32              = 2
llama_model_loader: - kv  41:                          general.file_type u32              = 18
llama_model_loader: - kv  42:                                general.url str              = https://huggingface.co/mradermacher/M...
llama_model_loader: - kv  43:              mradermacher.quantize_version str              = 2
llama_model_loader: - kv  44:                  mradermacher.quantized_by str              = mradermacher
llama_model_loader: - kv  45:                  mradermacher.quantized_at str              = 2025-06-25T21:42:05+02:00
llama_model_loader: - kv  46:                  mradermacher.quantized_on str              = nico1
llama_model_loader: - kv  47:                         general.source.url str              = https://huggingface.co/Doctor-Shotgun...
llama_model_loader: - kv  48:                  mradermacher.convert_type str              = hf
llama_model_loader: - kv  49:                      quantize.imatrix.file str              = MS3.2-24B-Magnum-Diamond-i1-GGUF/imat...
llama_model_loader: - kv  50:                   quantize.imatrix.dataset str              = imatrix-training-full-3
llama_model_loader: - kv  51:             quantize.imatrix.entries_count u32              = 280
llama_model_loader: - kv  52:              quantize.imatrix.chunks_count u32              = 321
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q6_K:  282 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 18.01 GiB (6.56 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 32768
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 13B
print_info: model params     = 23.57 B
print_info: general.name     = MS3.2 24B Magnum Diamond
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 11 '<pad>'
print_info: LF token         = 1010 '_'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 18442.21 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.50 MiB
llama_kv_cache_unified:        CPU KV buffer size =   320.00 MiB
llama_kv_cache_unified: size =  320.00 MiB (  2048 cells,  40 layers,  1 seqs), K (f16):  160.00 MiB, V (f16):  160.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      CUDA0 compute buffer size =   791.00 MiB
llama_context:  CUDA_Host compute buffer size =    14.01 MiB
llama_context: graph nodes  = 1446
llama_context: graph splits = 444 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 128
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there</s>[INST] How are you?[/INST]

system_info: n_threads = 128 (n_threads_batch = 128) / 256 | CUDA : ARCHS = 750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 1934949381
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 2048
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

What is the meaning of life?

There is no single, universally agreed upon answer to the meaning of life. Different philosophies, religions, and individuals have proposed various perspectives:

Some believe the meaning of life is to serve, love, or be close to God or a higher power. Others see the meaning as achieving enlightenment or spiritual growth. Atheists may believe there is no inherent meaning, and that we create our own meaning through our choices and actions.

Utilitarians believe the meaning is to maximize happiness and minimize suffering. Existentialists argue that life has no inherent meaning, and we must create our own meaning and purpose. Absurdist philosophers believe life is inherently meaningless and absurd, and the best we can do is embrace that absurdity.

Some find meaning through creativity, relationships, making a positive impact, or the pursuit of knowledge and truth. Others see meaning in the simple joys and beauty in the world.

Ultimately, the meaning of life is a deeply personal question. It may help to reflect on what gives your life a sense of purpose and fulfillment. But there is no single, objective answer that applies to everyone. We each have to grapple with this question and come to our own conclusions.

I got llama.cpp to find the file by changing a space in the file path to an underscore,

Alternatively just put the path into quotes.
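For example, quoting the path on the Windows cmd line keeps any spaces intact (hypothetical path shown, assuming the llama-server binary is on your PATH):

```
llama-server -m "B:\path with spaces\MS3.2-24B-Magnum-Diamond.i1-Q6_K.gguf"
```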

but now I'm getting this error when trying to load the model on there too.
ggml_vulkan: Device memory allocation of size 9433417728 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 9433417728

That just means the model or context doesn't fit on your GPU. Use -ngl 0 to use the GPU only for prompt-processing acceleration, or -ngl n where n is the largest number of layers that fits. Also specify -c n where n is the amount of context you need. If you need a lot of context, also pass -fa to enable flash attention, so the context uses less memory without any real disadvantage. If you want to run the entire model on the GPU, which is much faster, use a smaller quant.
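As a rough sketch of why -c matters: with an f16 KV cache, memory use is 2 (K and V) * n_layer * n_embd_k_gqa * n_ctx * 2 bytes. Plugging in the values from the load log above (40 layers, n_embd_k_gqa = 1024, n_ctx = 2048):

```shell
# f16 KV cache size: 2 (K and V) * n_layer * n_embd_k_gqa * n_ctx * 2 bytes/element
# Values from the load log above: n_layer=40, n_embd_k_gqa=1024, n_ctx=2048
echo $((2 * 40 * 1024 * 2048 * 2 / 1024 / 1024))  # prints 320 (MiB), matching the log
```

The KV cache scales linearly with -c, and the compute buffer (the 9.4 GB Vulkan allocation that failed here) also grows with context, which is what -fa shrinks.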

Well, I'm currently trying to tensor-split the model across my 2 GPUs, but I can't figure out the right format for that argument. I've tried -ts 5,5 along with N0 5, N1 5, then N5, N5. None of them work, and it's confusing me why I can't figure out the right format.

No need to specify any tensor split, assuming both your GPUs have the same amount of memory; llama.cpp is quite good at distributing the layers evenly by default. Just make sure to specify -ngl 999 to offload all layers, and -c 4096 or some other reasonably sized context, since it defaults to the maximum context if nothing is specified, which will far exceed GPU memory on any consumer GPU. Also consider enabling flash attention with -fa.
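Put together, the suggested invocation would look something like this (a sketch using the model filename from earlier in the thread, not a tested command line):

```
llama-server -m MS3.2-24B-Magnum-Diamond.i1-Q6_K.gguf -ngl 999 -c 4096 -fa
```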

In the unlikely case you really need to specify a custom split:

-sm, --split-mode {none,layer,row} 	how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
(env: LLAMA_ARG_SPLIT_MODE)
-ts, --tensor-split N0,N1,N2,... 	fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)
-mg, --main-gpu INDEX 	the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0)
(env: LLAMA_ARG_MAIN_GPU)

So let's say you have one GPU with 24 GiB of memory and one with 16 GiB; you could then specify something like:
-c 4096 -ngl 999 -fa -sm row -ts 20,16 -mg 0
assuming 4 GiB is enough for intermediate results and the KV cache for your use case. But really, if possible, I would use -sm layer on Vulkan, as it is better tested and supported.
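Note that -ts takes relative proportions, not absolute GiB values; llama.cpp normalizes the list by its sum, so 20,16 and 5,4 mean the same split. The share of the model landing on the first GPU with -ts 20,16 works out to:

```shell
# -ts values are relative weights; fraction assigned to GPU 0 with -ts 20,16:
echo $((20 * 100 / (20 + 16)))  # prints 55 (percent, integer arithmetic)
```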
