Testing IQ5_K

#8
by shewin - opened

Tensor blk.61.ffn_down_exps.weight (size = 954.00 MiB) buffer type overriden to CUDA_Host

Allocating 154.28 GiB of pinned host memory, this may take a while.
Using pinned host memory improves PP performance by a significant margin.
But if it takes too long for your model and amount of patience, kill the process and run using

GGML_CUDA_NO_PINNED=1 your_command_goes_here
done allocating 154.28 GiB in 43265.6 ms

llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 157978.76 MiB
llm_load_tensors: CUDA0 buffer size = 3578.73 MiB
....................................................................................................
~ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: n_ctx = 100096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 2048
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 512
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 24242.00 MiB
llama_init_from_model: KV self size = 24242.00 MiB, K (f16): 12121.00 MiB, V (f16): 12121.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 2125.01 MiB
llama_init_from_model: CUDA_Host compute buffer size = 415.02 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
llama_init_from_model: enabling only_active_experts scheduling

main: n_kv_max = 100096, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP   | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|------|--------|----------|--------|----------|
| 2048 | 512 |    0 |  3.653 |   560.65 | 23.748 |    21.56 |
| 2048 | 512 | 2048 |  3.384 |   605.21 | 23.956 |    21.37 |
| 2048 | 512 | 4096 |  3.414 |   599.86 | 24.295 |    21.07 |
| 2048 | 512 | 6144 |  3.446 |   594.38 | 24.470 |    20.92 |
| 2048 | 512 | 8192 |  3.499 |   585.38 | 24.869 |    20.59 |
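Side note on reading these sweep-bench tables: the throughput columns are just tokens divided by wall time (S_PP = PP / T_PP, S_TG = TG / T_TG). A quick sanity check of the first row, using only numbers from the table:

```python
# Recompute the throughput columns of the first sweep-bench row above.
pp, tg = 2048, 512          # prompt and generation token counts
t_pp, t_tg = 3.653, 23.748  # wall times in seconds

s_pp = pp / t_pp            # prompt-processing tokens/s
s_tg = tg / t_tg            # token-generation tokens/s
print(f"S_PP = {s_pp:.2f} t/s, S_TG = {s_tg:.2f} t/s")
```

The tiny rounding difference versus the printed 560.65 comes from T_PP itself being rounded to three decimals.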

Top-tier open source model!

With other options:

~ggml_backend_cuda_context: have 0 graphs
llama_init_from_model: n_ctx = 130048
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 4096
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 16732.28 MiB
llama_init_from_model: KV self size = 16732.25 MiB, K (q8_0): 8366.12 MiB, V (q8_0): 8366.12 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 3222.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1064.05 MiB
llama_init_from_model: graph nodes = 2361
llama_init_from_model: graph splits = 126
llama_init_from_model: enabling only_active_experts scheduling

main: n_kv_max = 130048, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 |     0 |  3.960 |  1034.23 | 44.808 |    22.85 |
| 4096 | 1024 |  4096 |  3.999 |  1024.23 | 33.446 |    30.62 |
| 4096 | 1024 |  8192 |  4.132 |   991.26 | 35.122 |    29.16 |
| 4096 | 1024 | 12288 |  4.250 |   963.68 | 35.847 |    28.57 |
| 4096 | 1024 | 16384 |  4.384 |   934.34 | 36.511 |    28.05 |

Download in progress...

I always love seeing these reports! Thanks as usual @shewin

Hi ubergarm, and a huge thanks for providing this IQ5_K quantization of MiniMax M2.7!

I wanted to share my feedback using the latest version of ik_llama.cpp. The model runs perfectly in our environment. Here is the exact command I'm using:

~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf --alias MiniMax-M2.7-IQ5_K --host 0.0.0.0 --port 8080 --ctx-size 163840 --no-mmap --threads 32 --threads-batch 64 --batch-size 2048 --ubatch-size 2048 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph --fit --fit-margin 4500 -gr -ger --cache-type-v q8_0 --cache-type-k q8_0 --jinja --cache-ram 32768 -muge --merge-qtv

The performance we are getting is excellent:

  • Generation: 17 to 20 tokens/sec
  • Prompt processing: 250 to 350 tokens/sec

Detailed Memory Breakdown (8-GPU Setup)
For anyone curious about how this behaves on my multi-GPU architecture, here is the detailed memory allocation breakdown from our tests:

| Component        | Size                      | Location                                    |
|------------------|---------------------------|---------------------------------------------|
| Model tensors    | 135,789 MiB               | CUDA_Split (distributed across GPUs)        |
| MoE experts      | ~156,000 MiB (estimated)  | CUDA_Host (pinned RAM)                      |
| KV cache (q8_0)  | 21,080 MiB                | Distributed across 8 GPUs (2,635 MiB/GPU)   |
| Compute buffers  | ~10,209 MiB               | 8 GPUs (1,012 - 2,478 MiB/GPU)              |
| Pinned memory    | 35.3 GiB allocated        | CUDA_Host                                   |
| ggml context     | 39.3 MiB                  | CUDA_Host                                   |
| Output layer     | 623 MiB                   | GPU 7 only                                  |

Key Architecture Observations:

  • Experts on CPU: Since the 256 experts × 62 layers don't fit entirely in VRAM, the --fit flag automatically offloads them to CUDA_Host (RAM). This is expected and handled perfectly.
  • GPU 7 Allocation: It receives the output layer (623 MiB) plus a larger share of the compute buffer (1,599 MiB).
  • GPU 4 Overhead: Shows about ~300 MiB less available VRAM due to the X11 display.
  • CUDA_Split: Dense layers are successfully distributed across all GPUs using NVIDIA unified memory.
  • KV Cache: Evenly distributed, sitting at ~2.6 GiB per GPU for a 163K token context using q8_0.
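The ~2.6 GiB/GPU KV figure follows directly from the model metadata quoted in the logs further down this thread (62 layers, 8 KV heads of head_dim 128, so n_embd_k_gqa = 1024) and q8_0's 34-bytes-per-32-elements layout. A back-of-envelope sketch:

```python
# KV-cache size for the setup above; all constants come from the
# thread's logs (block_count, head_count_kv, key/value_length).
n_layer   = 62
n_embd_kv = 8 * 128        # n_embd_k_gqa = n_embd_v_gqa = 1024
n_ctx     = 163840
q8_0_bpe  = 34 / 32        # q8_0 packs 32 elements into 34 bytes

k_mib       = n_layer * n_ctx * n_embd_kv * q8_0_bpe / 2**20
total_mib   = 2 * k_mib    # K + V
per_gpu_mib = total_mib / 8
print(f"KV total {total_mib:.0f} MiB, {per_gpu_mib:.0f} MiB per GPU")
```

The same formula with 2 bytes per element reproduces the f16 numbers in the first log above (12,121 MiB per K/V side at n_ctx = 100096).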

My initial testing sessions using OpenCode have been incredibly pleasant and conclusive. I really can't wait for my dev team to dive into more extensive testing with this setup.

If anyone has ideas or parameter tweaks to further optimize this command, I am all ears!

Thanks again to ubergarm for the outstanding work!


@martossien

iirc you have 8x 3090 GPUs right? So 192 GB VRAM, and MiniMax-M2.7 IQ5_K is 157.771 GiB (5.926 BPW), so you should be able to do full offload and get much faster speeds, right?

Give this a try; you also had a typo, `--merge-qtv` should be `--merge-qkv`:

~/ik_llama.cpp/build/bin/llama-server \
  --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf \
  --alias MiniMax-M2.7-IQ5_K \
  -muge --merge-qkv \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 163840 \
  --no-mmap \
  --threads 1 \
  --batch-size 2048 --ubatch-size 2048 \
  --parallel 1 \
  --flash-attn 1 \
  --n-gpu-layers 999 \
  --split-mode graph \
  --fit off \
  -gr \
  --cache-type-v q8_0 --cache-type-k q8_0 \
  --jinja \
  --cache-ram 32768 \
  --prompt-cache-all \
  --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12

Let me know if you're unable to do full offload; maybe we can tweak a few things, but I'm pretty sure it should fit.

Also, if you can give this a try and report back on GitHub, that would be nice: https://github.com/ikawrakow/ik_llama.cpp/pull/1644

--fit off -> error (I compiled yesterday)

Without --fit off:
/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf --alias MiniMax-M2.7-IQ5_K -muge --merge-qkv --host 0.0.0.0 --port 8080 --ctx-size 163840 --no-mmap --threads 1 --batch-size 2048 --ubatch-size 2048 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph -gr --cache-type-v q8_0 --cache-type-k q8_0 --jinja --cache-ram 32768 --prompt-cache-all --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12
INFO [ main] build info | tid="140485431283712" timestamp=1776445674 build=4419 commit="eaf83865"
INFO [ main] system info | tid="140485431283712" timestamp=1776445674 n_threads=1 n_threads_batch=-1 total_threads=64 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
=============================== NCCL main communicator initialized
CUDA0: using device CUDA0 - 23728 MiB free
CUDA1: using device CUDA1 - 23738 MiB free
CUDA2: using device CUDA2 - 23739 MiB free
CUDA3: using device CUDA3 - 23738 MiB free
CUDA4: using device CUDA4 - 23285 MiB free
CUDA5: using device CUDA5 - 23738 MiB free
CUDA6: using device CUDA6 - 23738 MiB free
CUDA7: using device CUDA7 - 23739 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 44 key-value pairs and 809 tensors from /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minimax-m2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 40
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = MiniMax M2.7
llama_model_loader: - kv 6: general.size_label str = 256x4.9B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = modified-mit
llama_model_loader: - kv 9: general.license.link str = https://github.com/MiniMax-AI/MiniMax...
llama_model_loader: - kv 10: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 11: minimax-m2.block_count u32 = 62
llama_model_loader: - kv 12: minimax-m2.context_length u32 = 196608
llama_model_loader: - kv 13: minimax-m2.embedding_length u32 = 3072
llama_model_loader: - kv 14: minimax-m2.feed_forward_length u32 = 1536
llama_model_loader: - kv 15: minimax-m2.attention.head_count u32 = 48
llama_model_loader: - kv 16: minimax-m2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: minimax-m2.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 18: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: minimax-m2.expert_count u32 = 256
llama_model_loader: - kv 20: minimax-m2.expert_used_count u32 = 8
llama_model_loader: - kv 21: minimax-m2.expert_gating_func u32 = 2
llama_model_loader: - kv 22: minimax-m2.attention.key_length u32 = 128
llama_model_loader: - kv 23: minimax-m2.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 141
llama_model_loader: - kv 25: minimax-m2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 26: minimax-m2.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = minimax-m2
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 200034
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 200020
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 200021
llama_model_loader: - kv 36: tokenizer.chat_template str = {# ----------‑‑‑ special token ...
llama_model_loader: - kv 37: quantize.imatrix.file str = /mnt/data/models/ubergarm/MiniMax-M2....
llama_model_loader: - kv 38: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 497
llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 796
llama_model_loader: - kv 41: split.no u16 = 0
llama_model_loader: - kv 42: split.count u16 = 5
llama_model_loader: - kv 43: split.tensors.count i32 = 809
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q8_0: 250 tensors
llama_model_loader: - type iq5_k: 124 tensors
llama_model_loader: - type iq6_k: 62 tensors
load: 0 unused tokens
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 200004 ('')
load: - 200005 ('')
load: - 200020 ('[e
[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minimax-m2
llm_load_print_meta: n_ctx_train = 196608
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 62
llm_load_print_meta: n_head = 48
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 196608
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_n_group = 0
llm_load_print_meta: model type = 230B.A10B
llm_load_print_meta: model ftype = IQ5_K - 5.5 bpw
llm_load_print_meta: model params = 228.690 B
llm_load_print_meta: model size = 157.771 GiB (5.926 BPW)
llm_load_print_meta: repeating layers = 156.555 GiB (5.912 BPW, 227.461 B parameters)
llm_load_print_meta: general.name = MiniMax M2.7
print_info: vocab type = BPE
print_info: n_vocab = 200064
print_info: n_merges = 199744
print_info: BOS token = 200034 ']!b['
print_info: EOS token = 200020 '[e
['
print_info: UNK token = 200021 ']!d['
print_info: LF token = 10 'Ċ'
print_info: FIM PRE token = 200001 ''
print_info: FIM SUF token = 200003 ''
print_info: FIM MID token = 200002 ''
print_info: FIM PAD token = 200004 ''
print_info: FIM REP token = 200005 ''
print_info: EOG token = 200004 ''
print_info: EOG token = 200005 ''
print_info: EOG token = 200020 '[e
['
print_info: max token length = 256
======================================= HAVE_FANCY_SIMD is NOT defined
------------------- Layer sizes:
Layer 0: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 1: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 2: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 3: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 4: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 5: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 6: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 7: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 8: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 9: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 10: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 11: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 12: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 13: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 14: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 15: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 16: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 17: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 18: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 19: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 20: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 21: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 22: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 23: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 24: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 25: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 26: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 27: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 28: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 29: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 30: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 31: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 32: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 33: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 34: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 35: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 36: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 37: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 38: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 39: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 40: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 41: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 42: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 43: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 44: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 45: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 46: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 47: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 48: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 49: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 50: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 51: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 52: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 53: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 54: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 55: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 56: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 57: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 58: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 59: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 60: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 61: 2585.68, 340.00, 2925.68 432.00 MiB
Layer 62: 622.76, 1131.00, 1753.76 MiB (output layer)

Total : 160311.96, 22211.00, 182522.96 MiB
Memory required for model tensors + cache: 183146 MiB
Memory available on all devices - compute: 177799 MiB
llm_load_tensors: ggml ctx size = 43.21 MiB

========================================================
merge_qkv is not compatible with split mode 'graph'
=> turning off merge_qkv

merge_up_gate_exps: merging up/gate in layer 0
merge_up_gate_exps: merging up/gate in layer 1
merge_up_gate_exps: merging up/gate in layer 2
merge_up_gate_exps: merging up/gate in layer 3
merge_up_gate_exps: merging up/gate in layer 4
merge_up_gate_exps: merging up/gate in layer 5
merge_up_gate_exps: merging up/gate in layer 6
merge_up_gate_exps: merging up/gate in layer 7
merge_up_gate_exps: merging up/gate in layer 8
merge_up_gate_exps: merging up/gate in layer 9
merge_up_gate_exps: merging up/gate in layer 10
merge_up_gate_exps: merging up/gate in layer 11
merge_up_gate_exps: merging up/gate in layer 12
merge_up_gate_exps: merging up/gate in layer 13
merge_up_gate_exps: merging up/gate in layer 14
merge_up_gate_exps: merging up/gate in layer 15
merge_up_gate_exps: merging up/gate in layer 16
merge_up_gate_exps: merging up/gate in layer 17
merge_up_gate_exps: merging up/gate in layer 18
merge_up_gate_exps: merging up/gate in layer 19
merge_up_gate_exps: merging up/gate in layer 20
merge_up_gate_exps: merging up/gate in layer 21
merge_up_gate_exps: merging up/gate in layer 22
merge_up_gate_exps: merging up/gate in layer 23
merge_up_gate_exps: merging up/gate in layer 24
merge_up_gate_exps: merging up/gate in layer 25
merge_up_gate_exps: merging up/gate in layer 26
merge_up_gate_exps: merging up/gate in layer 27
merge_up_gate_exps: merging up/gate in layer 28
merge_up_gate_exps: merging up/gate in layer 29
merge_up_gate_exps: merging up/gate in layer 30
merge_up_gate_exps: merging up/gate in layer 31
merge_up_gate_exps: merging up/gate in layer 32
merge_up_gate_exps: merging up/gate in layer 33
merge_up_gate_exps: merging up/gate in layer 34
merge_up_gate_exps: merging up/gate in layer 35
merge_up_gate_exps: merging up/gate in layer 36
merge_up_gate_exps: merging up/gate in layer 37
merge_up_gate_exps: merging up/gate in layer 38
merge_up_gate_exps: merging up/gate in layer 39
merge_up_gate_exps: merging up/gate in layer 40
merge_up_gate_exps: merging up/gate in layer 41
merge_up_gate_exps: merging up/gate in layer 42
merge_up_gate_exps: merging up/gate in layer 43
merge_up_gate_exps: merging up/gate in layer 44
merge_up_gate_exps: merging up/gate in layer 45
merge_up_gate_exps: merging up/gate in layer 46
merge_up_gate_exps: merging up/gate in layer 47
merge_up_gate_exps: merging up/gate in layer 48
merge_up_gate_exps: merging up/gate in layer 49
merge_up_gate_exps: merging up/gate in layer 50
merge_up_gate_exps: merging up/gate in layer 51
merge_up_gate_exps: merging up/gate in layer 52
merge_up_gate_exps: merging up/gate in layer 53
merge_up_gate_exps: merging up/gate in layer 54
merge_up_gate_exps: merging up/gate in layer 55
merge_up_gate_exps: merging up/gate in layer 56
merge_up_gate_exps: merging up/gate in layer 57
merge_up_gate_exps: merging up/gate in layer 58
merge_up_gate_exps: merging up/gate in layer 59
merge_up_gate_exps: merging up/gate in layer 60
merge_up_gate_exps: merging up/gate in layer 61
================================ max_gpu = 0
Estimated model buffer size per device:
Device 0: 21206.29 MiB
Device 1: 21626.51 MiB
Device 2: 21626.51 MiB
Device 3: 21601.38 MiB
Device 4: 20360.29 MiB
Device 5: 21601.38 MiB
Device 6: 21626.51 MiB
Device 7: 21626.51 MiB
No tensors in buffer type CUDA0
No tensors in buffer type CUDA1
No tensors in buffer type CUDA2
No tensors in buffer type CUDA3
No tensors in buffer type CUDA4
No tensors in buffer type CUDA5
No tensors in buffer type CUDA6
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 622.76 MiB
llm_load_tensors: CUDA_Split buffer size = 171276.05 MiB
llm_load_tensors: CUDA7 buffer size = 622.77 MiB
...................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
llama_init_from_model: n_ctx = 163840
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 2048
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->5
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->6
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 5->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 6->3
CUDA error: out of memory
current device: 7, in function ggml_backend_cuda_split_buffer_init_tensor at /home/admin_ia/ik_llama.cpp/ggml/src/ggml-cuda.cu:840
ggml_cuda_device_malloc((void**)&buf, padded_size, i)
/home/admin_ia/ik_llama.cpp/ggml/src/ggml-cuda.cu:132: CUDA error
[New LWP 73364]
[New LWP 73363]
[New LWP 73362]
[New LWP 73361]
[New LWP 73360]
[New LWP 73359]
[New LWP 73358]
[New LWP 73357]
[New LWP 73356]
[New LWP 73355]
[New LWP 73354]
[New LWP 73353]
[New LWP 73352]
[New LWP 73351]
[New LWP 73350]
[New LWP 73349]
[New LWP 73348]
[New LWP 73339]
[New LWP 73338]
[New LWP 73337]
[New LWP 73336]
[New LWP 73335]
[New LWP 73334]
[New LWP 73333]
[New LWP 73332]
[New LWP 73331]
[New LWP 73330]
[New LWP 73329]
[New LWP 73328]
[New LWP 73327]
[New LWP 73326]
[New LWP 73325]
[New LWP 73324]
[New LWP 73298]

This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.fedoraproject.org/
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fc540889422 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007fc540889422 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007fc54087d71c in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007fc54087d764 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007fc5408edc0f in wait4 () from /lib64/libc.so.6
#4 0x00007fc540efebc1 in ggml_abort () from /home/admin_ia/ik_llama.cpp/build/ggml/src/libggml.so
#5 0x00007fc54111db83 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/admin_ia/ik_llama.cpp/build/ggml/src/libggml.so
#6 0x00007fc541123e41 in ggml_backend_cuda_split_buffer_init_tensor(ggml_backend_buffer*, ggml_tensor*) () from /home/admin_ia/ik_llama.cpp/build/ggml/src/libggml.so
#7 0x00007fc540f5fd6c in alloc_tensor_range () from /home/admin_ia/ik_llama.cpp/build/ggml/src/libggml.so
#8 0x00007fc540f616b4 in ggml_backend_alloc_ctx_tensors_from_buft () from /home/admin_ia/ik_llama.cpp/build/ggml/src/libggml.so
#9 0x00007fc55046ff00 in llama_kv_cache_init(llama_kv_cache&, llama_context const*, ggml_type, ggml_type, unsigned int, bool, ggml_type, ggml_type, ggml_type, ggml_type, int, int, int, int) () from /home/admin_ia/ik_llama.cpp/build/src/libllama.so
#10 0x00007fc5504728b3 in llama_init_from_model () from /home/admin_ia/ik_llama.cpp/build/src/libllama.so
#11 0x000000000067a8b6 in llama_init_from_gpt_params(gpt_params&) ()
#12 0x000000000059394f in server_context::load_model(gpt_params const&) ()
#13 0x00000000004a9283 in main ()
[Inferior 1 (process 73297) detached]

Works with --ctx-size 1024: OK (2048: not OK)
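The OOM above is consistent with the fit summary printed during loading: at the full 163,840-token context the loader reports more memory required than is available. Using only the two numbers from that log, plus the per-token KV cost derived from the model metadata:

```python
# Headroom check from the load log above:
#   "Memory required for model tensors + cache: 183146 MiB"
#   "Memory available on all devices - compute: 177799 MiB"
required_mib  = 183146
available_mib = 177799
deficit_mib = required_mib - available_mib
print(f"Short by {deficit_mib} MiB (~{deficit_mib / 1024:.1f} GiB)")

# Each context token costs 62 layers * (1024 K + 1024 V) elements at
# q8_0 (34 bytes per 32 elements), i.e. per-token KV cost in MiB:
per_token_mib = 62 * 1024 * 2 * (34 / 32) / 2**20
print(f"KV cost per context token: {per_token_mib:.3f} MiB")
```

So shrinking the context frees roughly 0.13 MiB per token, far more than the ~5.2 GiB total deficit even at --ctx-size 2048; the fact that 2048 still failed suggests the limiter is per-device headroom (compute buffers and the uneven split, with the OOM hitting device 7) rather than total capacity.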

/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf --alias MiniMax-M2.7-IQ5_K -muge --merge-qkv --host 0.0.0.0 --port 8080 --ctx-size 1024 --no-mmap --threads 1 --batch-size 2048 --ubatch-size 2048 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph -gr --cache-type-v q8_0 --cache-type-k q8_0 --jinja --cache-ram 32768 --prompt-cache-all --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12
INFO [ main] build info | tid="140336959574016" timestamp=1776446930 build=4419 commit="eaf83865"
INFO [ main] system info | tid="140336959574016" timestamp=1776446930 n_threads=1 n_threads_batch=-1 total_threads=64 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
=============================== NCCL main communicator initialized
CUDA0: using device CUDA0 - 23739 MiB free
CUDA1: using device CUDA1 - 23738 MiB free
CUDA2: using device CUDA2 - 23739 MiB free
CUDA3: using device CUDA3 - 23738 MiB free
CUDA4: using device CUDA4 - 23343 MiB free
CUDA5: using device CUDA5 - 23738 MiB free
CUDA6: using device CUDA6 - 23738 MiB free
CUDA7: using device CUDA7 - 23739 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 44 key-value pairs and 809 tensors from /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minimax-m2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 40
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = MiniMax M2.7
llama_model_loader: - kv 6: general.size_label str = 256x4.9B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = modified-mit
llama_model_loader: - kv 9: general.license.link str = https://github.com/MiniMax-AI/MiniMax...
llama_model_loader: - kv 10: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 11: minimax-m2.block_count u32 = 62
llama_model_loader: - kv 12: minimax-m2.context_length u32 = 196608
llama_model_loader: - kv 13: minimax-m2.embedding_length u32 = 3072
llama_model_loader: - kv 14: minimax-m2.feed_forward_length u32 = 1536
llama_model_loader: - kv 15: minimax-m2.attention.head_count u32 = 48
llama_model_loader: - kv 16: minimax-m2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: minimax-m2.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 18: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: minimax-m2.expert_count u32 = 256
llama_model_loader: - kv 20: minimax-m2.expert_used_count u32 = 8
llama_model_loader: - kv 21: minimax-m2.expert_gating_func u32 = 2
llama_model_loader: - kv 22: minimax-m2.attention.key_length u32 = 128
llama_model_loader: - kv 23: minimax-m2.attention.value_length u32 = 128
llama_model_loader: - kv 24: general.file_type u32 = 141
llama_model_loader: - kv 25: minimax-m2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 26: minimax-m2.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = minimax-m2
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 200034
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 200020
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 200021
llama_model_loader: - kv 36: tokenizer.chat_template str = {# ----------‑‑‑ special token ...
llama_model_loader: - kv 37: quantize.imatrix.file str = /mnt/data/models/ubergarm/MiniMax-M2....
llama_model_loader: - kv 38: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 497
llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 796
llama_model_loader: - kv 41: split.no u16 = 0
llama_model_loader: - kv 42: split.count u16 = 5
llama_model_loader: - kv 43: split.tensors.count i32 = 809
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q8_0: 250 tensors
llama_model_loader: - type iq5_k: 124 tensors
llama_model_loader: - type iq6_k: 62 tensors
load: 0 unused tokens
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 200004 ('')
load: - 200005 ('')
load: - 200020 ('[e
[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = minimax-m2
llm_load_print_meta: n_ctx_train = 196608
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 62
llm_load_print_meta: n_head = 48
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 1536
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 196608
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_n_group = 0
llm_load_print_meta: model type = 230B.A10B
llm_load_print_meta: model ftype = IQ5_K - 5.5 bpw
llm_load_print_meta: model params = 228.690 B
llm_load_print_meta: model size = 157.771 GiB (5.926 BPW)
llm_load_print_meta: repeating layers = 156.555 GiB (5.912 BPW, 227.461 B parameters)
llm_load_print_meta: general.name = MiniMax M2.7
print_info: vocab type = BPE
print_info: n_vocab = 200064
print_info: n_merges = 199744
print_info: BOS token = 200034 ']!b['
print_info: EOS token = 200020 '[e
['
print_info: UNK token = 200021 ']!d['
print_info: LF token = 10 'Ċ'
print_info: FIM PRE token = 200001 ''
print_info: FIM SUF token = 200003 ''
print_info: FIM MID token = 200002 ''
print_info: FIM PAD token = 200004 ''
print_info: FIM REP token = 200005 ''
print_info: EOG token = 200004 ''
print_info: EOG token = 200005 ''
print_info: EOG token = 200020 '[e
['
print_info: max token length = 256
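As a quick sanity check on the quant density reported above, the 5.926 BPW figure follows directly from the logged model size and parameter count. A minimal sketch, using only numbers copied from this log:

```python
# Sanity-check the reported bits-per-weight from the loader log above.
# 157.771 GiB model size and 228.690 B parameters are copied from the log.
size_gib = 157.771
n_params = 228.690e9

size_bits = size_gib * 2**30 * 8  # GiB -> bytes -> bits
bpw = size_bits / n_params

print(f"{bpw:.3f} BPW")  # matches the logged 5.926 BPW
```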
======================================= HAVE_FANCY_SIMD is NOT defined
------------------- Layer sizes:
Layer 0: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 1: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 2: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 3: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 4: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 5: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 6: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 7: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 8: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 9: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 10: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 11: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 12: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 13: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 14: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 15: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 16: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 17: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 18: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 19: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 20: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 21: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 22: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 23: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 24: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 25: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 26: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 27: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 28: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 29: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 30: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 31: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 32: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 33: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 34: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 35: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 36: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 37: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 38: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 39: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 40: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 41: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 42: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 43: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 44: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 45: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 46: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 47: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 48: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 49: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 50: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 51: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 52: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 53: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 54: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 55: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 56: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 57: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 58: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 59: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 60: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 61: 2585.68, 2.12, 2587.80 216.00 MiB
Layer 62: 622.76, 565.50, 1188.26 MiB (output layer)

Total : 160311.96, 697.25, 161009.21 MiB
Memory required for model tensors + cache: 161632 MiB
Memory available on all devices - compute: 179596 MiB
Setting default device in layer 0 to 0
Setting default device in layer 1 to 0
Setting default device in layer 2 to 0
Setting default device in layer 3 to 0
Setting default device in layer 4 to 0
Setting default device in layer 5 to 0
Setting default device in layer 6 to 0
Setting default device in layer 7 to 0
Setting default device in layer 8 to 1
Setting default device in layer 9 to 1
Setting default device in layer 10 to 1
Setting default device in layer 11 to 1
Setting default device in layer 12 to 1
Setting default device in layer 13 to 1
Setting default device in layer 14 to 1
Setting default device in layer 15 to 1
Setting default device in layer 16 to 2
Setting default device in layer 17 to 2
Setting default device in layer 18 to 2
Setting default device in layer 19 to 2
Setting default device in layer 20 to 2
Setting default device in layer 21 to 2
Setting default device in layer 22 to 2
Setting default device in layer 23 to 3
Setting default device in layer 24 to 3
Setting default device in layer 25 to 3
Setting default device in layer 26 to 3
Setting default device in layer 27 to 3
Setting default device in layer 28 to 3
Setting default device in layer 29 to 3
Setting default device in layer 30 to 3
Setting default device in layer 31 to 4
Setting default device in layer 32 to 4
Setting default device in layer 33 to 4
Setting default device in layer 34 to 4
Setting default device in layer 35 to 4
Setting default device in layer 36 to 4
Setting default device in layer 37 to 4
Setting default device in layer 38 to 4
Setting default device in layer 39 to 5
Setting default device in layer 40 to 5
Setting default device in layer 41 to 5
Setting default device in layer 42 to 5
Setting default device in layer 43 to 5
Setting default device in layer 44 to 5
Setting default device in layer 45 to 5
Setting default device in layer 46 to 5
Setting default device in layer 47 to 6
Setting default device in layer 48 to 6
Setting default device in layer 49 to 6
Setting default device in layer 50 to 6
Setting default device in layer 51 to 6
Setting default device in layer 52 to 6
Setting default device in layer 53 to 6
Setting default device in layer 54 to 6
Setting default device in layer 55 to 7
Setting default device in layer 56 to 7
Setting default device in layer 57 to 7
Setting default device in layer 58 to 7
Setting default device in layer 59 to 7
Setting default device in layer 60 to 7
Setting default device in layer 61 to 7
Setting default device in layer 62 to 7
llm_load_tensors: ggml ctx size = 43.21 MiB

========================================================
merge_qkv is not compatible with split mode 'graph'
=> turning off merge_qkv

merge_up_gate_exps: merging up/gate in layer 0
merge_up_gate_exps: merging up/gate in layer 1
merge_up_gate_exps: merging up/gate in layer 2
merge_up_gate_exps: merging up/gate in layer 3
merge_up_gate_exps: merging up/gate in layer 4
merge_up_gate_exps: merging up/gate in layer 5
merge_up_gate_exps: merging up/gate in layer 6
merge_up_gate_exps: merging up/gate in layer 7
merge_up_gate_exps: merging up/gate in layer 8
merge_up_gate_exps: merging up/gate in layer 9
merge_up_gate_exps: merging up/gate in layer 10
merge_up_gate_exps: merging up/gate in layer 11
merge_up_gate_exps: merging up/gate in layer 12
merge_up_gate_exps: merging up/gate in layer 13
merge_up_gate_exps: merging up/gate in layer 14
merge_up_gate_exps: merging up/gate in layer 15
merge_up_gate_exps: merging up/gate in layer 16
merge_up_gate_exps: merging up/gate in layer 17
merge_up_gate_exps: merging up/gate in layer 18
merge_up_gate_exps: merging up/gate in layer 19
merge_up_gate_exps: merging up/gate in layer 20
merge_up_gate_exps: merging up/gate in layer 21
merge_up_gate_exps: merging up/gate in layer 22
merge_up_gate_exps: merging up/gate in layer 23
merge_up_gate_exps: merging up/gate in layer 24
merge_up_gate_exps: merging up/gate in layer 25
merge_up_gate_exps: merging up/gate in layer 26
merge_up_gate_exps: merging up/gate in layer 27
merge_up_gate_exps: merging up/gate in layer 28
merge_up_gate_exps: merging up/gate in layer 29
merge_up_gate_exps: merging up/gate in layer 30
merge_up_gate_exps: merging up/gate in layer 31
merge_up_gate_exps: merging up/gate in layer 32
merge_up_gate_exps: merging up/gate in layer 33
merge_up_gate_exps: merging up/gate in layer 34
merge_up_gate_exps: merging up/gate in layer 35
merge_up_gate_exps: merging up/gate in layer 36
merge_up_gate_exps: merging up/gate in layer 37
merge_up_gate_exps: merging up/gate in layer 38
merge_up_gate_exps: merging up/gate in layer 39
merge_up_gate_exps: merging up/gate in layer 40
merge_up_gate_exps: merging up/gate in layer 41
merge_up_gate_exps: merging up/gate in layer 42
merge_up_gate_exps: merging up/gate in layer 43
merge_up_gate_exps: merging up/gate in layer 44
merge_up_gate_exps: merging up/gate in layer 45
merge_up_gate_exps: merging up/gate in layer 46
merge_up_gate_exps: merging up/gate in layer 47
merge_up_gate_exps: merging up/gate in layer 48
merge_up_gate_exps: merging up/gate in layer 49
merge_up_gate_exps: merging up/gate in layer 50
merge_up_gate_exps: merging up/gate in layer 51
merge_up_gate_exps: merging up/gate in layer 52
merge_up_gate_exps: merging up/gate in layer 53
merge_up_gate_exps: merging up/gate in layer 54
merge_up_gate_exps: merging up/gate in layer 55
merge_up_gate_exps: merging up/gate in layer 56
merge_up_gate_exps: merging up/gate in layer 57
merge_up_gate_exps: merging up/gate in layer 58
merge_up_gate_exps: merging up/gate in layer 59
merge_up_gate_exps: merging up/gate in layer 60
merge_up_gate_exps: merging up/gate in layer 61
================================ max_gpu = 0
Estimated model buffer size per device:
Device 0: 21626.51 MiB
Device 1: 21601.38 MiB
Device 2: 21626.51 MiB
Device 3: 21206.29 MiB
Device 4: 20783.29 MiB
Device 5: 21601.38 MiB
Device 6: 21203.51 MiB
Device 7: 21626.51 MiB
No tensors in buffer type CUDA0
No tensors in buffer type CUDA1
No tensors in buffer type CUDA2
No tensors in buffer type CUDA3
No tensors in buffer type CUDA4
No tensors in buffer type CUDA5
No tensors in buffer type CUDA6
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 622.76 MiB
llm_load_tensors: CUDA_Split buffer size = 171276.05 MiB
llm_load_tensors: CUDA7 buffer size = 622.77 MiB
...................................................................................................ggml_backend_cuda_context: have 0 graphs
.
llama_init_from_model: n_ctx = 1024
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 5000000.0
llama_init_from_model: freq_scale = 1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->5
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->6
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 5->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 6->3
llama_kv_cache_init: CUDA_Split KV buffer size = 131.95 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 16.4688 MiB
Device 1: 16.2031 MiB
Device 2: 16.4688 MiB
Device 3: 16.7344 MiB
Device 4: 16.7344 MiB
Device 5: 16.2031 MiB
Device 6: 16.4688 MiB
Device 7: 16.4688 MiB
llama_init_from_model: KV self size = 131.75 MiB, K (q8_0): 65.88 MiB, V (q8_0): 65.88 MiB
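The 131.75 MiB q8_0 KV figure checks out against the shapes printed earlier (62 layers, n_embd_k_gqa = n_embd_v_gqa = 1024, n_ctx = 1024; a q8_0 block stores 32 values in 34 bytes). A quick sketch, using only values from this log:

```python
# Reproduce the KV cache sizes reported above from the logged shapes.
n_layer = 62
n_embd_kv = 1024   # n_embd_k_gqa == n_embd_v_gqa in the log
n_ctx = 1024
q8_0_bytes_per_elem = 34 / 32  # q8_0 block: 32 int8 values + one f16 scale

per_side = n_layer * n_embd_kv * n_ctx * q8_0_bytes_per_elem / 2**20
print(f"K: {per_side:.2f} MiB, total: {2 * per_side:.2f} MiB")
# -> K: 65.88 MiB, total: 131.75 MiB, matching the log
```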
llama_init_from_model: CUDA_Host output buffer size = 0.76 MiB
llama_init_from_model: CUDA0 compute buffer size = 211.00 MiB
llama_init_from_model: CUDA1 compute buffer size = 218.00 MiB
llama_init_from_model: CUDA2 compute buffer size = 212.00 MiB
llama_init_from_model: CUDA3 compute buffer size = 208.00 MiB
llama_init_from_model: CUDA4 compute buffer size = 220.00 MiB
llama_init_from_model: CUDA5 compute buffer size = 218.00 MiB
llama_init_from_model: CUDA6 compute buffer size = 220.00 MiB
llama_init_from_model: CUDA7 compute buffer size = 793.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 14.01 MiB
llama_init_from_model: graph nodes = 17946
llama_init_from_model: graph splits = 991
llama_init_from_model: enabling only_active_experts scheduling
INFO [ init] initializing slots | tid="140336959574016" timestamp=1776447003 n_slots=1
INFO [ init] new slot | tid="140336959574016" timestamp=1776447003 id_slot=0 n_ctx_slot=1024
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: , end:
use --reasoning-tokens none to disable.
slot init: id 0 | task -1 | speculative decoding context initialized
prompt cache is enabled, size limit: 32768 MiB
use --cache-ram 0 to disable the prompt cache
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
init: chat template, example_format: ']
!b[]b]system
You are a helpful assistant[e
[
]b]user
Hello[e
[
]b]ai
Hi there[e
[
]b]user
How are you?[e
[
]~b]ai

'
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
INFO [ main] model loaded | tid="140336959574016" timestamp=1776447004
srv init: init: chat template, thinking = 1
INFO [ main] HTTP server listening | tid="140336959574016" timestamp=1776447004 n_threads_http="63" port="8080" hostname="0.0.0.0"
INFO [ slots_idle] all slots are idle | tid="140336959574016" timestamp=1776447004

Ahh right, with -fit off you might have to pass eight values, -ts 1,.95,.9,... kind of thing. Or maybe leave -fit on and see if it just does the right thing?
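The -ts values don't have to be hand-typed guesses; one way to derive them is to normalize whatever per-device budget you're targeting. A rough sketch (the budgets below are just the estimated per-device buffer sizes from the log, used for illustration; llama.cpp normalizes the ratios internally anyway):

```python
# Hypothetical helper: turn per-device memory budgets (MiB) into a
# -ts (tensor split) argument. Budgets taken from the log's
# "Estimated model buffer size per device" section, for illustration.
budgets_mib = [21626, 21601, 21626, 21206, 20783, 21601, 21203, 21626]

top = max(budgets_mib)
ts = ",".join(f"{b / top:.2f}" for b in budgets_mib)
print(f"-ts {ts}")
```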

Looking more closely through the crossed-out stuff, I see it is fitting at first, but then it OOMs, likely due to CUDA buffers:

CUDA error: out of memory
current device: 7, in function ggml_backend_cuda_split_buffer_init_tensor at /home/admin_ia/ik_llama.cpp/ggml/src/ggml-cuda.cu:840

So I'd recommend removing the -ub 2048 -b 2048 flags completely and leaving the batch sizes at their defaults, e.g. -ub 512 -b 2048; then you might be able to add some more context length.

Though I suppose it could be just a little too big for full GPU offload, given the additional space needed for buffers across so many cards. In that case adding --n-cpu-moe 8 or something might free enough VRAM for the kv-cache.
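For a rough sense of how much VRAM --n-cpu-moe 8 would free: the layer-size table in the log puts each repeating layer at about 2585.68 MiB, almost all of it expert tensors, so keeping 8 layers' experts on the CPU gives back roughly:

```python
# Rough estimate of VRAM freed by --n-cpu-moe 8, assuming each offloaded
# layer's expert tensors are close to the ~2585.68 MiB per-layer size
# printed in the layer table above (an approximation: the non-expert
# part of each layer is only a couple of MiB).
mib_per_layer = 2585.68
n_cpu_moe = 8

freed_gib = n_cpu_moe * mib_per_layer / 1024
print(f"~{freed_gib:.1f} GiB freed")  # roughly 20.2 GiB
```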

Have fun playing with it and let us know where you end up! Cheers!

With the change to -ub 512 -b 2048, it works with 16K of context.
Then I added --n-cpu-moe 8:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf --alias MiniMax-M2.7-IQ5_K -muge --merge-qkv --host 0.0.0.0 --port 8080 --ctx-size 163840 --no-mmap --threads 1 -ub 512 -b 2048 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph -gr --cache-type-v q8_0 --cache-type-k q8_0 --jinja --cache-ram 32768 --prompt-cache-all --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12 --n-cpu-moe 8
It works, but very, very slowly (8 tok/s; pp 30 tok/s).
I tried a lot of things; afterwards I used:
~/ik_llama.cpp/build/bin/llama-server --model /home/admin_ia/.cache/lm-studio/models/ubergarm/MiniMax-M2.7-GGUF/MiniMax-M2.7-IQ5_K-00001-of-00005.gguf --alias MiniMax-M2.7-IQ5_K --host 0.0.0.0 --port 8080 --ctx-size 163840 --no-mmap --threads 32 --threads-batch 64 -ub 512 -b 2048 --parallel 1 --flash-attn 1 --n-gpu-layers 999 --split-mode graph --fit --fit-margin 4500 -gr -ger --cache-type-v q8_0 --cache-type-k q8_0 --jinja --cache-ram 131072 -muge --prompt-cache-all --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-min 1 --draft-max 12 -rtr --cache-ram-similarity 0.45
But speculative decoding with these parameters makes the LLM slow with long context. I've made a file (sorry, it's in French) explaining everything; I'll post it once I've finished my tests.
