can someone help me with how to offload tensors

#12
by theracn - opened

Please excuse my noobness. I have 3x 3090 and 1x 4070 (on an M.2 PCIe adapter), 84 GB VRAM in total. One of the 3090s is on a 16-lane PCIe slot, the rest are x4. I have 96 GB of DDR4 RAM and a 13700K CPU. What would be the best/fastest command (tensor offloading) to run the 160 GB model with, for example, 32k or 65k context?

Owner

It can be a bit intimidating to get your first commands dialed in for your rig, but it gets easier with some practice.

You can check here for some explanation: https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0

Also check out the BeaverAI Discord; there is a lot of discussion of running MoEs on various rigs similar to yours: https://huggingface.co/BeaverAI

Also check out some of my other model cards for possibly some other ways to do it.

Given you're using ik_llama.cpp, you'll want to use -sm graph with this model for the best speed across multiple GPUs.
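For reference, a command along these lines is roughly the shape to aim for. This is a minimal sketch, not a known-good config: the model path, --n-cpu-moe count, and tensor split are placeholders you'd tune for your own rig.

```bash
# Hypothetical starting point for 3x 3090s on ik_llama.cpp -- tune for your rig.
# -ngl 99 offloads everything, then --n-cpu-moe overrides the bulky MoE
# expert tensors back to system RAM; attention and dense layers stay on GPU.
./llama-server \
  --model /path/to/model-00001-of-00004.gguf \
  -ngl 99 \
  --n-cpu-moe 50 \
  -sm graph \
  -ts 34,33,33 \
  --ctx-size 32768 \
  --threads 8
```

The general idea: lower --n-cpu-moe until you run out of VRAM, then back off a layer or two, since every expert layer you keep on GPU is bandwidth you don't spend on system RAM.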

Thanks for the fast answer <3! The Discord link is actually not working, it says failed to join. I think I need to be invited; dragonsword1 is my username if you can invite me >.<. And I'm going to give it a read. Well, I got it running, but a bit on the edge of my limits to be honest xD. It's going at 5 PP and 2 TG. I think I'm still doing it somehow wrong: my CPU usage doesn't go above 35% while GPU 1 goes to 100% and the others are at 50%. I'll just paste my log, maybe you can tell me what went wrong here:

C:\Program Files\Microsoft Visual Studio\2022\Community>D:\iklama\ik_llama.cpp\build\bin\Release\llama-server.exe ^
More? --model "D:\models\GLM 4.7\4.6\GLM-4.6-IQ3_KS-00001-of-00004.gguf" ^
More? --device CUDA0,CUDA1,CUDA2 ^
More? --ctx-size 8000 ^
More? -sm graph -mea 256 ^
More? -smgs ^
More? -ngl 99 ^
More? --n-cpu-moe 58 ^
More? --cache-type-k q4_0 ^
More? --cache-type-v q4_0 ^
More? -ts 34,33,33 ^
More? -b 512 -ub 512 ^
More? --threads 24 ^
More? --parallel 1 ^
More? --host 127.0.0.1 ^
More? --port 8085 ^
More? --no-mmap ^
More? --jinja
INFO [ main] build info | tid="11572" timestamp=1770585122 build=4189 commit="e22b2d12"
INFO [ main] system info | tid="11572" timestamp=1770585122 n_threads=24 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
CUDA0: using device CUDA0 - 23304 MiB free
CUDA1: using device CUDA1 - 23304 MiB free
CUDA2: using device CUDA2 - 23304 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 50 key-value pairs and 1759 tensors from D:\models\GLM 4.7\4.6\GLM-4.6-IQ3_KS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = GLM 4.6
llama_model_loader: - kv 3: general.version str = 4.6
llama_model_loader: - kv 4: general.basename str = GLM
llama_model_loader: - kv 5: general.size_label str = 160x19B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 9: glm4moe.block_count u32 = 93
llama_model_loader: - kv 10: glm4moe.context_length u32 = 202752
llama_model_loader: - kv 11: glm4moe.embedding_length u32 = 5120
llama_model_loader: - kv 12: glm4moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 13: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 14: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: glm4moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 18: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 19: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 20: general.file_type u32 = 154
llama_model_loader: - kv 21: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: glm4moe.expert_count u32 = 160
llama_model_loader: - kv 23: glm4moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 24: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 25: glm4moe.leading_dense_block_count u32 = 3
llama_model_loader: - kv 26: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 27: glm4moe.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 28: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 29: glm4moe.nextn_predict_layers u32 = 1
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 32: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,151552] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 151331
llama_model_loader: - kv 39: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 40: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 41: tokenizer.ggml.eom_token_id u32 = 151338
llama_model_loader: - kv 42: tokenizer.chat_template str = [gMASK]\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 43: quantize.imatrix.file str = /mnt/data/models/ubergarm/GLM-4.6-GGU...
llama_model_loader: - kv 44: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
llama_model_loader: - kv 45: quantize.imatrix.entries_count i32 = 1001
llama_model_loader: - kv 46: quantize.imatrix.chunks_count i32 = 814
llama_model_loader: - kv 47: split.no u16 = 0
llama_model_loader: - kv 48: split.count u16 = 4
llama_model_loader: - kv 49: split.tensors.count i32 = 1759
llama_model_loader: - type f32: 835 tensors
llama_model_loader: - type q8_0: 193 tensors
llama_model_loader: - type iq4_k: 1 tensors
llama_model_loader: - type iq6_k: 1 tensors
llama_model_loader: - type iq4_kss: 90 tensors
llama_model_loader: - type iq5_ks: 459 tensors
llama_model_loader: - type iq3_ks: 180 tensors
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151329 ('<|endoftext|>')
load: - 151336 ('<|user|>')
load: - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = glm4moe
llm_load_print_meta: n_ctx_train = 202752
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 93
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 160
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 202752
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 355B.A32B
llm_load_print_meta: model ftype = IQ3_KS - 3.1875 bpw
llm_load_print_meta: model params = 356.786 B
llm_load_print_meta: model size = 148.390 GiB (3.573 BPW)
llm_load_print_meta: repeating layers = 147.385 GiB (3.564 BPW, 355.234 B parameters)
llm_load_print_meta: general.name = GLM 4.6
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151331 '[gMASK]'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: EOM token = 151338 '<|observation|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151329 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151347 '<|code_prefix|>'
print_info: FIM SUF token = 151349 '<|code_suffix|>'
print_info: FIM MID token = 151348 '<|code_middle|>'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: EOG token = 151338 '<|observation|>'
print_info: max token length = 1024
llm_load_tensors: ggml ctx size = 10.15 MiB
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
[... same three lines repeated for blk.4 through blk.57 ...]
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 41336832 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 41308160 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_exps.weight (size = 501841920 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_exps.weight (size = 501841920 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_exps.weight (size = 632422400 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 5167104 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 5181440 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 5167104 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
================================ max_gpu = 0
Estimated model buffer size per device:
Device 0: 21610.16 MiB
Device 1: 21282.19 MiB
Device 2: 21274.76 MiB
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 416.25 MiB
llm_load_tensors: CPU buffer size = 85817.19 MiB
llm_load_tensors: CUDA0 buffer size = 612.83 MiB
llm_load_tensors: CUDA_Split buffer size = 64168.64 MiB
.........................................................~ggml_backend_cuda_context: have 0 graphs
...........................................
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 1
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Split KV buffer size = 828.07 MiB
llama_kv_cache_init: KV cache size per device:
Device 0: 298.125 MiB
Device 1: 265.5 MiB
Device 2: 264.375 MiB
llama_new_context_with_model: KV self size = 828.00 MiB, K (q4_0): 414.00 MiB, V (q4_0): 414.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 311.00 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 118.00 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 126.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 122.01 MiB
llama_new_context_with_model: graph nodes = 10824
llama_new_context_with_model: graph splits = 847
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
XXXXXXXX Split Mode Graph Scheduling is FORCED despite tensor overrides due to user choice.
XXXXXXXX It may or might NOT infer properly due to unsupported combinations between SMGS and every possible tensor overrides.
Failed to open logfile 'llama.log' with error 'Permission denied'
[1770585260] llama_init_from_gpt_params: setting dry_penalty_last_n to ctx_size = 8192
[1770585260] warming up the model with an empty run
INFO [ init] initializing slots | tid="11572" timestamp=1770585264 n_slots=1
INFO [ init] new slot | tid="11572" timestamp=1770585264 id_slot=0 n_ctx_slot=8192
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: , end:
use --reasoning-tokens none to disable.
prompt cache is enabled, size limit: 8192 MiB
use --cache-ram 0 to disable the prompt cache
INFO [ main] model loaded | tid="11572" timestamp=1770585264
INFO [ main] chat template | tid="11572" timestamp=1770585264 chat_template="[gMASK]\n{%- if tools -%}\n<|system|>\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n{% endfor %}\n\n\nFor each function call, output the function name and arguments within the following XML format:\n{function-name}\n{arg-key-1}\n{arg-value-1}\n{arg-key-2}\n{arg-value-2}\n...\n{%- endif -%}\n{%- macro visible_text(content) -%}\n {%- if content is string -%}\n {{- content }}\n {%- elif content is iterable and content is not mapping -%}\n {%- for item in content -%}\n {%- if item is mapping and item.type == 'text' -%}\n {{- item.text }}\n {%- elif item is string -%}\n {{- item }}\n {%- endif -%}\n {%- endfor -%}\n {%- else -%}\n {{- content }}\n {%- endif -%}\n{%- endmacro -%}\n{%- set ns = namespace(last_user_index=-1) %}\n{%- for m in messages %}\n {%- if m.role == 'user' %}\n {% set ns.last_user_index = loop.index0 -%}\n {%- endif %}\n{%- endfor %}\n{% for m in messages %}\n{%- if m.role == 'user' -%}<|user|>\n{% set content = visible_text(m.content) %}{{ content }}\n{{- '/nothink' if (enable_thinking is defined and not enable_thinking and not content.endswith("/nothink")) else '' -}}\n{%- elif m.role == 'assistant' -%}\n<|assistant|>\n{%- set reasoning_content = '' %}\n{%- set content = visible_text(m.content) %}\n{%- if m.reasoning_content is string %}\n {%- set reasoning_content = m.reasoning_content %}\n{%- else %}\n {%- if '' in content %}\n {%- set reasoning_content = content.split('')[0].rstrip('\n').split('')[-1].lstrip('\n') %}\n {%- set content = content.split('')[-1].lstrip('\n') %}\n {%- endif %}\n{%- endif %}\n{%- if loop.index0 > ns.last_user_index and reasoning_content -%}\n{{ '\n' + reasoning_content.strip() + ''}}\n{%- else -%}\n{{ '\n' }}\n{%- endif -%}\n{%- if content.strip() -%}\n{{ '\n' + content.strip() 
}}\n{%- endif -%}\n{% if m.tool_calls %}\n{% for tc in m.tool_calls %}\n{%- if tc.function %}\n {%- set tc = tc.function %}\n{%- endif %}\n{{ '\n' + tc.name }}\n{% set _args = tc.arguments %}\n{% for k, v in _args.items() %}\n{{ k }}\n{{ v | tojson(ensure_ascii=False) if v is not string else v }}\n{% endfor %}\n{% endfor %}\n{% endif %}\n{%- elif m.role == 'tool' -%}\n{%- if m.content is string -%}\n{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n {{- '<|observation|>' }}\n{%- endif %}\n{{- '\n\n' }}\n{{- m.content }}\n{{- '\n' }}\n{%- else -%}\n<|observation|>{% for tr in m.content %}\n\n\n{{ tr.output if tr.output is defined else tr }}\n{% endfor -%}\n{% endif -%}\n{%- elif m.role == 'system' -%}\n<|system|>\n{{ visible_text(m.content) }}\n{%- endif -%}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n <|assistant|>{{- '\n' if (enable_thinking is defined and not enable_thinking) else '' -}}\n{%- endif -%}"
INFO [ main] chat template | tid="11572" timestamp=1770585264 chat_example="[gMASK]<|system|>\nYou are a helpful assistant<|user|>\nHello<|assistant|>\n\nHi there<|user|>\nHow are you?<|assistant|>" built_in=true
INFO [ main] HTTP server listening | tid="11572" timestamp=1770585264 hostname="127.0.0.1" port="8085" n_threads_http="23"
INFO [ slots_idle] all slots are idle | tid="11572" timestamp=1770585264
INFO [ log_server_request] request | tid="31552" timestamp=1770586941 remote_addr="127.0.0.1" remote_port=53794 status=200 method="GET" path="/" params={}
INFO [ log_server_request] request | tid="31552" timestamp=1770586942 remote_addr="127.0.0.1" remote_port=53794 status=200 method="GET" path="/v1/props" params={}
INFO [ log_server_request] request | tid="23000" timestamp=1770586954 remote_addr="127.0.0.1" remote_port=64849 status=200 method="GET" path="/v1/props" params={}
INFO [ log_server_request] request | tid="23000" timestamp=1770586954 remote_addr="127.0.0.1" remote_port=64849 status=200 method="GET" path="/v1/props" params={}
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
[1770586954] Preserved token: 151329
[... 29 more preserved special tokens, 151330 through 151360 ...]
INFO [ launch_slot_with_task] slot is processing task | tid="11572" timestamp=1770586954 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="11572" timestamp=1770586954 id_slot=0 id_task=0 p0=0
[1770587114] Matched tool start: "<"
[... repeated 38 more times during generation ...]
slot print_timing: id 0 | task -1 |
prompt eval time = 4150.02 ms / 21 tokens ( 197.62 ms per token, 5.06 tokens per second)
eval time = 3082821.91 ms / 6234 tokens ( 494.52 ms per token, 2.02 tokens per second)
total time = 3086971.92 ms / 6255 tokens
INFO [ log_server_request] request | tid="23000" timestamp=1770590041 remote_addr="127.0.0.1" remote_port=64849 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ release_slots] slot released | tid="11572" timestamp=1770590041 id_slot=0 id_task=0 n_ctx=8192 n_past=6254 n_system_tokens=0 n_cache_tokens=6254 truncated=false
INFO [ slots_idle] all slots are idle | tid="11572" timestamp=1770590041
INFO [ slots_idle] all slots are idle | tid="11572" timestamp=1770590358
~ggml_backend_cuda_context: have 480 graphs
~ggml_backend_cuda_context: have 368 graphs
~ggml_backend_cuda_context: have 734 graphs
Received second interrupt, terminating immediately.

Owner

@theracn

I think I need to be invited; dragonsword1 is my username.

Oh hrmm, I think I sent you a request to message on Discord (ubergarm is me); I'll send you the link on there.

It's going at 5 PP and 2 TG. I think I'm still doing it somehow wrong: my CPU usage doesn't go above 35% while GPU 1 goes to 100% and the others are at 50%.

Hrmm, it does seem kind of slow still, given you've fully offloaded it with --no-mmap so none of it is sitting on disk.

You're on the right track, but you jumped right into the deep end trying to load up such a big quant before getting the hang of it with a smaller model. A few thoughts, though:

  • I only see 3x 3090s being detected; not sure where your 4070 is or why it's not showing up?
  • You can play games with the ordering of the devices; setting the fastest PCIe card as device 0 and using -mg 0 might help. I think you can also re-order devices with env vars (e.g. CUDA_VISIBLE_DEVICES).
  • What speed DDR4 do you have? In theory that memory bandwidth should be the bottleneck. You can check it with AIDA64 or Intel Memory Latency Checker (mlc).
  • Do you have any Linux install to try with? It's often faster than Windows.
  • How many physical cores does your 13700K have? Only 8 P-cores? You have threads set too high; try --threads 8 --threads-batch 16 perhaps.
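To put a rough number on the memory-bandwidth point: here's a back-of-envelope sketch of the token-generation ceiling. The model figures come from your log (8 active experts, 5120 embedding dim, 1536 expert FFN dim, 3.573 BPW, --n-cpu-moe 58); the 65 GB/s bandwidth is an assumed typical dual-channel DDR4 figure, not a measurement, and this ignores caching and PCIe traffic entirely.

```python
# Back-of-envelope upper bound on token generation when MoE expert
# tensors sit in system RAM. Each token must stream the weights of the
# active experts in every CPU-resident layer out of RAM.
n_active_experts = 8      # glm4moe.expert_used_count
n_embd = 5120             # glm4moe.embedding_length
expert_ffn = 1536         # glm4moe.expert_feed_forward_length
bpw = 3.573               # bits per weight of this quant (from the log)
cpu_moe_layers = 58       # --n-cpu-moe 58

# Each active expert reads its up, gate, and down projections per token.
params_per_layer = n_active_experts * 3 * n_embd * expert_ffn
bytes_per_token = cpu_moe_layers * params_per_layer * bpw / 8

ram_bw = 65e9             # assumed ~65 GB/s dual-channel DDR4 -- measure yours
tg_upper_bound = ram_bw / bytes_per_token

print(f"{bytes_per_token / 1e9:.2f} GB read from RAM per token")
print(f"~{tg_upper_bound:.1f} tok/s theoretical ceiling")
```

Getting ~2 TG against a ceiling in the low teens suggests something other than RAM bandwidth is the limiter here, which is why the thread count, device ordering, and x4 PCIe links are worth poking at.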

Keep massaging it, and feel free to try a smaller, simpler model first to get the hang of loading things. You can also benchmark with llama-sweep-bench to compare your commands and see what works best on your hardware.
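One more sanity check for planning 32k or 65k context: the "KV self size = 828.00 MiB" line in your log can be reproduced from the model metadata, which makes it easy to budget VRAM before loading. A sketch, assuming q4_0's block layout (a 2-byte scale plus 16 bytes of 4-bit values per 32 weights):

```python
# Reproduce the log's "KV self size = 828.00 MiB" and project it to
# larger contexts. Values come from the model metadata in the log:
# 93 blocks minus the unused NextN layer, n_embd_k_gqa = 1024, q4_0 cache.
n_layer = 93 - 1                 # the NextN layer (blk.92) is skipped at load
n_embd_kv = 1024                 # n_embd_k_gqa == n_embd_v_gqa
q4_0_bytes_per_elem = 18 / 32    # 18 bytes per block of 32 values

def kv_cache_mib(n_ctx: int) -> float:
    """Total K+V cache size in MiB for a given context length."""
    per_cache = n_ctx * n_embd_kv * n_layer * q4_0_bytes_per_elem
    return 2 * per_cache / 2**20  # K cache plus V cache

print(kv_cache_mib(8192))   # -> 828.0, matching the log
print(kv_cache_mib(32768))  # -> 3312.0, i.e. ~3.2 GiB at 32k context
```

So the KV cache itself scales gently with q4_0; what usually bites first at long context is the compute buffer and the model weights you've already committed to VRAM.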
