Testing IQ3_K

#3
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090


Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 176640
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6288.87 MiB
llama_new_context_with_model: KV self size = 6288.84 MiB, c^KV (q8_0): 6288.84 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11075.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1491.88 MiB
llama_new_context_with_model: graph nodes = 4075
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
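The KV numbers in the log above check out for MLA's compressed cache: with -mla 3 only c^KV is stored, one (kv_lora_rank + RoPE) row per token per layer, quantized to q8_0. A quick sanity check (the layer count of 61 and the 64 RoPE dims are my assumptions from the DeepSeek-style architecture, not printed in the log):

```python
# Back-of-envelope check of the "KV self size = 6288.84 MiB" line above.
# kv_lora_rank = 512 matches the 512 in the attn_kv_b shape in the log;
# n_layer = 61 and rope_dims = 64 are assumed DeepSeek-style values.
n_ctx = 176640
n_layer = 61
kv_lora_rank = 512
rope_dims = 64

elems = n_ctx * n_layer * (kv_lora_rank + rope_dims)
nbytes = elems // 32 * 34        # q8_0: 34 bytes per block of 32 values
print(f"{nbytes / 1024**2:.2f} MiB")   # -> 6288.84 MiB
```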

main: n_kv_max = 176640, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4090 | 1022 | 0 | 46.090 | 88.74 | 77.462 | 13.19 |
| 4090 | 1022 | 4090 | 46.290 | 88.36 | 72.298 | 14.14 |
| 4090 | 1022 | 8180 | 46.676 | 87.63 | 76.400 | 13.38 |
| 4090 | 1022 | 12270 | 47.055 | 86.92 | 66.061 | 15.47 |
| 4090 | 1022 | 16360 | 47.510 | 86.09 | 88.707 | 11.52 |
| 4090 | 1022 | 20450 | 47.875 | 85.43 | 67.692 | 15.10 |

2026-02-08_14-49
--merge-qkv
--ctx-size 176608
-amb 512
-ctk q8_0
-mla 3
--parallel 1
--threads 101
--no-mmap
--jinja
--special
--chat-template-file ./models/templates/Kimi-K2-Thinking.jinja
-b 4090 -ub 4090
--n-gpu-layers 99
--override-tensor exps=CPU

When I do some coding and then run tests, S_TG drops to 8.5.
I don't know why.

without no-mmap option:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4090 | 1022 | 0 | 45.584 | 89.72 | 92.669 | 11.03 |
| 4090 | 1022 | 4090 | 45.918 | 89.07 | 63.816 | 16.01 |
| 4090 | 1022 | 8180 | 47.111 | 86.82 | 70.393 | 14.52 |
| 4090 | 1022 | 12270 | 47.460 | 86.18 | 65.261 | 15.66 |
| 4090 | 1022 | 16360 | 47.849 | 85.48 | 88.664 | 11.53 |
| 4090 | 1022 | 20450 | 48.241 | 84.78 | 89.413 | 11.43 |

-b 4090 -ub 4090 -> 4096

main: n_kv_max = 176640, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 45.342 | 90.34 | 72.740 | 14.08 |
| 4096 | 1024 | 4096 | 45.603 | 89.82 | 63.989 | 16.00 |
| 4096 | 1024 | 8192 | 78.511 | 52.17 | 65.520 | 15.63 |
| 4096 | 1024 | 12288 | 78.933 | 51.89 | 80.054 | 12.79 |
| 4096 | 1024 | 16384 | 79.055 | 51.81 | 81.476 | 12.57 |
| 4096 | 1024 | 20480 | 79.561 | 51.48 | 81.695 | 12.53 |

I'm late, but it's a good model at IQ3.

prompt eval time = 2765.75 ms / 45 tokens ( 61.46 ms per token, 16.27 tokens per second)
eval time = 693698.26 ms / 5435 tokens ( 127.64 ms per token, 7.83 tokens per second)


./build/bin/llama-server \
  --model "/mnt/ExtraStorage/Models/Kimi-K2.5-IQ3_K-00001-of-00012.gguf" \
  --alias "KimiK2.5IQ3" \
  --slot-save-path "/tmp/claw_cache/mem" \
  --prompt-cache "/tmp/claw_cache/mem/step_35_base.bin" \
  --prompt-cache-all \
  -c 32768 -ctk q8_0 -ctv q8_0 \
  -b 4096 \
  -amb 2048 \
  -mla 3 \
  -fa on \
  -ub 4096 \
  -ngl 99 \
  -sm graph \
  -gr \
  -smgs \
  -ger \
  --n-cpu-moe 99 \
  -ts 1,1 \
  --parallel 1 \
  --threads 42 \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --special \
  --mirostat 3 \
  --mirostat-lr 0.05

-b 4090 -ub 4090 -> 4096

main: n_kv_max = 176640, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 45.342 | 90.34 | 72.740 | 14.08 |
| 4096 | 1024 | 4096 | 45.603 | 89.82 | 63.989 | 16.00 |
| 4096 | 1024 | 8192 | 78.511 | 52.17 | 65.520 | 15.63 |
| 4096 | 1024 | 12288 | 78.933 | 51.89 | 80.054 | 12.79 |
| 4096 | 1024 | 16384 | 79.055 | 51.81 | 81.476 | 12.57 |
| 4096 | 1024 | 20480 | 79.561 | 51.48 | 81.695 | 12.53 |

Are the 'low' PP numbers normal with such a setup? I have the same system with 2x5090 and my PP numbers are almost the same. What's the bottleneck?

@kzoltan

Are the 'low' PP numbers normal with such a setup? I have the same system with 2x5090 and my PP numbers are almost the same. What's the bottleneck?

'low' relative to what? Kimi-K2.5 is about the biggest model people run at home, so keep that in mind. It also uses MLA attention which, while needing less VRAM to store the kv-cache, uses more compute, pretty sure.

How are you handling NUMA in your BIOS, e.g. SNC=Disable, to get as much RAM bandwidth as possible into a single NUMA node? (Use Intel mlc to check the details.) Assuming you have 2x 5090s each at full PCIe Gen 5 x16 lanes, you might want to use an unquantized kv-cache, i.e. the default -ctk f16 (no need to specify -ctv on an MLA model; it uses whatever -ctk is for both); that can possibly help. There is no -sm graph support for MLA models yet either, pretty sure, so give your full command here if you want to workshop it. I also recommend having a script for each start configuration and testing them with llama-sweep-bench, as shown above, to figure out the best strategy for your specific workload, e.g. "maximize PP". You might also be able to go up to -ub 8192 -b 8192 in some cases for a little more PP.

Sure, thanks for the offer :)

Relative to these results (although some of those machines are very different): https://www.reddit.com/r/LocalLLaMA/comments/1qriwnv/post_your_hardwaresoftwaremodel_quant_and/
My setup is a bit tricky as I use KVM (with 1G hugepages, CPU pinning, PCI passthrough). This should not cause a significant drop in performance.

For NUMA, I have a flat setup with a single node (command output from host):

numactl --hardware

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 515265 MB
node 0 free: 8921 MB
node distances:
node 0
0: 10

MLC shows this on the host (it is a few percent less on the guest):

./mlc --peak_injection_bandwidth

Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --peak_injection_bandwidth
Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
*** Unable to modify prefetchers (try executing 'modprobe msr')
*** So, enabling random access for latency measurements
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 230077.0
3:1 Reads-Writes : 198895.5
2:1 Reads-Writes : 190655.3
1:1 Reads-Writes : 177761.8
Stream-triad like: 196297.9

I only have the 5090s on PCIe Gen4 x16 because of the need for risers.
-b 8192 -ub 8192 doubles the PP; I'm just trying to avoid it because of occasional shorter prompts (I'm not entirely sure this makes sense, though).

This is my test script:

#!/bin/bash

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2

MODEL_PATH=/mnt/1/models/ubergarm/Kimi-K2.5-GGUF/IQ3_K/Kimi-K2.5-IQ3_K-00001-of-00012.gguf

/mnt/1/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  --merge-qkv \
  -c 65565 \
  -ctk f16 \
  -amb 512 \
  -mla 3 \
  -gr \
  --threads 108 \
  -b 8192 -ub 8192 \
  -ngl 999 \
  -ot exps=CPU \
  --warmup-batch \
  -n 128

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 4096 | 128 | 0 | 46.866 | 87.40 | 9.848 | 13.00 |
| 4096 | 128 | 4096 | 47.433 | 86.35 | 10.083 | 12.69 |
| 4096 | 128 | 8192 | 47.595 | 86.06 | 10.269 | 12.47 |
| 4096 | 128 | 12288 | 47.960 | 85.40 | 10.389 | 12.32 |
| 4096 | 128 | 16384 | 48.446 | 84.55 | 10.357 | 12.36 |
| 4096 | 128 | 20480 | 48.888 | 83.78 | 10.370 | 12.34 |
| 4096 | 128 | 24576 | 49.204 | 83.25 | 10.187 | 12.57 |
| 4096 | 128 | 28672 | 49.569 | 82.63 | 9.937 | 12.88 |
| 4096 | 128 | 32768 | 52.001 | 78.77 | 10.103 | 12.67 |
| 4096 | 128 | 36864 | 50.271 | 81.48 | 10.108 | 12.66 |
| 4096 | 128 | 40960 | 50.708 | 80.78 | 10.305 | 12.42 |
| 4096 | 128 | 45056 | 51.073 | 80.20 | 10.541 | 12.14 |
| 4096 | 128 | 49152 | 51.475 | 79.57 | 10.904 | 11.74 |
| 4096 | 128 | 53248 | 51.808 | 79.06 | 10.898 | 11.75 |
| 4096 | 128 | 57344 | 52.415 | 78.15 | 11.126 | 11.50 |
| 4096 | 128 | 61440 | 52.563 | 77.93 | 10.928 | 11.71 |

/mnt/1/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  --merge-qkv \
  -c 65565 \
  -ctk f16 \
  -amb 512 \
  -mla 3 \
  -gr \
  --threads 108 \
  -b 8192 -ub 8192 \
  -ngl 999 \
  -ot exps=CPU \
  --warmup-batch \
  -n 128

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 8192 | 128 | 0 | 51.988 | 157.57 | 10.159 | 12.60 |
| 8192 | 128 | 8192 | 53.208 | 153.96 | 10.301 | 12.43 |
| 8192 | 128 | 16384 | 54.641 | 149.92 | 10.515 | 12.17 |
| 8192 | 128 | 24576 | 56.219 | 145.72 | 10.684 | 11.98 |
| 8192 | 128 | 32768 | 57.549 | 142.35 | 10.891 | 11.75 |
| 8192 | 128 | 40960 | 59.056 | 138.72 | 11.207 | 11.42 |
| 8192 | 128 | 49152 | 62.427 | 131.23 | 19.938 | 6.42 |
| 8192 | 128 | 57344 | 61.776 | 132.61 | 11.768 | 10.88 |
Owner • edited Feb 22

@kzoltan

Relative to these results (although some of those machines are very diff)

Many of those rigs have RTX 6000 PRO GPUs and/or are running sglang, it seems. Some of the llama.cpp reports from that reddit thread have much lower scores than yours. Even some of the faster reports suggest that throughput drops off quite a bit at long context, which I think makes sense given the additional MLA compute required.

Also, a wise man once said, "comparison is a bitch" 😅

How much speed do you need for your desired workloads? There are many good model options available now if you prefer high PP for processing long context etc. I'm guessing most people will want at least a couple models e.g. big Kimi-K2.5 or GLM-5 for slow grunt work laying out a new vibe code project that doesn't have much context yet. Then possibly smaller models like GLM-4.7-Flash or Qwen3-Coder-Next fully offloaded on GPU for zippy small refactoring jobs etc. Also don't sleep on the new Qwen3.5 MoE as it is a good blend of speed and quality imo.

MLC shows this on the host (it is a few percent less on the guest) 😀

Okay, about 230 GB/s read bandwidth is pretty good, and it will set the upper bound of your token-generation speed for most quants. PP, though, is mostly compute-bottlenecked, and possibly affected by the PCIe Gen4 link depending on whether it is doing the routed-expert offloading stuff (e.g. experiment with --offload-only-active-experts or -ooae, maybe: https://github.com/ikawrakow/ik_llama.cpp/pull/698). There are also advanced -cuda ... things I see people use, but I don't use them much myself.
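As a rough sanity check of that upper bound (the active parameter count and bits-per-weight below are ballpark assumptions on my part, not measured values):

```python
# Each generated token must stream the active weights from RAM once, so
# TG is roughly bounded by bandwidth / bytes-per-token. All inputs are guesses.
bandwidth = 230e9        # B/s, the mlc all-reads number above
active_params = 32e9     # ~32B active parameters per token for a big MoE
bits_per_weight = 3.5    # ballpark average for an IQ3_K mix

bytes_per_token = active_params * bits_per_weight / 8
print(f"~{bandwidth / bytes_per_token:.1f} tok/s upper bound")
```

The result lands in the same ballpark as the 12-16 t/s seen in the sweeps, which is why TG barely moves with GPU or batch-size changes.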

-b 8192 -ub 8192 doubles the PP, I'm just trying to avoid it because of occasional shorter prompts (I'm not entirely sure this makes sense though).

You might even be able to push up to -ub 16384 -b 16384, but personally I stick around -ub 4096 -b 4096 for testing. That said, I just started going up to 8k with Qwen3.5 MoE, as the extra prompt processing speed makes a difference when vibe coding over 65k context. Play around with it and see, because the extra PP likely outweighs the added latency on small batches for a typical vibe coding session.
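To put rough numbers on that tradeoff, here is the prompt-processing time for a hypothetical 60k-token context at the zero-depth S_PP rates from the two sweeps above (simple arithmetic only, ignoring the falloff at depth):

```python
prompt_tokens = 60_000
s_pp = {4096: 87.40, 8192: 157.57}   # t/s at depth 0 from the sweeps above

for ubatch, rate in s_pp.items():
    print(f"-ub {ubatch}: ~{prompt_tokens / rate:.0f} s to process the prompt")
```

Roughly eleven minutes versus six; for short prompts the absolute difference shrinks to seconds either way, so the larger ubatch mostly costs VRAM, not latency.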

This is my test script:

Thanks. Okay, I believe you copy-paste errored the first one, as its result shows a 4096 batch size, but I understand.

A few things to consider:

  1. You're all good: the single NUMA node makes this easier. Keep in mind, though, that the software is not really NUMA-optimized.
  2. You are likely leaving a lot of VRAM on the table: you have 2x5090 = 64GB total, you are not offloading additional routed-expert layers, and the kv-cache is already very efficient. You can check with nvidia-smi, nvitop, nvtop, or btop to confirm the large amount of unused VRAM.
  3. Have you searched around for the optimum settings for threads (used for TG) and threads-batch (used for PP)? Usually I set threads lower than threads-batch, as PP benefits from more threads but TG can become worse due to memory contention. If you only specify threads, it is used for both settings.
  4. You might be able to squeeze a little more speed out by increasing the -amb buffer to, say, 1024 or 2048 just to see, but don't expect much.
  5. I removed -gr as that is on by default now: https://github.com/ikawrakow/ik_llama.cpp/pull/1094
  6. I added -ger to try, but honestly I'm not sure which models it applies to.
  7. I added an explicit example of additional routed-expert layer offloads; increase it as far as possible until you OOM on VRAM.
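For points 2 and 7, you can rough-size how many extra routed-expert layers fit before touching the command. Every number here is an illustrative assumption, not a measurement: roughly 440 GB of routed-expert weights at ~3.5 bpw spread over 61 layers, and ~22 GB of usable headroom per 32 GB 5090 after weights and compute buffers.

```python
# Rough VRAM budgeting for extra exps layer offloads; all inputs are guesses.
exps_total_gb = 440          # assumed total size of all routed-expert tensors
n_moe_layers = 61
headroom_gb = 22             # assumed free VRAM per GPU after buffers

per_layer_gb = exps_total_gb / n_moe_layers   # ~7.2 GB per layer
print(int(headroom_gb // per_layer_gb), "extra layers per GPU")
```

With these guesses that works out to about three layers per GPU, which is why the example -ot overrides pick three layers each.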

So here is a potentially workshopped command you could try:

#!/bin/bash

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2

MODEL_PATH=/mnt/1/models/ubergarm/Kimi-K2.5-GGUF/IQ3_K/Kimi-K2.5-IQ3_K-00001-of-00012.gguf

/mnt/1/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "$MODEL_PATH" \
  -c 65565 \
  -mla 3 \
  -amb 1024 \
  -ctk f16 \
  -ger \
  --merge-qkv \
  -b 8192 -ub 8192 \
  -ngl 999 \
  -ot "blk\.(3|4|5)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(58|59|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ot exps=CPU \
  --threads 96 \
  --threads-batch 108 \
  --no-mmap \
  --warmup-batch \
  -n 128

EDIT

Also, for actual llama-server use as a vibe-coding LLM, there are many cache options that might help you avoid re-processing similar-enough contexts.

I don't know if any of this stuff helps, but you get the idea of the kinds of things you could try or research:

    --slot-save-path ./slot-saves/ \
    --slot-prompt-similarity 0.1 \
    --cache-ram 65536 \
    --cache-ram-n-min 128 \
    --cache-ram-similarity 1 \
    --context-shift on

Thanks for the review and the settings. To give you a bit of context:

I have 4 GPUs total (5, but the last one is for displays only): 2x RTX Pro 6000 Max-Q on PCIe Gen5 x16 (they run Minimax M2.5 AWQ) and 2x 5090 on PCIe Gen4 x16 for anything else. At work I only have access to closed frontier models for coding, and good planning can make a huge difference even with those, so I had the idea of using the 5090s with the experts on the CPU to run a larger model for planning (while keeping Minimax loaded), exactly as you wrote. Kimi K2.5 is just the first try, as I saw great reports from people about its speed and capability. The planner model does not need to be super fast (~10 t/s TG is just fine), but faster PP is still very useful here (I do planning both at the start of a project AND on subtasks, so the context can grow to 40-80k).

The results with your initial script (with the GPU memory filled):

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 8192 | 128 | 0 | 49.897 | 164.18 | 8.878 | 14.42 |
| 8192 | 128 | 8192 | 51.648 | 158.61 | 9.357 | 13.68 |
| 8192 | 128 | 16384 | 52.976 | 154.64 | 9.372 | 13.66 |
| 8192 | 128 | 24576 | 54.710 | 149.74 | 9.533 | 13.43 |
| 8192 | 128 | 32768 | 56.085 | 146.07 | 9.708 | 13.19 |
| 8192 | 128 | 40960 | 57.727 | 141.91 | 9.868 | 12.97 |
| 8192 | 128 | 49152 | 59.296 | 138.15 | 10.099 | 12.67 |
| 8192 | 128 | 57344 | 60.923 | 134.46 | 10.308 | 12.42 |

What's interesting is that when doing PP, one GPU sits around 50% utilization and nvidia-smi shows ~10 GiB/s Rx; is this expected? Now that I've looked into this, it seems the second RTX Pro is only running at Gen4 x16. This might be an issue with the ES CPU (one user on the linked LocalLLaMA page reports far better PP on an Intel CPU with half the cores, though with different software).

Just to see how it looks, I ran the same test with the RTX Pro 6000 on Gen5 x16 (with approximately the same offloading):

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 8192 | 128 | 0 | 49.565 | 165.28 | 9.089 | 14.08 |
| 8192 | 128 | 8192 | 51.219 | 159.94 | 9.326 | 13.73 |
| 8192 | 128 | 16384 | 52.667 | 155.54 | 9.533 | 13.43 |
| 8192 | 128 | 24576 | 54.379 | 150.65 | 9.753 | 13.12 |
| 8192 | 128 | 32768 | 55.889 | 146.58 | 9.973 | 12.83 |
| 8192 | 128 | 40960 | 57.747 | 141.86 | 10.167 | 12.59 |
| 8192 | 128 | 49152 | 59.336 | 138.06 | 10.288 | 12.44 |
| 8192 | 128 | 57344 | 60.924 | 134.46 | 10.336 | 12.38 |

The results are not bad, but I still suspect there is a setup/hardware issue with my system (or maybe the ES CPU has some surprises).

@kzoltan

Oh, sounds like a very nice setup with multiple models! What client are you using for your vibe-coding harness? opencode or something else? Does it automatically make use of two models, e.g. a big slow planner and a smaller, faster model too?

What's interesting is that when doing PP, one GPU is around 50% utilization and nvidia-smi shows ~10GiB/s Rx, is this expected?

This is with Kimi-K2.5, right? It does not support -sm graph tensor parallel and uses the old -sm layer, with which usually only a single GPU at a time will max out, I think. I too like to watch with nvtop in one tmux pane to see what is going on. If you can use -sm graph with one of the many non-MLA models, both GPUs will max out at almost 95% most of the time.

Prompt processing tends to be compute-bottlenecked. Does your ES CPU have real avx512 instructions, specifically lscpu | grep avx512_vnni? ik_llama.cpp gives a big boost for PP with that; it will sometimes print out HAVE FANCY SIMD or something similar when detected.

I use opencode. Although I think all agents could be configured to use different models in opencode, I'm switching models manually for the main tasks for now and have only configured subagents with specific models.

Yes, this is still Kimi K2.5 IQ3_K (I will try others soon, especially with the graph mode). I think I need to rephrase. When prompt processing:

  • The CPU cores seem almost idle,
  • The first GPU is around 50% utilization (with ~10GiB/s Rx transfer rate),
  • The second GPU is utilized less than 10% (with very low transfer rates).

It seems like the GPUs are bottlenecked by something; I'm just not sure what. That would explain my lower PP results compared to others.

To answer your question about the instructions: the flag is there on both host and guest.

lscpu | grep --color=auto avx512_vnni # on the host

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm avx512_vp2intersect md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities ibpb_exit_to_user

$ lscpu | grep --color=auto avx512_vnni # on the guest
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm avx512_vp2intersect md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities

@kzoltan

The CPU cores seem almost idle,
The first GPU is around 50% utilization (with ~10GiB/s Rx transfer rate),
The second GPU is utilized less than 10% (with very low transfer rates).

Because MLA models like Kimi-K2.5, DeepSeek, and the new GLM-5 don't yet have -sm graph support, I think this is normal?

You'd have to check with some other multi-GPU folks, or take some extra time and maybe compare with SGLang and one of their special quants, etc.

I haven't yet run a good llama-sweep-bench test with Kimi-K2.5 on the dual RTX A6000 rig (Threadripper Pro Zen 4 with 24 cores, no avx512_vnni, 256GB DDR5-4800 at ~220GB/s RAM bandwidth) myself.

Also heads up, new Qwen models incoming: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF

Thank you for all your help!
Just to make sure, I ran a test on my KVM host to see if something was off with the virtualization, but the result is almost the same. The good news is that 1GB hugepages still make a difference, and with them the virtualization overhead is almost zero. I will try sglang eventually, but Kimi is a bit too large for my 512GB RAM, so I can't fit the version everyone else is using.

Is there a list of models supported by --graph parallel? Those new models might be useful for the same purpose...

Is there a list of models supported by --graph parallel? Those new models might be useful for the same purpose...

Yes, the best list is right here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L1910-L1926

ik is working more on -sm graph for gated delta net stuff as mentioned in this recent PR: https://github.com/ikawrakow/ik_llama.cpp/pull/1320
