Testing IQ3_K

#3
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090


Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
===================================== llama_new_context_with_model: f16
llama_new_context_with_model: n_ctx = 176640
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: reduce_type = f16
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6288.87 MiB
llama_new_context_with_model: KV self size = 6288.84 MiB, c^KV (q8_0): 6288.84 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11075.36 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1491.88 MiB
llama_new_context_with_model: graph nodes = 4075
llama_new_context_with_model: graph splits = 122
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
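The KV numbers in the log above check out for MLA's compressed cache: with -mla 3 only c^KV is stored, one (kv_lora_rank + RoPE) row per token per layer, quantized to q8_0. A quick sanity check (the layer count of 61 and the 64 RoPE dims are my assumptions from the DeepSeek-style architecture, not printed in the log):

```python
# Back-of-envelope check of the "KV self size = 6288.84 MiB" line above.
# kv_lora_rank = 512 matches the 512 in the attn_kv_b shape in the log;
# n_layer = 61 and rope_dims = 64 are assumed DeepSeek-style values.
n_ctx = 176640
n_layer = 61
kv_lora_rank = 512
rope_dims = 64

elems = n_ctx * n_layer * (kv_lora_rank + rope_dims)
nbytes = elems // 32 * 34        # q8_0: 34 bytes per block of 32 values
print(f"{nbytes / 1024**2:.2f} MiB")   # -> 6288.84 MiB
```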

main: n_kv_max = 176640, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4090 | 1022 | 0 | 46.090 | 88.74 | 77.462 | 13.19 |
| 4090 | 1022 | 4090 | 46.290 | 88.36 | 72.298 | 14.14 |
| 4090 | 1022 | 8180 | 46.676 | 87.63 | 76.400 | 13.38 |
| 4090 | 1022 | 12270 | 47.055 | 86.92 | 66.061 | 15.47 |
| 4090 | 1022 | 16360 | 47.510 | 86.09 | 88.707 | 11.52 |
| 4090 | 1022 | 20450 | 47.875 | 85.43 | 67.692 | 15.10 |

2026-02-08_14-49
--merge-qkv
--ctx-size 176608
-amb 512
-ctk q8_0
-mla 3
--parallel 1
--threads 101
--no-mmap
--jinja
--special
--chat-template-file ./models/templates/Kimi-K2-Thinking.jinja
-b 4090 -ub 4090
--n-gpu-layers 99
--override-tensor exps=CPU

When I do some coding and then run tests, S_TG drops to 8.5.
I don't know why.

without no-mmap option:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4090 | 1022 | 0 | 45.584 | 89.72 | 92.669 | 11.03 |
| 4090 | 1022 | 4090 | 45.918 | 89.07 | 63.816 | 16.01 |
| 4090 | 1022 | 8180 | 47.111 | 86.82 | 70.393 | 14.52 |
| 4090 | 1022 | 12270 | 47.460 | 86.18 | 65.261 | 15.66 |
| 4090 | 1022 | 16360 | 47.849 | 85.48 | 88.664 | 11.53 |
| 4090 | 1022 | 20450 | 48.241 | 84.78 | 89.413 | 11.43 |

-b 4090 -ub 4090 -> 4096

main: n_kv_max = 176640, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 45.342 | 90.34 | 72.740 | 14.08 |
| 4096 | 1024 | 4096 | 45.603 | 89.82 | 63.989 | 16.00 |
| 4096 | 1024 | 8192 | 78.511 | 52.17 | 65.520 | 15.63 |
| 4096 | 1024 | 12288 | 78.933 | 51.89 | 80.054 | 12.79 |
| 4096 | 1024 | 16384 | 79.055 | 51.81 | 81.476 | 12.57 |
| 4096 | 1024 | 20480 | 79.561 | 51.48 | 81.695 | 12.53 |

I'm late, but it's a good model at IQ3.

prompt eval time = 2765.75 ms / 45 tokens ( 61.46 ms per token, 16.27 tokens per second)
eval time = 693698.26 ms / 5435 tokens ( 127.64 ms per token, 7.83 tokens per second)


./build/bin/llama-server \
  --model "/mnt/ExtraStorage/Models/Kimi-K2.5-IQ3_K-00001-of-00012.gguf" \
  --alias "KimiK2.5IQ3" \
  --slot-save-path "/tmp/claw_cache/mem" \
  --prompt-cache "/tmp/claw_cache/mem/step_35_base.bin" \
  --prompt-cache-all \
  -c 32768 -ctk q8_0 -ctv q8_0 \
  -b 4096 \
  -amb 2048 \
  -mla 3 \
  -fa on \
  -ub 4096 \
  -ngl 99 \
  -sm graph \
  -gr \
  -smgs \
  -ger \
  --n-cpu-moe 99 \
  -ts 1,1 \
  --parallel 1 \
  --threads 42 \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --special \
  --mirostat 3 \
  --mirostat-lr 0.05

-b 4090 -ub 4090 -> 4096

main: n_kv_max = 176640, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 45.342 | 90.34 | 72.740 | 14.08 |
| 4096 | 1024 | 4096 | 45.603 | 89.82 | 63.989 | 16.00 |
| 4096 | 1024 | 8192 | 78.511 | 52.17 | 65.520 | 15.63 |
| 4096 | 1024 | 12288 | 78.933 | 51.89 | 80.054 | 12.79 |
| 4096 | 1024 | 16384 | 79.055 | 51.81 | 81.476 | 12.57 |
| 4096 | 1024 | 20480 | 79.561 | 51.48 | 81.695 | 12.53 |

Are the 'low' PP numbers normal with such a setup? I have the same system with 2x5090 and my PP numbers are almost the same. What's the bottleneck?

@kzoltan

Are the 'low' PP numbers normal with such a setup? I have the same system with 2x5090 and my PP numbers are almost the same. What's the bottleneck?

'low' relative to what? Kimi-K2.5 is about the biggest model people run at home, so keep that in mind. It also uses MLA attention which, while needing less VRAM to store the kv-cache, uses more compute, pretty sure.

How are you handling NUMA in your BIOS, e.g. SNC=Disable, to get as much RAM bandwidth as possible into a single NUMA node? (Use Intel mlc to check the details.) Assuming you have 2x 5090s each at full PCIe Gen 5 x16 lanes, you might want to use an unquantized kv-cache, i.e. the default -ctk f16 (no need to specify -ctv on an MLA model; it uses whatever -ctk is for both); that can possibly help. There is no -sm graph support for MLA models yet either, pretty sure, so give your full command here if you want to workshop it. I also recommend having a script for each start configuration and testing them with llama-sweep-bench, as shown above, to figure out the best strategy for your specific workload, e.g. "maximize PP". You might also be able to go up to -ub 8192 -b 8192 in some cases for a little more PP.

Sure, thanks for the offer :)

Relative to these results (although some of those machines are very different): https://www.reddit.com/r/LocalLLaMA/comments/1qriwnv/post_your_hardwaresoftwaremodel_quant_and/
My setup is a bit tricky as I use KVM (with 1G hugepages, CPU pinning, PCI passthrough). This should not cause a significant drop in performance.

For NUMA, I have a flat setup with a single node (command output from host):

numactl --hardware

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 515265 MB
node 0 free: 8921 MB
node distances:
node 0
0: 10

MLC shows this on the host (it is a few percent less on the guest):

./mlc --peak_injection_bandwidth

Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --peak_injection_bandwidth
Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
*** Unable to modify prefetchers (try executing 'modprobe msr')
*** So, enabling random access for latency measurements
Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 230077.0
3:1 Reads-Writes : 198895.5
2:1 Reads-Writes : 190655.3
1:1 Reads-Writes : 177761.8
Stream-triad like: 196297.9

I only have the 5090s on PCIe Gen4 x16 because of the need for risers.
-b 8192 -ub 8192 doubles the PP; I'm just trying to avoid it because of occasional shorter prompts (I'm not entirely sure this makes sense, though).

This is my test script:

#!/bin/bash

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2

MODEL_PATH=/mnt/1/models/ubergarm/Kimi-K2.5-GGUF/IQ3_K/Kimi-K2.5-IQ3_K-00001-of-00012.gguf

/mnt/1/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  --merge-qkv \
  -c 65565 \
  -ctk f16 \
  -amb 512 \
  -mla 3 \
  -gr \
  --threads 108 \
  -b 8192 -ub 8192 \
  -ngl 999 \
  -ot exps=CPU \
  --warmup-batch \
  -n 128

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 4096 | 128 | 0 | 46.866 | 87.40 | 9.848 | 13.00 |
| 4096 | 128 | 4096 | 47.433 | 86.35 | 10.083 | 12.69 |
| 4096 | 128 | 8192 | 47.595 | 86.06 | 10.269 | 12.47 |
| 4096 | 128 | 12288 | 47.960 | 85.40 | 10.389 | 12.32 |
| 4096 | 128 | 16384 | 48.446 | 84.55 | 10.357 | 12.36 |
| 4096 | 128 | 20480 | 48.888 | 83.78 | 10.370 | 12.34 |
| 4096 | 128 | 24576 | 49.204 | 83.25 | 10.187 | 12.57 |
| 4096 | 128 | 28672 | 49.569 | 82.63 | 9.937 | 12.88 |
| 4096 | 128 | 32768 | 52.001 | 78.77 | 10.103 | 12.67 |
| 4096 | 128 | 36864 | 50.271 | 81.48 | 10.108 | 12.66 |
| 4096 | 128 | 40960 | 50.708 | 80.78 | 10.305 | 12.42 |
| 4096 | 128 | 45056 | 51.073 | 80.20 | 10.541 | 12.14 |
| 4096 | 128 | 49152 | 51.475 | 79.57 | 10.904 | 11.74 |
| 4096 | 128 | 53248 | 51.808 | 79.06 | 10.898 | 11.75 |
| 4096 | 128 | 57344 | 52.415 | 78.15 | 11.126 | 11.50 |
| 4096 | 128 | 61440 | 52.563 | 77.93 | 10.928 | 11.71 |

/mnt/1/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  --merge-qkv \
  -c 65565 \
  -ctk f16 \
  -amb 512 \
  -mla 3 \
  -gr \
  --threads 108 \
  -b 8192 -ub 8192 \
  -ngl 999 \
  -ot exps=CPU \
  --warmup-batch \
  -n 128

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 8192 | 128 | 0 | 51.988 | 157.57 | 10.159 | 12.60 |
| 8192 | 128 | 8192 | 53.208 | 153.96 | 10.301 | 12.43 |
| 8192 | 128 | 16384 | 54.641 | 149.92 | 10.515 | 12.17 |
| 8192 | 128 | 24576 | 56.219 | 145.72 | 10.684 | 11.98 |
| 8192 | 128 | 32768 | 57.549 | 142.35 | 10.891 | 11.75 |
| 8192 | 128 | 40960 | 59.056 | 138.72 | 11.207 | 11.42 |
| 8192 | 128 | 49152 | 62.427 | 131.23 | 19.938 | 6.42 |
| 8192 | 128 | 57344 | 61.776 | 132.61 | 11.768 | 10.88 |
Owner • edited Feb 22

@kzoltan

Relative to these results (although some of those machines are very diff)

Many of those rigs have RTX 6000 PRO GPUs and/or are running sglang, it seems. Some of the llama.cpp reports from that reddit thread have much lower scores than yours. Even some of the faster reports suggest that throughput drops off quite a bit at long context, which I think makes sense given the additional MLA compute required.

Also, a wise man once said, "comparison is a bitch" 😅

How much speed do you need for your desired workloads? There are many good model options available now if you prefer high PP for processing long context etc. I'm guessing most people will want at least a couple models e.g. big Kimi-K2.5 or GLM-5 for slow grunt work laying out a new vibe code project that doesn't have much context yet. Then possibly smaller models like GLM-4.7-Flash or Qwen3-Coder-Next fully offloaded on GPU for zippy small refactoring jobs etc. Also don't sleep on the new Qwen3.5 MoE as it is a good blend of speed and quality imo.

MLC shows this on the host (it is a few percent less on the guest) 😀

Okay, about 230 GB/s read bandwidth is pretty good, and it will set the upper bound of your token-generation speed for most quants. PP, though, is mostly compute-bottlenecked, and possibly affected by the PCIe Gen4 link depending on whether it is doing the routed-expert offloading stuff (e.g. experiment with --offload-only-active-experts or -ooae, maybe: https://github.com/ikawrakow/ik_llama.cpp/pull/698). There are also advanced -cuda ... things I see people use, but I don't use them much myself.
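As a rough sanity check of that upper bound (the active parameter count and bits-per-weight below are ballpark assumptions on my part, not measured values):

```python
# Each generated token must stream the active weights from RAM once, so
# TG is roughly bounded by bandwidth / bytes-per-token. All inputs are guesses.
bandwidth = 230e9        # B/s, the mlc all-reads number above
active_params = 32e9     # ~32B active parameters per token for a big MoE
bits_per_weight = 3.5    # ballpark average for an IQ3_K mix

bytes_per_token = active_params * bits_per_weight / 8
print(f"~{bandwidth / bytes_per_token:.1f} tok/s upper bound")
```

The result lands in the same ballpark as the 12-16 t/s seen in the sweeps, which is why TG barely moves with GPU or batch-size changes.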

-b 8192 -ub 8192 doubles the PP, I'm just trying to avoid it because of occasional shorter prompts (I'm not entirely sure this makes sense though).

You might even be able to push up to -ub 16384 -b 16384, but personally I stick around -ub 4096 -b 4096 for testing. That said, I just started going up to 8k with Qwen3.5 MoE, as the extra prompt processing speed makes a difference when vibe coding over 65k context. Play around with it and see, because the extra PP likely outweighs the added latency on small batches for a typical vibe coding session.
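To put rough numbers on that tradeoff, here is the prompt-processing time for a hypothetical 60k-token context at the zero-depth S_PP rates from the two sweeps above (simple arithmetic only, ignoring the falloff at depth):

```python
prompt_tokens = 60_000
s_pp = {4096: 87.40, 8192: 157.57}   # t/s at depth 0 from the sweeps above

for ubatch, rate in s_pp.items():
    print(f"-ub {ubatch}: ~{prompt_tokens / rate:.0f} s to process the prompt")
```

Roughly eleven minutes versus six; for short prompts the absolute difference shrinks to seconds either way, so the larger ubatch mostly costs VRAM, not latency.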

This is my test script:

Thanks. Okay, I believe you copy-paste errored the first one, as its result shows a 4096 batch size, but I understand.

A few things to consider:

  1. You're all good: the single NUMA node makes this easier. Keep in mind, though, that the software is not really NUMA-optimized.
  2. You are likely leaving a lot of VRAM on the table: you have 2x5090 = 64GB total, you are not offloading additional routed-expert layers, and the kv-cache is already very efficient. You can check with nvidia-smi, nvitop, nvtop, or btop to confirm the large amount of unused VRAM.
  3. Have you searched around for the optimum settings for threads (used for TG) and threads-batch (used for PP)? Usually I set threads lower than threads-batch, as PP benefits from more threads but TG can become worse due to memory contention. If you only specify threads, it is used for both settings.
  4. You might be able to squeeze a little more speed out by increasing the -amb buffer to, say, 1024 or 2048 just to see, but don't expect much.
  5. I removed -gr as that is on by default now: https://github.com/ikawrakow/ik_llama.cpp/pull/1094
  6. I added -ger to try, but honestly I'm not sure which models it applies to.
  7. I added an explicit example of additional routed-expert layer offloads; increase it as far as possible until you OOM on VRAM.
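For points 2 and 7, you can rough-size how many extra routed-expert layers fit before touching the command. Every number here is an illustrative assumption, not a measurement: roughly 440 GB of routed-expert weights at ~3.5 bpw spread over 61 layers, and ~22 GB of usable headroom per 32 GB 5090 after weights and compute buffers.

```python
# Rough VRAM budgeting for extra exps layer offloads; all inputs are guesses.
exps_total_gb = 440          # assumed total size of all routed-expert tensors
n_moe_layers = 61
headroom_gb = 22             # assumed free VRAM per GPU after buffers

per_layer_gb = exps_total_gb / n_moe_layers   # ~7.2 GB per layer
print(int(headroom_gb // per_layer_gb), "extra layers per GPU")
```

With these guesses that works out to about three layers per GPU, which is why the example -ot overrides pick three layers each.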

So here is a potentially workshopped command you could try:

#!/bin/bash

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1,2

MODEL_PATH=/mnt/1/models/ubergarm/Kimi-K2.5-GGUF/IQ3_K/Kimi-K2.5-IQ3_K-00001-of-00012.gguf

/mnt/1/ik_llama.cpp/build/bin/llama-sweep-bench \
  --model "$MODEL_PATH" \
  -c 65565 \
  -mla 3 \
  -amb 1024 \
  -ctk f16 \
  -ger \
  --merge-qkv \
  -b 8192 -ub 8192 \
  -ngl 999 \
  -ot "blk\.(3|4|5)\.ffn_(gate|up|down)_exps.*=CUDA0" \
  -ot "blk\.(58|59|60)\.ffn_(gate|up|down)_exps.*=CUDA1" \
  -ot exps=CPU \
  --threads 96 \
  --threads-batch 108 \
  --no-mmap \
  --warmup-batch \
  -n 128

EDIT

Also, for actual llama-server use as a vibe-coding LLM, there are many cache options that might help you avoid re-processing similar-enough contexts.

I don't know if any of this stuff helps, but you get the idea of the kinds of things you could try or research:

    --slot-save-path ./slot-saves/ \
    --slot-prompt-similarity 0.1 \
    --cache-ram 65536 \
    --cache-ram-n-min 128 \
    --cache-ram-similarity 1 \
    --context-shift on

Thanks for the review and the settings. To give you a bit of context:

I have 4 GPUs total (5, but the last one is for displays only): 2x RTX Pro 6000 Max-Q on PCIe Gen5 x16 (they run Minimax M2.5 AWQ) and 2x 5090 on PCIe Gen4 x16 for anything else. At work I only have access to closed frontier models for coding, and good planning can make a huge difference even with those, so I had the idea of using the 5090s with the experts on the CPU to run a larger model for planning (while keeping Minimax loaded), exactly as you wrote. Kimi K2.5 is just the first try, as I saw great reports from people about its speed and capability. The planner model does not need to be super fast (~10 t/s TG is just fine), but faster PP is still very useful here (I do planning both at the start of a project AND on subtasks, so the context can grow to 40-80k).

The results with your initial script (with the GPU memory filled):

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 8192 | 128 | 0 | 49.897 | 164.18 | 8.878 | 14.42 |
| 8192 | 128 | 8192 | 51.648 | 158.61 | 9.357 | 13.68 |
| 8192 | 128 | 16384 | 52.976 | 154.64 | 9.372 | 13.66 |
| 8192 | 128 | 24576 | 54.710 | 149.74 | 9.533 | 13.43 |
| 8192 | 128 | 32768 | 56.085 | 146.07 | 9.708 | 13.19 |
| 8192 | 128 | 40960 | 57.727 | 141.91 | 9.868 | 12.97 |
| 8192 | 128 | 49152 | 59.296 | 138.15 | 10.099 | 12.67 |
| 8192 | 128 | 57344 | 60.923 | 134.46 | 10.308 | 12.42 |

What's interesting is that when doing PP, one GPU sits around 50% utilization and nvidia-smi shows ~10 GiB/s Rx; is this expected? Now that I've looked into this, it seems the second RTX Pro is only running at Gen4 x16. This might be an issue with the ES CPU (one user on the linked LocalLLaMA page reports far better PP on an Intel CPU with half the cores, though with different software).

Just to see how it looks, I ran the same test with the RTX Pro 6000 on Gen5 x16 (with approximately the same offloading):

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 8192 | 128 | 0 | 49.565 | 165.28 | 9.089 | 14.08 |
| 8192 | 128 | 8192 | 51.219 | 159.94 | 9.326 | 13.73 |
| 8192 | 128 | 16384 | 52.667 | 155.54 | 9.533 | 13.43 |
| 8192 | 128 | 24576 | 54.379 | 150.65 | 9.753 | 13.12 |
| 8192 | 128 | 32768 | 55.889 | 146.58 | 9.973 | 12.83 |
| 8192 | 128 | 40960 | 57.747 | 141.86 | 10.167 | 12.59 |
| 8192 | 128 | 49152 | 59.336 | 138.06 | 10.288 | 12.44 |
| 8192 | 128 | 57344 | 60.924 | 134.46 | 10.336 | 12.38 |

The results are not bad, but I still suspect there is a setup/hardware issue with my system (or maybe the ES CPU has some surprises).

@kzoltan

Oh, sounds like a very nice setup with multiple models! What client are you using for your vibe-coding harness? opencode or something else? Does it automatically make use of two models, e.g. a big slow planner and a smaller, faster model too?

What's interesting is that when doing PP, one GPU is around 50% utilization and nvidia-smi shows ~10GiB/s Rx, is this expected?

This is with Kimi-K2.5, right? It does not support -sm graph tensor parallel and uses the old -sm layer, with which usually only a single GPU at a time will max out, I think. I too like to watch with nvtop in one tmux pane to see what is going on. If you can use -sm graph with one of the many non-MLA models, both GPUs will max out at almost 95% most of the time.

Prompt processing tends to be compute-bottlenecked. Does your ES CPU have real avx512 instructions, specifically lscpu | grep avx512_vnni? ik_llama.cpp gives a big boost for PP with that; it will sometimes print out HAVE FANCY SIMD or something similar when detected.

I use opencode. Although I think all agents could be configured to use different models in opencode, I'm switching models manually for the main tasks for now and have only configured subagents with specific models.

Yes, this is still Kimi K2.5 IQ3_K (I will try others soon, especially with the graph mode). I think I need to rephrase. When prompt processing:

  • The CPU cores seem almost idle,
  • The first GPU is around 50% utilization (with ~10GiB/s Rx transfer rate),
  • The second GPU is utilized less than 10% (with very low transfer rates).

It seems like the GPUs are bottlenecked by something; I'm just not sure what. That would explain my lower PP results compared to others.

To answer your question about the instructions: the flag is there on both host and guest.

lscpu | grep --color=auto avx512_vnni # on the host

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm avx512_vp2intersect md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities ibpb_exit_to_user

$ lscpu | grep --color=auto avx512_vnni # on the guest
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm avx512_vp2intersect md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities

@kzoltan

The CPU cores seem almost idle,
The first GPU is around 50% utilization (with ~10GiB/s Rx transfer rate),
The second GPU is utilized less than 10% (with very low transfer rates).

Because MLA models like Kimi-K2.5, DeepSeek, and the new GLM-5 don't yet have -sm graph support, I think this is normal?

You'd have to check with some other multi-GPU folks, or take some extra time and maybe compare with SGLang and one of their special quants, etc.

I haven't yet run a good llama-sweep-bench test with Kimi-K2.5 on the dual RTX A6000 rig (Threadripper Pro Zen 4 with 24 cores, no avx512_vnni, 256GB DDR5-4800 at ~220GB/s RAM bandwidth) myself.

Also heads up, new Qwen models incoming: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF

Thank you for all your help!
Just to make sure, I ran a test on my KVM host to see if something was off with the virtualization, but the result is almost the same. The good news is that 1GB hugepages still make a difference, and with them the virtualization overhead is almost zero. I will try sglang eventually, but Kimi is a bit too large for my 512GB RAM, so I can't fit the version everyone else is using.

Is there a list of models supported by --graph parallel? Those new models might be useful for the same purpose...

Is there a list of models supported by --graph parallel? Those new models might be useful for the same purpose...

Yes, the best list is right here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L1910-L1926

ik is working more on -sm graph for gated delta net stuff as mentioned in this recent PR: https://github.com/ikawrakow/ik_llama.cpp/pull/1320
