Testing smol-IQ4_KSS
W790E Sage + QYFS + 512G + RTX5090
Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 170240
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.015625
llama_kv_cache_init: CUDA0 KV buffer size = 6061.01 MiB
llama_new_context_with_model: KV self size = 6060.98 MiB, c^KV (q8_0): 6060.98 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11347.85 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2771.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122
main: n_kv_max = 170240, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 52.482 | 77.93 | 86.375 | 11.83 |
| 4090 | 1022 | 4090 | 52.895 | 77.32 | 82.629 | 12.37 |
| 4090 | 1022 | 8180 | 83.991 | 48.70 | 80.144 | 12.75 |
| 4090 | 1022 | 12270 | 54.705 | 74.77 | 77.761 | 13.14 |
| 4090 | 1022 | 16360 | 54.601 | 74.91 | 96.437 | 10.60 |
Why do you use -t 101 on the 56Core QYFS CPU?
Have you tried e.g. --threads 48 --threads-batch 56, which I assume would do better? Unless we already had this discussion on another thread haha... Generally SMT/Hyperthreading doesn't help, or actually hurts speed and makes more heat. Also, using a power of 2 feels nicer and might have some benefit, but maybe I'm just superstitious lol.
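If it helps, one way to settle the thread-count question empirically is a quick sweep (just a sketch, assuming a built llama-bench binary; the model path and prompt/gen sizes are placeholders):

```shell
# hypothetical thread-count sweep; adjust the model path and sizes to taste
for t in 32 48 56 64 101; do
  ./llama-bench -m /path/to/model.gguf -t "$t" -p 512 -n 128
done
```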
Why do you not use -ctv q8_0 ?
Performance-wise it's not any better, so is it because of stability?
Thanks for sharing more details; it's always interesting to see how the various model/quantization/hardware combinations work best in practice. benchmark benchmark benchmark! haha
Why do you not use -ctv q8_0 ?
So for any MLA model like DeepSeek 671B or Kimi-K2, you only need to specify -ctk q8_0: because the kv-cache is in latent space there is no separate V cache, so you don't need -ctv q8_0 too; it's already handled together.
In some of my own testing, I've noticed that when you keep all of the kv-cache in VRAM/on GPU it can actually be faster to leave the kv-cache at full f16 size (GLM-4.5 models especially seem slower with q8_0 kv-cache).
Quality-wise, q8_0 is barely measurably "worse" in perplexity than full f16, though, and is definitely a good way to save VRAM by compressing the kv-cache in many configurations.
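So in practice, for an MLA model the cache-related flags reduce to something like this (just a sketch; the model path and the rest of the flags are placeholders):

```shell
# -mla 3 keeps the kv-cache in latent space, so there is no separate V cache;
# -ctk q8_0 alone quantizes it, and -ctv would be redundant here
./llama-server -m /path/to/model.gguf -fa -mla 3 -ctk q8_0
```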
Cool voxel engine demo!
I took three paragraphs from "Lancelot, The Knight of the Cart" (https://www.heroofcamelot.com/docs/Lancelot-Knight-of-the-Cart.pdf) and asked Kimi K2 Instruct 0905 smol IQ4_KSS to summarize them. I get gibberish for the answer. For example, I got ", the two companies are the same. The two companies are the same. The two companies are the same. The two companies are the same. ..." repeated endlessly as the answer.
I get a better answer when I feed it one paragraph at a time. I don't know why it has a hard time when I give it two pages to summarize. In the llama.cpp arguments, I gave it plenty of context length (32K) for a two-page summary. I used the following ik_llama.cpp command:
./llama-server -m "/.../.lmstudio/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ4_KSS-00001-of-00011.gguf" -fa -fmoe -mla 3 -amb 512 -ctk q8_0 -ctv q8_0 -b 4096 -ub 4096 -c 32768 -t 92 -ngl 65 -ot "blk.(4).ffn_.*=CUDA0" -ot exps=CPU --host 0.0.0.0 --port 11234
I also get similar gibberish when I ask DeepSeek-V3.1-IQ4_K to summarize two pages from "Lancelot, The Knight of the Cart." If I add a repeat penalty, I get "1, 2, 3, 4, 5, ..." instead. Am I doing something wrong, or are Kimi-K2 and DeepSeek V3.1 just not smart enough? ChatGPT5, Grok3, and Gemini 2.5 have no problem summarizing the same texts.
I just removed "-b 4096 -ub 4096" from the arguments, and I'm not getting the gibberish any more. Does anyone know the reason?
I've heard of some other folks having issues going above -ub 2048 -b 2048 batch sizes. The defaults are -ub 512 -b 2048 if you don't specify anything. Increasing batch sizes can improve aggregate PP throughput at the cost of some extra VRAM and latency for small chats.
There is some discussion about it here: https://discord.com/channels/1238219753324281886/1399386495495835762/1417189589386264691 (discord join link here if u interested: https://huggingface.co/BeaverAI) that was specific to GLM-4.5 though.
I've heard some folks report success up to -ub 16384 -b 16384, but I myself avoid going above 4096. It probably depends on the exact hardware configuration etc.
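As a concrete reference point (the values are just the ones discussed above; the model path is a placeholder):

```shell
# defaults when unspecified: -b 2048 -ub 512
# larger batches raise aggregate PP throughput but cost extra VRAM, and some
# models reportedly misbehave above 2048-4096
./llama-server -m /path/to/model.gguf -b 4096 -ub 4096
```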
Otherwise, your samplers may possibly come into play, e.g. stuff like:
--jinja \
--temp 0.6 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 20 \
--repeat-penalty 1.5 \
You might be setting all of that from the client side; I'm not sure how you're using llama-server in this case.
But yeah Kimi-K2 can easily summarize a couple pages no problemo when it is working properly!
When I load Kimi-K2-Instruct-0905-smol-IQ4_KSS into RAM, the memory throughput is about 1,300 MB/s. When Kimi K2 is answering my questions using CPU+GPU, the memory throughput is 16,000 MB/s. When I use only the CPU for inference, it is 21,000 MB/s. I'm using 8-channel DDR5-5600 memory. This happens with all models, not only Kimi-K2.
Why is the memory throughput so low when loading the model? Is there a way for me to speed up the loading?
Without knowing the exact details of your NUMA node configuration, how you're measuring disk and memory i/o speeds (e.g. btop, netdata, iotop, AMD's e_smi_tool, etc.), the exact command, the disk RAID configuration, etc., I can't answer with complete certainty.
My initial thought is that your memory bandwidth is not the bottleneck when loading a large LLM off of disk cold; your disk i/o is. Do you have an older PCIe Gen 3 NVMe drive or spinning-rust hard drives holding your quants? Some folks keep a "slush drive" that is PCIe Gen 5 NVMe, e.g. a T700 or similar, for their most-used models so they load faster, and slower drives for archiving older unused models. Also, it should load faster the second time, assuming you're on Linux with enough unused RAM for the page cache to hold it, so it does memory-to-memory i/o; but if you're using almost all your RAM for inference, it will fall back to disk reads, which are much slower than RAM.
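One way to check whether the weights are actually resident in page cache before a second load (a sketch, assuming the vmtouch utility is installed; paths are placeholders):

```shell
# report what fraction of each shard is already sitting in page cache
vmtouch /path/to/models/*.gguf
# optionally pre-warm the cache before launching the server
vmtouch -t /path/to/models/*.gguf
```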
EDIT You can test disk sequential and random read i/o using tools like fio e.g. this non-destructive test will give you an estimate of sequential reads from your drive containing the quants which would be an upper bound on load speeds to expect:
# replace nvme0n1 with your actual block device e.g. `md0` etc...
$ sudo fio \
--filename=/dev/nvme0n1 \
--readonly \
--rw=read \
--direct=1 \
--bs=1M \
--ioengine=libaio \
--runtime=60 \
--numjobs=24 \
--time_based=1 \
--group_reporting \
--name=SEQREAD_1M \
--iodepth=32
# for reference, the fastest quad RAID0 array of Gen 5 NVMe T705s i've benched gets about 40GB/s sequential reads:
Run status group 0 (all jobs):
READ: bw=37.8GiB/s (40.6GB/s), 37.8GiB/s-37.8GiB/s (40.6GB/s-40.6GB/s), io=2268GiB (2435GB), run=60036-60036msec
@ubergarm
When I asked you the question, I measured the memory throughput using "sudo /home/geveent/pcm/build/bin/pcm-memory 1". I have 512GB total RAM across 8 channels, so I don't have to keep any part of the model on the NVMe SSD. I'm using a Samsung 990 Evo Plus on a Gigabyte MS73-HB1, running Ubuntu 24.04. It takes about 9 minutes to load ubergarm/Kimi-K2-Instruct-0905-smol-IQ4_KSS. I thought that was slow, which is why I asked. I get 60 t/s prompt eval and 16 t/s generation.
When I ran your fio command, I got:
Run status group 0 (all jobs):
READ: bw=3507MiB/s (3678MB/s), 3507MiB/s-3507MiB/s (3678MB/s-3678MB/s), io=206GiB (222GB), run=60237-60237msec
Disk stats (read/write):
nvme0n1: ios=1686977/202, sectors=431866112/6176, merge=0/74, ticks=365555853/29184, in_queue=365585096, util=99.94%
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 … 119
node 0 size: 515416 MB
/dev/nvme0n1 Samsung SSD 990 EVO Plus 4TB
/dev/nvme1n1 Samsung SSD 990 EVO Plus 4TB
/dev/nvme0n1p2 on / type ext4 (rw,relatime)
/dev/nvme0n1p1 on /boot/efi type vfat …
I guess the bottleneck is my PCIe Gen 4 interface on my motherboard and my SSD.
Ahh you are using intel/pcm performance counter monitor to measure memory bandwidth usage, very cool! Thanks for that tip.
So it looks like you have two Gen 4 NVMe drives to hold your models. If you have a different boot drive holding your operating system, you could use mdadm to make a software RAID0 array of those two drives, which should let you load roughly twice as fast. I have some example commands on the level1techs forum here, but those are destructive commands, so make sure you know what you're getting into.
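The rough shape of that is below, but note these are destructive, hypothetical commands: the device names are placeholders, and --create wipes both drives, so double-check everything against your own setup and the forum post first.

```shell
# DESTRUCTIVE sketch: stripe two empty NVMe drives into a RAID0 array
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvmeXn1 /dev/nvmeYn1
sudo mkfs.ext4 /dev/md0          # new filesystem on the array
sudo mount /dev/md0 /mnt/models  # mount point is arbitrary
```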
For example, that specific ubergarm/Kimi-K2-Instruct-0905-smol-IQ4_KSS is 485.008 GiB. Your fio result suggests sequential reads on a single NVMe drive cap out at around 3678 MB/s. So in the absolute best case it would take about 520 GB / 3.7 GB/s ≈ 140 seconds to load. But you're reporting 9 minutes to load, which does seem slow. hrmm...
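That back-of-envelope number redone as a one-liner (a hypothetical helper, just the arithmetic above):

```shell
# ~520 GB model over ~3.7 GB/s measured sequential read -> best-case seconds
awk -v m=520 -v d=3.7 'BEGIN { printf "~%d s best-case load time\n", int(m/d) }'
# prints: ~140 s best-case load time
```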
After you are loaded, it seems like the speeds are fine, but where is the bottleneck during loading?
You could try running something like the https://github.com/netdata/netdata static binary build and watching it in a locally running webapp to visually monitor things and find your bottleneck. You have a single NUMA node, but I'm not 100% sure how the PCIe Gen 4 lanes to each NVMe drive are routed, e.g. directly to the CPU or through an i/o bridge; you'd have to check the motherboard manual to see if you plugged them into the best available m.2 slots etc.
Let us know if you find a way to speed loading the model weights into RAM!




