Full Quality Q4_X
Hi - just FYI, the full-quality quant loads on llama.cpp, but I can't get disk offloading to work on ik_llama.cpp without a full crash. llama.cpp works at about 6 tok/s tg and ~10 pp (it speeds up after turn 3, so turn 1 is about 1.3 tok/s, then 3, then 6 as stable tg), using 512 GB RAM, 2x 3090s, and 2x 905P Optane drives as offload. (I'm not posting an error report since I don't think anyone uses mmap like this anyway.)
I'll leave this here if anyone wants to try disk offloading. I'm really just testing to see how engram would work off disk. Seems fine with Optane (2x 905Ps) and also T705s, but it's not the preferable method if ik_llama works; my build is a bit older, so maybe it will get solved. I got tired of the kernel crashes, so I'll move on to other models for now.
export LLAMA_SET_ROWS=1
./build/bin/llama-server \
    --model "/mnt/optane_kimi/Models2/Q4_X_FRESH/Kimi-K2.5-Q4_X-NEW-00001-of-00014.gguf" \
    --temp 1.0 \
    --min-p 0.01 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --seed 3407 \
    --fit on \
    --jinja \
    --batch-size 4096 \
    --ubatch-size 1024 \
    --threads 54 \
    --numa distribute \
    --host 0.0.0.0 \
    --port 8080
I didn't do the math, but ~512 GiB DDR + 48 GiB VRAM = ~560 GiB to fit a 543.617 GiB quant is cutting it pretty close.
Can you fit the whole thing pre-allocated in memory with --no-mmap, or why are you using mmap()? (Is that what you're calling "disk offloading"?) It sounds like the model is too big to fit into available memory, so you're leaving some of it hanging off onto disk?
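Back-of-the-envelope, using the numbers above (quant rounded up to 544 GiB; KV cache, compute buffers, and the OS all eat into this too, so the real margin is thinner):

```shell
# rough headroom estimate in GiB (integer math, quant rounded from 543.617)
ram=512; vram=48; quant=544
echo "headroom: $((ram + vram - quant)) GiB"
```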
I have run large models with the default mmap behavior, leaving the files on disk and letting the Linux kernel's page/disk cache juggle the routed exps as needed. Even with good PCIe Gen 5 NVMe drives there tends to be a bottleneck on random IOPS that probably not even Optane can bypass.
While running, watch btop or sudo iotop etc. to see how much disk I/O you're able to saturate. Usually the kswapd0 process will peg one core per CPU socket at 100% and saturate around 5 GiB/s.
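Besides iotop, a rough way to see how much of the weights the kernel is keeping hot is the Cached counter in /proc/meminfo (Linux only; with mmap'd weights most of the page cache will be the recently touched experts):

```shell
# report the kernel page cache size in GiB; /proc/meminfo reports KiB,
# so divide by 1024^2 to get GiB
awk '/^Cached:/ {printf "%.1f GiB in page cache\n", $2/1048576}' /proc/meminfo
```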
How are your NUMA nodes configured, and have you tried numactl --interleave=all llama-server --numa distribute ... and all that too?
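For reference, the invocation I mean looks something like this (a sketch; assumes numactl is installed, a dual-socket box, and that your build accepts these flags; the model path is a placeholder):

```shell
# interleave pages across all NUMA nodes at first touch so neither
# socket's memory controller becomes the bottleneck for the mmap'd weights
numactl --interleave=all ./build/bin/llama-server \
    --numa distribute \
    --model /path/to/model.gguf
```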
Odd that ik isn't working for you; I've been running the Kimi-K2.5 quants on a CPU-only compiled rig with 768 GB RAM and haven't had problems (using --no-mmap).
Anyway, enjoy tweaking your rig to get the max output! I have some slightly smaller quants of Kimi-K2.5 that are still very good and would probably be much easier to deal with, as they would fit into RAM for you.
Cheers and good seein' ya 'round!
Yup, about 4.6 GB/s on iotop during loading. And yes, I can't use --no-mmap because it doesn't fit into RAM + VRAM with context. I'm seeing about 600 MB/s max on iotop when generating text.
I'm testing whether it works well enough to use with mmap spilling over onto a fast-enough RAID 0. I mean, 1 million random read IOPS isn't great, but I don't think I'll be able to fit a 1.2-trillion-parameter model if DeepSeek V4 comes out with engram.
The funny thing is that because the experts keep getting reused in conversations that continue over multiple turns, it gets faster the more you use it (within the same topic). So if you can manage some warm-up, it would be usable over longer context lengths, especially with cache hits.
I was mostly waiting for your smaller quants for the last week :D so I thought I would experiment with models larger than RAM, to see whether people can use models larger than their RAM pools (which they can).
Oh, I figured it out: I have to remove --numa distribute to slow down the model load... I also removed swap space so that the model uses mmap only. I also played with swappiness settings before, which don't help (don't do it)... Honestly it's probably not an ik_llama issue, it's just how much pressure my kernel can take; maybe mainline llama.cpp works simply because it's so much slower.
Waiting for DeepSeek V4 with the engram stuff... got another 2x 905P Optane drives in the mail... hope 4x RAID 0 will be enough space so I stop abusing my T705 with constant downloads of all the great models you make. I'm off to test your other upload, stepfun3.5.
common_init_result: added [EOS] logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
llama_context: constructing llama_context
llama_context: setting new yarn_attn_factor = 1.0000 (mscale == 1.0, mscale_all_dim = 1.0)
llama_context: n_seq_max = 4
llama_context: n_ctx = 16384
llama_context: n_ctx_seq = 16384
llama_context: n_batch = 4096
llama_context: n_ubatch = 1024
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 50000.0
llama_context: freq_scale = 0.015625
llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 2.50 MiB
llama_kv_cache: CUDA0 KV buffer size = 36.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 1062.00 MiB
llama_kv_cache: size = 1098.00 MiB ( 16384 cells, 61 layers, 4/1 seqs), K (f16): 1098.00 MiB, V (f16): 0.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: CUDA0 compute buffer size = 6077.50 MiB
sched_reserve: CUDA1 compute buffer size = 724.00 MiB
sched_reserve: CUDA_Host compute buffer size = 120.02 MiB
sched_reserve: graph nodes = 4791
sched_reserve: graph splits = 240 (with bs=1024), 121 (with bs=1)
sched_reserve: reserve took 43.10 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 4
no implementations specified for speculative decoding
slot load_model: id 0 | task -1 | speculative decoding context not initialized
slot load_model: id 0 | task -1 | new slot, n_ctx = 16384
no implementations specified for speculative decoding
slot load_model: id 1 | task -1 | speculative decoding context not initialized
slot load_model: id 1 | task -1 | new slot, n_ctx = 16384
no implementations specified for speculative decoding
slot load_model: id 2 | task -1 | speculative decoding context not initialized
slot load_model: id 2 | task -1 | new slot, n_ctx = 16384
no implementations specified for speculative decoding
slot load_model: id 3 | task -1 | speculative decoding context not initialized
slot load_model: id 3 | task -1 | new slot, n_ctx = 16384
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use --cache-ram 0 to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|>Hello<|im_end|><|im_assistant|>assistant<|im_middle|>Hi there<|im_end|><|im_user|>user<|im_middle|>How are you?<|im_end|><|im_assistant|>assistant<|im_middle|>'
srv init: init: chat template, thinking = 0
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: Kimi K2
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 26
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 26, batch.n_tokens = 26, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 26, batch.n_tokens = 26
slot init_sampler: id 3 | task 0 | init sampler, took 0.00 ms, tokens: text = 26, total = 26
slot print_timing: id 3 | task 0 |
prompt eval time = 22700.45 ms / 26 tokens ( 873.09 ms per token, 1.15 tokens per second)
eval time = 29279.13 ms / 89 tokens ( 328.98 ms per token, 3.04 tokens per second)
total time = 51979.58 ms / 115 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 114, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Kimi K2
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.529 (> 0.100 thold), f_keep = 0.237
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 114, total state size = 7.642 MiB
srv load: - looking for better prompt, base f_keep = 0.237, sim = 0.529
srv update: - cache state: 1 prompts, 7.642 MiB (limits: 8192.000 MiB, 16384 tokens, 122206 est)
srv update: - prompt 0x6232539292c0: 114 tokens, checkpoints: 0, 7.642 MiB
srv get_availabl: prompt cache update took 44.93 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 3 | task 90 | processing task, is_child = 0
slot update_slots: id 3 | task 90 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 51
slot update_slots: id 3 | task 90 | n_tokens = 27, memory_seq_rm [27, end)
slot update_slots: id 3 | task 90 | prompt processing progress, n_tokens = 51, batch.n_tokens = 24, progress = 1.000000
slot update_slots: id 3 | task 90 | prompt done, n_tokens = 51, batch.n_tokens = 24
slot init_sampler: id 3 | task 90 | init sampler, took 0.01 ms, tokens: text = 51, total = 51
slot print_timing: id 3 | task 90 |
prompt eval time = 5550.84 ms / 24 tokens ( 231.29 ms per token, 4.32 tokens per second)
eval time = 642339.83 ms / 3137 tokens ( 204.76 ms per token, 4.88 tokens per second)
total time = 647890.67 ms / 3161 tokens
slot release: id 3 | task 90 | stop processing: n_tokens = 3187, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Kimi K2
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 2 | task 3228 | processing task, is_child = 0
slot update_slots: id 2 | task 3228 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 1262
slot update_slots: id 2 | task 3228 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 3228 | prompt processing progress, n_tokens = 1262, batch.n_tokens = 1262, progress = 1.000000
slot update_slots: id 2 | task 3228 | prompt done, n_tokens = 1262, batch.n_tokens = 1262
slot init_sampler: id 2 | task 3228 | init sampler, took 0.50 ms, tokens: text = 1262, total = 1262
slot print_timing: id 2 | task 3228 |
prompt eval time = 191523.31 ms / 1262 tokens ( 151.76 ms per token, 6.59 tokens per second)
eval time = 41455.89 ms / 226 tokens ( 183.43 ms per token, 5.45 tokens per second)
total time = 232979.20 ms / 1488 tokens
slot release: id 2 | task 3228 | stop processing: n_tokens = 1487, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Kimi K2
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.966 (> 0.100 thold), f_keep = 0.849
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 2 | task 3455 | processing task, is_child = 0
slot update_slots: id 2 | task 3455 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 1308
slot update_slots: id 2 | task 3455 | n_tokens = 1263, memory_seq_rm [1263, end)
slot update_slots: id 2 | task 3455 | prompt processing progress, n_tokens = 1308, batch.n_tokens = 45, progress = 1.000000
slot update_slots: id 2 | task 3455 | prompt done, n_tokens = 1308, batch.n_tokens = 45
slot init_sampler: id 2 | task 3455 | init sampler, took 0.53 ms, tokens: text = 1308, total = 1308
slot print_timing: id 2 | task 3455 |
prompt eval time = 19372.33 ms / 45 tokens ( 430.50 ms per token, 2.32 tokens per second)
eval time = 43134.57 ms / 199 tokens ( 216.76 ms per token, 4.61 tokens per second)
total time = 62506.90 ms / 244 tokens
slot release: id 2 | task 3455 | stop processing: n_tokens = 1506, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Kimi K2
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.981 (> 0.100 thold), f_keep = 0.869
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 2 | task 3655 | processing task, is_child = 0
slot update_slots: id 2 | task 3655 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 1335
slot update_slots: id 2 | task 3655 | n_tokens = 1309, memory_seq_rm [1309, end)
slot update_slots: id 2 | task 3655 | prompt processing progress, n_tokens = 1335, batch.n_tokens = 26, progress = 1.000000
slot update_slots: id 2 | task 3655 | prompt done, n_tokens = 1335, batch.n_tokens = 26
slot init_sampler: id 2 | task 3655 | init sampler, took 0.22 ms, tokens: text = 1335, total = 1335
slot print_timing: id 2 | task 3655 |
prompt eval time = 2428.91 ms / 26 tokens ( 93.42 ms per token, 10.70 tokens per second)
eval time = 74160.49 ms / 403 tokens ( 184.02 ms per token, 5.43 tokens per second)
total time = 76589.40 ms / 429 tokens
slot release: id 2 | task 3655 | stop processing: n_tokens = 1737, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
> I also removed swap space so that the model uses mmap only.
omg yes, disable swap; never use swap for running LLMs when you don't have enough RAM haha... I totally disable swap on my rig to avoid accidentally write-wearing my SSDs.
mmap() is read-only here, so it is fine to use, and yes, I like this "troll rig" method when the weights are too big. I have a video of it from back in my ktransformers days hitting almost 6 GB/s here: https://www.youtube.com/watch?v=4ucmn3b44x4
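One way to see that the weight mapping really is read-only (a diagnostic sketch; assumes a running llama-server and Linux, and that pgrep finds exactly one match). The r--p permission flags mean the kernel can just drop clean pages under memory pressure instead of writing anything back:

```shell
# inspect the server's mappings of the GGUF files; field 2 is permissions
# (r--p = read-only, private: evictable with no writeback), field 6 is the path
pid=$(pgrep -f llama-server)
grep gguf /proc/"$pid"/maps | head -3 | awk '{print $2, $6}'
```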
As for adding more Optane drives: in my experience, even with 4x T705s in RAID 0 in an ICY DOCK PCIe Gen 5 enclosure, it didn't help, and I couldn't break past what a single drive could offer. Here is the writeup on that, thanks to Wendell of the Level1Techs forum and YT channel: https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/
And yeah I'm excited to see what the next deepseek with engrams can do!
