Need help getting faster speeds.

#4
by SFPLM - opened

Hey, I am a beginner at this, trying out the IQ4 quant with 1x 3090 and 512 GB RAM on an Intel Xeon Platinum 8480 ES (QYFS, 56 cores).

I build ik_llama.cpp with:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build ./build --config Release -j $(nproc)

I launch the server with:

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/.../ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -ctv q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    -rtr \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 56 \
    --host 127.0.0.1 \
    --port 8080

I get the speeds below:

INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924132 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =    1009.41 ms /    10 tokens (  100.94 ms per token,     9.91 tokens per second) | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 t_prompt_processing=1009.41 n_prompt_tokens_processed=10 t_token=100.941 n_tokens_second=9.906777226300512
INFO [           print_timings] generation eval time =    1280.22 ms /    12 runs   (  106.69 ms per token,     9.37 tokens per second) | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 t_token_generation=1280.221 n_decoded=12 t_token=106.68508333333334 n_tokens_second=9.37338162707845
INFO [           print_timings]           total time =    2289.63 ms | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 t_prompt_processing=1009.41 t_token_generation=1280.221 t_total=2289.631
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 n_ctx=32768 n_past=21 n_system_tokens=0 n_cache_tokens=21 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924135
INFO [      log_server_request] request | tid="134451669495808" timestamp=1750924135 remote_addr="127.0.0.1" remote_port=33058 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924135
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750924140 id_slot=0 id_task=14
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924140 id_slot=0 id_task=14 p0=21
INFO [           print_timings] prompt eval time     =     442.20 ms /     6 tokens (   73.70 ms per token,    13.57 tokens per second) | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 t_prompt_processing=442.197 n_prompt_tokens_processed=6 t_token=73.6995 n_tokens_second=13.568613084213597
INFO [           print_timings] generation eval time =    3623.74 ms /    32 runs   (  113.24 ms per token,     8.83 tokens per second) | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 t_token_generation=3623.739 n_decoded=32 t_token=113.24184375 n_tokens_second=8.830658057878892
INFO [           print_timings]           total time =    4065.94 ms | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 t_prompt_processing=442.197 t_token_generation=3623.739 t_total=4065.936
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 n_ctx=32768 n_past=58 n_system_tokens=0 n_cache_tokens=58 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924144
INFO [      log_server_request] request | tid="134451661103104" timestamp=1750924144 remote_addr="127.0.0.1" remote_port=33064 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924144
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750924161 id_slot=0 id_task=48
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924161 id_slot=0 id_task=48 p0=58
INFO [      log_server_request] request | tid="134451652710400" timestamp=1750924214 remote_addr="127.0.0.1" remote_port=53082 status=200 method="GET" path="/" params={}
INFO [           print_timings] prompt eval time     =    1115.50 ms /    15 tokens (   74.37 ms per token,    13.45 tokens per second) | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 t_prompt_processing=1115.499 n_prompt_tokens_processed=15 t_token=74.3666 n_tokens_second=13.446896859611709
INFO [           print_timings] generation eval time =   99090.24 ms /   865 runs   (  114.56 ms per token,     8.73 tokens per second) | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 t_token_generation=99090.236 n_decoded=865 t_token=114.55518612716763 n_tokens_second=8.72941709413226
INFO [           print_timings]           total time =  100205.74 ms | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 t_prompt_processing=1115.499 t_token_generation=99090.236 t_total=100205.735
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 n_ctx=32768 n_past=937 n_system_tokens=0 n_cache_tokens=937 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924261
INFO [      log_server_request] request | tid="134451753373696" timestamp=1750924261 remote_addr="127.0.0.1" remote_port=54154 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924261
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750924580 id_slot=0 id_task=915
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924580 id_slot=0 id_task=915 p0=937
INFO [           print_timings] prompt eval time     =    1559.26 ms /    25 tokens (   62.37 ms per token,    16.03 tokens per second) | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 t_prompt_processing=1559.258 n_prompt_tokens_processed=25 t_token=62.37032 n_tokens_second=16.033267105251344
INFO [           print_timings] generation eval time =   80799.59 ms /   686 runs   (  117.78 ms per token,     8.49 tokens per second) | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 t_token_generation=80799.59 n_decoded=686 t_token=117.78365889212827 n_tokens_second=8.490142091067542
INFO [           print_timings]           total time =   82358.85 ms | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 t_prompt_processing=1559.258 t_token_generation=80799.59 t_total=82358.848
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 n_ctx=32768 n_past=1647 n_system_tokens=0 n_cache_tokens=1647 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924662
INFO [      log_server_request] request | tid="134451644317696" timestamp=1750924662 remote_addr="127.0.0.1" remote_port=35684 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924662
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925002 id_slot=0 id_task=1603
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925002 id_slot=0 id_task=1603 p0=8


INFO [           print_timings] prompt eval time     =    9485.77 ms /   435 tokens (   21.81 ms per token,    45.86 tokens per second) | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 t_prompt_processing=9485.769 n_prompt_tokens_processed=435 t_token=21.806365517241378 n_tokens_second=45.85816922170464
INFO [           print_timings] generation eval time =   11827.47 ms /   101 runs   (  117.10 ms per token,     8.54 tokens per second) | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 t_token_generation=11827.468 n_decoded=101 t_token=117.10364356435645 n_tokens_second=8.539443945229866
INFO [           print_timings]           total time =   21313.24 ms | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 t_prompt_processing=9485.769 t_token_generation=11827.468 t_total=21313.237
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 n_ctx=32768 n_past=543 n_system_tokens=0 n_cache_tokens=543 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925024
INFO [      log_server_request] request | tid="134451635924992" timestamp=1750925024 remote_addr="127.0.0.1" remote_port=48162 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925024
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925068 id_slot=0 id_task=1706
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925068 id_slot=0 id_task=1706 p0=8
INFO [      log_server_request] request | tid="134451627532288" timestamp=1750925102 remote_addr="127.0.0.1" remote_port=33158 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925102 id_slot=0 id_task=1706 n_ctx=32768 n_past=1668 n_system_tokens=0 n_cache_tokens=1668 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925102
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925110 id_slot=0 id_task=1710
INFO [            update_slots] we have to evaluate at least 1 token to generate logits | tid="134454010454016" timestamp=1750925110 id_slot=0 id_task=1710
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925110 id_slot=0 id_task=1710 p0=1666
INFO [      log_server_request] request | tid="134451619139584" timestamp=1750925118 remote_addr="127.0.0.1" remote_port=44816 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925118 id_slot=0 id_task=1710 n_ctx=32768 n_past=1731 n_system_tokens=0 n_cache_tokens=1731 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925118
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925133 id_slot=0 id_task=1777
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925133 id_slot=0 id_task=1777 p0=1730
INFO [           print_timings] prompt eval time     =    1074.76 ms /    13 tokens (   82.67 ms per token,    12.10 tokens per second) | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 t_prompt_processing=1074.764 n_prompt_tokens_processed=13 t_token=82.67415384615384 n_tokens_second=12.095678679226324
INFO [           print_timings] generation eval time =   80388.76 ms /   662 runs   (  121.43 ms per token,     8.23 tokens per second) | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 t_token_generation=80388.76 n_decoded=662 t_token=121.43317220543805 n_tokens_second=8.234982104463361
INFO [           print_timings]           total time =   81463.52 ms | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 t_prompt_processing=1074.764 t_token_generation=80388.76 t_total=81463.52399999999
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 n_ctx=32768 n_past=2404 n_system_tokens=0 n_cache_tokens=2404 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925215
INFO [      log_server_request] request | tid="134451610746880" timestamp=1750925215 remote_addr="127.0.0.1" remote_port=48836 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925215
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=2441
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=2441 p0=1979
INFO [           print_timings] prompt eval time     =   10217.66 ms /   436 tokens (   23.43 ms per token,    42.67 tokens per second) | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 t_prompt_processing=10217.656 n_prompt_tokens_processed=436 t_token=23.434990825688075 n_tokens_second=42.67123496817665
INFO [           print_timings] generation eval time =   29813.12 ms /   244 runs   (  122.18 ms per token,     8.18 tokens per second) | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 t_token_generation=29813.119 n_decoded=244 t_token=122.18491393442622 n_tokens_second=8.18431644136261
INFO [           print_timings]           total time =   40030.78 ms | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 t_prompt_processing=10217.656 t_token_generation=29813.119 t_total=40030.775
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 n_ctx=32768 n_past=2658 n_system_tokens=0 n_cache_tokens=2658 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925255
INFO [      log_server_request] request | tid="134451610746880" timestamp=1750925255 remote_addr="127.0.0.1" remote_port=48836 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925255
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925283

I also tried with 2x 3090, and at both 32k and 64k context the generation speed is similar to the 1x 3090 numbers above.

The PP feels too slow. I was hoping for at least 9.5 t/s generation and perhaps much better PP speeds. I am not sure if these PP speeds are expected or how to get better numbers, as I would like to see whether I can comfortably run DeepSeek given my high RAM and a supposedly good CPU. Is the 3090 the bottleneck, or is something missing from my launch arguments?

How can I further improve this?

@SFPLM

Glad you got it going and can now begin to optimize for your exact system. There is a lot of info out there already, so check the discussions on the ik_llama.cpp GitHub and other threads here.

First off, compile like this, especially when using 2x 3090 Ti. GGML_SCHED_MAX_COPIES=1 makes VRAM allocation easier to reason about and lets you increase batch sizes for more PP. GGML_CUDA_IQK_FORCE_BF16 may increase speed on 3090s and also prevents numerical issues specific to DeepSeek MLA.

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)

Next, how is your NUMA configured? I assume this is a single-socket system? Make sure the BIOS is configured to present only a single NUMA node, e.g. numactl --hardware shows only one. No llama.cpp fork has optimizations here, and you don't have enough RAM to try ktransformers with USE_NUMA=1, which doubles RAM usage (one copy per CPU socket/NUMA node).

If you can't get a single NUMA node, let me know and you can use some numactl tricks to glue them together and get the best you can.
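As a quick check, here is a small sketch that counts NUMA nodes via Linux sysfs (an assumption on my part; numactl --hardware reports the same information if it is installed):

```shell
# Count NUMA nodes exposed by the kernel; llama.cpp CPU inference
# runs best when only one node is present.
nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
if [ "$nodes" -le 1 ]; then
    echo "single NUMA node: good"
else
    echo "$nodes NUMA nodes: consider a single-node BIOS setting or numactl interleaving"
fi
```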

Regarding your command, try this. I will assume you are using 2x 3090 Ti, which gives more VRAM to offload additional layers:

./build/bin/llama-server \
    --model /mnt/.../ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -ub 2048 -b 2048 \
    --parallel 1 \
    --threads 56 \
    --host 127.0.0.1 \
    --port 8080

Adjust the number of additional ffn layers offloaded to CUDA0 and CUDA1 based on whether you OOM or not. Basically crank it up until you OOM, then dial back by one. Increasing batch sizes takes a little more VRAM but gives a lot more PP.
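You can also budget this on paper before launching. The sketch below uses illustrative figures (roughly 6.4 GiB per offloaded IQ4_K_R4 ffn block, plus the KV and compute buffer sizes a 32k q8_0 / -ub 2048 setup might report); read the real numbers from your own llm_load_tensors and llama_kv_cache_init log lines:

```shell
# Back-of-envelope VRAM budget for one 24 GiB 3090.
# All sizes below are illustrative assumptions, not measured values.
vram_mib=24576      # 3090 total VRAM
base_mib=9057       # attention/shexp/non-repeating layers already on GPU
kv_mib=593          # KV cache at 32k ctx, q8_0
compute_mib=4472    # compute buffer at -ub 2048
per_ffn_mib=6444    # one extra IQ4_K_R4 ffn block
free=$(( vram_mib - base_mib - kv_mib - compute_mib ))
echo "extra ffn blocks that fit: $(( free / per_ffn_mib ))"
```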

When you want to benchmark speeds, I advise using llama-sweep-bench. Simply replace llama-server with llama-sweep-bench and add --warmup-batch; it will ignore the alias/host/port arguments fine.

Finally, use the most recent version of ik_llama.cpp, as PP just got a boost for the _R4 quants running on CUDA like you are doing.

TG will be limited by your RAM bandwidth and, in my experience, can come in below the theoretical maximum on Intel rigs even when mlc (Intel Memory Latency Checker) reads higher bandwidth.

Cheers and keep us all posted!

@ubergarm

Glad to hear it, and I appreciate the advice. After a bit of tinkering, I can't seem to get it to load more than one expert layer onto each GPU. I have tried -ts 23,24 and the V cache at q8_0 (-ctv q8_0), and it just won't fit.

For example

Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
... (same pattern for the remaining layers)
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size =  9218.28 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 31988.10 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 21945.34 MiB
llm_load_tensors:      CUDA1 buffer size = 21782.68 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   592.89 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   573.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4472.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4689240064
llama_new_context_with_model: failed to allocate compute buffers

The above also happens if I attempt to add two layers. I also tried 3 on CUDA0 and 4 + 5 on CUDA1; CUDA1 OOMs.

The fastest configuration that works so far is:

./build/bin/llama-sweep-bench \
    --model /mnt/.../models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    -ts 23,24 \
    --n-gpu-layers 63 \
    -ot "blk\.(3)\.ffn_.*=CUDA0" \
    -ot "blk\.(4)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -ub 2048 -b 2048 \
    -ser 6,1 \
    --parallel 1 \
    --threads 56

gets me:

llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size = 22726.83 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 31988.10 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 15500.99 MiB
llm_load_tensors:      CUDA1 buffer size = 15235.03 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = 6, 1
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   592.89 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   573.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  4472.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  3560.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
llama_new_context_with_model: graph nodes  = 8245
llama_new_context_with_model: graph splits = 149

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 56, n_threads_batch = 56
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 | 0 | 40.892 | 50.08 | 49.042 | 10.44 |
| 2048 | 512 | 2048 | 40.985 | 49.97 | 48.505 | 10.56 |
| 2048 | 512 | 4096 | 41.221 | 49.68 | 47.514 | 10.78 |
| 2048 | 512 | 6144 | 42.020 | 48.74 | 50.148 | 10.21 |
| 2048 | 512 | 8192 | 42.275 | 48.44 | 51.781 | 9.89 |
| 2048 | 512 | 10240 | 42.688 | 47.98 | 51.415 | 9.96 |
| 2048 | 512 | 12288 | 43.135 | 47.48 | 52.508 | 9.75 |
| 2048 | 512 | 14336 | 49.251 | 41.58 | 49.770 | 10.29 |
| 2048 | 512 | 16384 | 43.725 | 46.84 | 54.991 | 9.31 |
| 2048 | 512 | 18432 | 44.590 | 45.93 | 55.337 | 9.25 |
| 2048 | 512 | 20480 | 44.006 | 46.54 | 60.230 | 8.50 |
| 2048 | 512 | 22528 | 44.505 | 46.02 | 58.256 | 8.79 |
| 2048 | 512 | 24576 | 45.575 | 44.94 | 55.449 | 9.23 |
| 2048 | 512 | 26624 | 45.942 | 44.58 | 59.890 | 8.55 |
| 2048 | 512 | 28672 | 46.320 | 44.21 | 56.559 | 9.05 |
| 2048 | 512 | 30720 | 75.212 | 27.23 | 66.060 | 7.75 |

Using 6 experts instead of 8 does bring the speed up.

With 8 experts I think it goes like:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 | 0 | 43.768 | 46.79 | 58.812 | 8.71 |
| 2048 | 512 | 2048 | 43.994 | 46.55 | 60.076 | 8.52 |
| 2048 | 512 | 4096 | 44.731 | 45.79 | 59.162 | 8.65 |
| 2048 | 512 | 6144 | 45.756 | 44.76 | 61.601 | 8.31 |
| 2048 | 512 | 8192 | 44.844 | 45.67 | 62.035 | 8.25 |
| 2048 | 512 | 10240 | 44.996 | 45.52 | 62.541 | 8.19 |
| 2048 | 512 | 12288 | 45.741 | 44.77 | 64.000 | 8.00 |
| 2048 | 512 | 14336 | 45.896 | 44.62 | 65.038 | 7.87 |
| 2048 | 512 | 16384 | 48.384 | 42.33 | 70.296 | 7.28 |
| 2048 | 512 | 18432 | 53.666 | 38.16 | 66.915 | 7.65 |
| 2048 | 512 | 20480 | 46.520 | 44.02 | 66.358 | 7.72 |
| 2048 | 512 | 22528 | 47.653 | 42.98 | 67.443 | 7.59 |
| 2048 | 512 | 24576 | 63.603 | 32.20 | 72.567 | 7.06 |
| 2048 | 512 | 26624 | 51.910 | 39.45 | 74.521 | 6.87 |
| 2048 | 512 | 28672 | 49.094 | 41.72 | 71.302 | 7.18 |
| 2048 | 512 | 30720 | 49.411 | 41.45 | 72.487 | 7.06 |

  1. Is it time to look into overclocking or squeezing more RAM speed?
  2. Or is it the 3090's fault (I am on an EVGA 3090 FTW3, not a 3090 Ti, if that makes any difference)? Or is it that I have the monitor plugged into it (my mobo is an Asus W790 Sage SE)?
  3. EDIT: Testing a 4096 batch with no experts on the GPU gets around 75-90 PP t/s and around 8-9.3 t/s generation.
     I may also look into a 1x 3090 setup, since as of this moment my own PC could use a GPU for my own stuff. My guess, though, is that for more PP and max response length I can't offload any experts to the 3090. Darn, I want to load more layers, but why, 3090, why?

main: n_kv_max = 32768, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 63, n_threads = 56, n_threads_batch = 56

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 8192 | 2048 | 0 | 52.480 | 156.10 | 206.261 | 9.93 |
| 8192 | 2048 | 8192 | 56.717 | 144.44 | 228.645 | 8.96 |
| 8192 | 2048 | 16384 | 62.132 | 131.85 | 223.730 | 9.15 |
| 8192 | 2048 | 24576 | 71.045 | 115.31 | 240.698 | 8.51 |

Also tried this config. When I use llama-server with a test prompt of my own, I get 46 PP / 10.55 gen t/s, and on a follow-up prompt 35 PP / 10 gen. Not sure if this is in line with the bench.
--ctx-size 32768
-ctk q8_0
-mla 3 -fa
-amb 512
-fmoe
--temp 0.3
--min-p 0.05
-ts 23,24
-ot "blk\.(3)\.ffn_.*=CUDA1"
--n-gpu-layers 63
-ot exps=CPU
-ub 8192 -b 8192
-ser 6,1
--parallel 1
--threads 56

Looks like you have a very similar config to mine, an Asus Pro WS W790; and judging by the 56-thread count, I think you have a QYFS?

@mtcl I saw some of your videos, which were a source of inspiration.

I think my build has some similarities with yours: Intel Xeon 8480 ES (QYFS, 56 cores), Asus W790 Sage SE, 512 GB DDR5 (rated 5600, but after some BIOS fiddling it still sets itself to 4800; do you know if we can push speeds higher?).

But I only have two 3090s on hand as of now (I will eventually try to get something better, but until my "situation improves" I will see if I can make it work with the 3090s).

I think that 4800 lock comes from the processor being an engineering sample, and no, I haven't been able to overclock it beyond 4800 either. But it's OK for now, I think.

I know every use case is different and I shouldn't generalize, but I've noticed that Qwen3-235B outperforms my expectations and works very well for my use case. I run it at 128k context length and it's able to do 400 PP t/s. Note that you have to use a special 128k variant of the model for that. If you're OK with a 40k token limit, then @ubergarm's model is the best with its IQ3 quant. Very fast and very stable.

2x 3090 is a very potent setup! I would stay with them, or buy a couple more 3090s instead of getting a 4090 or 5090.

And I'm glad you're coming over from the channel :) this ik_llama community is the best!

@mtcl

Have you been able to test DeepSeek using just one RTX Pro 6000? I want to know if PP and TG speeds would improve if I decided to "change the GPU situation"... I saw you getting around 12 to 15 t/s using either one 4090, a mix of 4090/5090, or two RTX Pros. What do you think made it that fast? I think 15 t/s is super nice.

For now I need to give a 3090 back to my own PC, so I will look into 1x 3090 performance.

OK, now I have another issue (should I take this thread somewhere else?); some more help would be appreciated.

I had to drop back down to 1x 3090. I also did some simple BIOS tweaking of the memory, plus some commands to tell it to "go faster", and I get 10.5 t/s at 0 context while still using all 8 experts.

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/.../models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 56 \
    --host ... \
    --port ...

But when I use OpenWebUI with SearXNG web search or the stock RAG system, every time I change chat threads or send new input I have to wait extremely long at the KV removal step, as if rebuilding the KV cache takes forever.

INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280242 id_slot=0 id_task=480 p0=2251
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280267 id_slot=0 id_task=480 p0=4299
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280294 id_slot=0 id_task=480 p0=6347
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280326 id_slot=0 id_task=480 p0=8395
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280358 id_slot=0 id_task=480 p0=10443
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280392 id_slot=0 id_task=480 p0=12491

Each of these lines takes around 12-15 seconds to appear, which is far too slow. Is this the PP issue again?

I know exactly what you are talking about. Look up what a task model is in OpenWebUI, and switch the task model to a different model. Your task model has to be a model exposed by either llama.cpp or Ollama. I have Ollama running alongside llama.cpp, and because I have switched my task model to Llama 3.2, it barely uses any VRAM, and all the tag generation, chat-name generation, web search, and so on happens using that task model from Ollama. The task model setting is hidden inside OpenWebUI's interface settings, I think. I'm driving right now, but when I reach home in about an hour and a half I can let you know, unless you have figured it out yourself by then.

Ah okay. I believe I have added the task models now, but it still hits this slow KV rm when I add a document or use web search, i.e. when it's adding a lot to the context. Is that a different issue?

For example, my opening prompt with web search on is "Search what is the cost of Hamburgers in [name-burger-place-we-all-know]"; it fetches a few sites as results and then the dreaded series of kv rm lines happens.
If I reply with web search off (a normal prompt) it responds as fast as usual, but if I do another search-query prompt it slows down again, typically whenever I need to inject RAG/search context.

@SFPLM

p0=12491 means that you are sending a fairly large prompt and the server has processed 12491 tokens into the KV cache so far. You can enable prompt caching (I forget if it is on by default), but that only helps when you are growing the same conversation and appending new content to the end of it.
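If you drive the server directly rather than through OpenWebUI, the completion endpoint accepts a `cache_prompt` field (this comes from the upstream llama.cpp server API; I haven't checked whether OpenWebUI sets it) so the slot reuses the longest matching prompt prefix instead of reprocessing everything. A minimal sketch of the request body:

```python
import json

# Hypothetical request body for llama-server's /completion endpoint.
# "cache_prompt" asks the slot to keep its KV cache and only process
# the suffix that differs from the previous request.
payload = {
    "prompt": "Summarize the following search results: ...",
    "n_predict": 256,
    "cache_prompt": True,  # reuse matching prompt prefix in the KV cache
}

body = json.dumps(payload)
print(body)
# POST this to e.g. http://127.0.0.1:8080/completion with curl or requests
```

Note this only helps if the new prompt shares a prefix with the previous one; RAG frontends that inject fresh search results near the top of the prompt defeat it.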

It seems like your OpenWebUI is doing some kind of web search/scraping and then sending the result to the big LLM for summarizing. As @mctl suggests, you might want to have two different LLMs loaded for different purposes, e.g.:

  1. A very small, long-context (non-thinking) model for fast text summarizing, like maybe Phi4, Polaris-4B, Gemma?? (I dunno, do your research here).
  2. Your big DeepSeek-R1-0528 to take the processed text from above and give the final answer.

More complex "agentic" workflows will typically require a couple or more models/re-rankers/embedding models running simultaneously. You can simply start multiple copies of ik_llama.cpp, each on a different port, to provide multiple LLM endpoints, and configure OpenWebUI to use them.
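For example, two independent server processes on separate ports might look like this (model paths, ports, and flags below are illustrative placeholders; tune layers/threads for whatever small model you pick):

```shell
#!/usr/bin/env bash
# Sketch: two independent ik_llama.cpp endpoints for OpenWebUI.
# Paths and ports here are placeholders, not a tested config.

# Big model on port 8080 (the main chat model)
./build/bin/llama-server \
    --model /models/DeepSeek-R1-0528-IQ4_K_R4-00001-of-00010.gguf \
    --host 127.0.0.1 --port 8080 &

# Small task/summarizer model on port 8081
./build/bin/llama-server \
    --model /models/small-task-model.gguf \
    --host 127.0.0.1 --port 8081 &

wait
```

Then point OpenWebUI's main chat model at port 8080 and its task model at port 8081.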

I don't use OpenWebUI myself, but there are some other settings too, like disabling tag generation and autocomplete, which can be useful when running a big model; e.g. my old startup script below, which is probably out of date by now. No need for Ollama.

#!/usr/bin/env bash

source venv/bin/activate

# IT DOES NOT HONOR HOST and PORT ENV VAR SO PASS IT MANUALLY
# https://docs.openwebui.com/getting-started/env-configuration/#port

export DATA_DIR="$(pwd)/data"
export ENABLE_OLLAMA_API=False
export ENABLE_OPENAI_API=True
export OPENAI_API_KEY="none"
export OPENAI_API_BASE_URL="http://127.0.0.1:8080/v1"
#export DEFAULT_MODELS="openai/foo/bar"
export WEBUI_AUTH=True
export DEFAULT_USER_ROLE="admin"
export HOST=127.0.0.1
export PORT=3000

export ENABLE_TAGS_GENERATION=False
export ENABLE_AUTOCOMPLETE_GENERATION=False

open-webui serve \
  --host $HOST \
  --port $PORT

I am having great success with these quants on an EPYC 7282 + 256GB DDR4-3200 RAM (8 channels) + 4x RTX 3090 with the latest ik_llama.cpp (4622fad) - thank you for making these!
Getting a very usable ~7 t/s for individual non-batched requests, with some acceptable degradation as the KV cache grows:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |   31.896 |    64.21 |   68.622 |     7.46 |
|  2048 |    512 |   2048 |   31.820 |    64.36 |   68.388 |     7.49 |
|  2048 |    512 |   4096 |   32.039 |    63.92 |   69.754 |     7.34 |
|  2048 |    512 |   6144 |   32.232 |    63.54 |   71.181 |     7.19 |
|  2048 |    512 |   8192 |   32.445 |    63.12 |   72.344 |     7.08 |
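The degradation is mild and easy to quantify from the table: TG only drops about 5% going from an empty cache to 8k tokens of KV.

```python
# TG slowdown across the sweep table above (N_KV -> S_TG t/s).
tg_speeds = {0: 7.46, 2048: 7.49, 4096: 7.34, 6144: 7.19, 8192: 7.08}

drop = (tg_speeds[0] - tg_speeds[8192]) / tg_speeds[0]
print(f"TG slowdown from empty to 8k KV: {drop:.1%}")
```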

Parameters that work well with my setup:

./llama-server --model model/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
--threads 32 \
--temp 0.3 \
--min_p 0.05 \
--ctx-size 32768 \
-ts 24,24,24,24 \
-ngl 63 \
-ctk q8_0 \
-fmoe \
-mla 3 \
-fa \
-amb 512 \
-ub 2048 \
-b 2048 \
-ot exps=CPU \
-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
-ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
-ot "blk\.(9|10|11)\.ffn_.*=CUDA2" \
-ot "blk\.(12|13|14)\.ffn_.*=CUDA3"
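The `-ot` overrides are regular expressions matched against tensor names, so you can sanity-check which tensors land where before loading a 300GB model. A small sketch (tensor names follow the usual GGUF `blk.N.ffn_*` convention; I'm assuming first-match-wins here, so check the actual `-ot` semantics in your build, as rule order may matter):

```python
import re

# Override rules as (pattern, target) pairs, first match wins --
# mirrors the -ot flags in the command above, specific rules first.
rules = [
    (r"blk\.(3|4|5)\.ffn_.*", "CUDA0"),
    (r"blk\.(6|7|8)\.ffn_.*", "CUDA1"),
    (r"blk\.(9|10|11)\.ffn_.*", "CUDA2"),
    (r"blk\.(12|13|14)\.ffn_.*", "CUDA3"),
    (r"exps", "CPU"),  # catch-all for the remaining expert tensors
]

def placement(tensor_name):
    """Return the backend the first matching rule assigns, else default."""
    for pattern, target in rules:
        if re.search(pattern, tensor_name):
            return target
    return "default"

print(placement("blk.4.ffn_gate_exps.weight"))   # CUDA0
print(placement("blk.30.ffn_down_exps.weight"))  # CPU (exps catch-all)
```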

Memory allocation:

llm_load_tensors:        CPU buffer size = 42930.78 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 43869.77 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 15721.59 MiB
llm_load_tensors:      CUDA1 buffer size = 15033.14 MiB
llm_load_tensors:      CUDA2 buffer size = 15291.42 MiB
llm_load_tensors:      CUDA3 buffer size = 15713.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   306.01 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   286.88 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   306.01 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   267.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  3588.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  3560.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  3560.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  3560.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
llama_new_context_with_model: graph nodes  = 8245
llama_new_context_with_model: graph splits = 178

This is on Ubuntu 22.04, NVIDIA drivers 570.133.20, ik_llama.cpp compiled against CUDA Toolkit 12.8

Just finished uploading a new IQ3_KS-based quant to https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF

  • IQ3_KS 281.463 GiB (3.598 BPW)

I might add a smaller IQ2_KS as well, maybe a small IQ1_S for the 128GB RAM club, and if there is interest possibly a slightly larger one as well.

@ubergarm
If you do have spare time, I would be interested in an IQ4 of Chimera 2 myself. I personally have not tried Chimera 1 or 2 yet, but at a glance it may be a sidegrade for something that thinks 'just enough' (I personally prefer non-reasoning over reasoning).
