Need help getting faster speeds.

#4
by SFPLM - opened

Hey, I am a beginner at this, trying out the IQ4 quant with 1x 3090 and 512 GB RAM on an Intel Xeon Platinum 8480 ES (QYFS, 56 cores).

I build ik_llama.cpp with:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build ./build --config Release -j $(nproc)

I launch the server with:

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/.../ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -ctv q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    -rtr \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 56 \
    --host 127.0.0.1 \
    --port 8080

I get the speeds below:

INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924132 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =    1009.41 ms /    10 tokens (  100.94 ms per token,     9.91 tokens per second) | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 t_prompt_processing=1009.41 n_prompt_tokens_processed=10 t_token=100.941 n_tokens_second=9.906777226300512
INFO [           print_timings] generation eval time =    1280.22 ms /    12 runs   (  106.69 ms per token,     9.37 tokens per second) | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 t_token_generation=1280.221 n_decoded=12 t_token=106.68508333333334 n_tokens_second=9.37338162707845
INFO [           print_timings]           total time =    2289.63 ms | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 t_prompt_processing=1009.41 t_token_generation=1280.221 t_total=2289.631
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924135 id_slot=0 id_task=0 n_ctx=32768 n_past=21 n_system_tokens=0 n_cache_tokens=21 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924135
INFO [      log_server_request] request | tid="134451669495808" timestamp=1750924135 remote_addr="127.0.0.1" remote_port=33058 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924135
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750924140 id_slot=0 id_task=14
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924140 id_slot=0 id_task=14 p0=21
INFO [           print_timings] prompt eval time     =     442.20 ms /     6 tokens (   73.70 ms per token,    13.57 tokens per second) | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 t_prompt_processing=442.197 n_prompt_tokens_processed=6 t_token=73.6995 n_tokens_second=13.568613084213597
INFO [           print_timings] generation eval time =    3623.74 ms /    32 runs   (  113.24 ms per token,     8.83 tokens per second) | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 t_token_generation=3623.739 n_decoded=32 t_token=113.24184375 n_tokens_second=8.830658057878892
INFO [           print_timings]           total time =    4065.94 ms | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 t_prompt_processing=442.197 t_token_generation=3623.739 t_total=4065.936
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924144 id_slot=0 id_task=14 n_ctx=32768 n_past=58 n_system_tokens=0 n_cache_tokens=58 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924144
INFO [      log_server_request] request | tid="134451661103104" timestamp=1750924144 remote_addr="127.0.0.1" remote_port=33064 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924144
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750924161 id_slot=0 id_task=48
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924161 id_slot=0 id_task=48 p0=58
INFO [      log_server_request] request | tid="134451652710400" timestamp=1750924214 remote_addr="127.0.0.1" remote_port=53082 status=200 method="GET" path="/" params={}
INFO [           print_timings] prompt eval time     =    1115.50 ms /    15 tokens (   74.37 ms per token,    13.45 tokens per second) | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 t_prompt_processing=1115.499 n_prompt_tokens_processed=15 t_token=74.3666 n_tokens_second=13.446896859611709
INFO [           print_timings] generation eval time =   99090.24 ms /   865 runs   (  114.56 ms per token,     8.73 tokens per second) | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 t_token_generation=99090.236 n_decoded=865 t_token=114.55518612716763 n_tokens_second=8.72941709413226
INFO [           print_timings]           total time =  100205.74 ms | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 t_prompt_processing=1115.499 t_token_generation=99090.236 t_total=100205.735
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924261 id_slot=0 id_task=48 n_ctx=32768 n_past=937 n_system_tokens=0 n_cache_tokens=937 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924261
INFO [      log_server_request] request | tid="134451753373696" timestamp=1750924261 remote_addr="127.0.0.1" remote_port=54154 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924261
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750924580 id_slot=0 id_task=915
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750924580 id_slot=0 id_task=915 p0=937
INFO [           print_timings] prompt eval time     =    1559.26 ms /    25 tokens (   62.37 ms per token,    16.03 tokens per second) | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 t_prompt_processing=1559.258 n_prompt_tokens_processed=25 t_token=62.37032 n_tokens_second=16.033267105251344
INFO [           print_timings] generation eval time =   80799.59 ms /   686 runs   (  117.78 ms per token,     8.49 tokens per second) | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 t_token_generation=80799.59 n_decoded=686 t_token=117.78365889212827 n_tokens_second=8.490142091067542
INFO [           print_timings]           total time =   82358.85 ms | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 t_prompt_processing=1559.258 t_token_generation=80799.59 t_total=82358.848
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750924662 id_slot=0 id_task=915 n_ctx=32768 n_past=1647 n_system_tokens=0 n_cache_tokens=1647 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924662
INFO [      log_server_request] request | tid="134451644317696" timestamp=1750924662 remote_addr="127.0.0.1" remote_port=35684 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750924662
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925002 id_slot=0 id_task=1603
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925002 id_slot=0 id_task=1603 p0=8


INFO [           print_timings] prompt eval time     =    9485.77 ms /   435 tokens (   21.81 ms per token,    45.86 tokens per second) | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 t_prompt_processing=9485.769 n_prompt_tokens_processed=435 t_token=21.806365517241378 n_tokens_second=45.85816922170464
INFO [           print_timings] generation eval time =   11827.47 ms /   101 runs   (  117.10 ms per token,     8.54 tokens per second) | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 t_token_generation=11827.468 n_decoded=101 t_token=117.10364356435645 n_tokens_second=8.539443945229866
INFO [           print_timings]           total time =   21313.24 ms | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 t_prompt_processing=9485.769 t_token_generation=11827.468 t_total=21313.237
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925024 id_slot=0 id_task=1603 n_ctx=32768 n_past=543 n_system_tokens=0 n_cache_tokens=543 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925024
INFO [      log_server_request] request | tid="134451635924992" timestamp=1750925024 remote_addr="127.0.0.1" remote_port=48162 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925024
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925068 id_slot=0 id_task=1706
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925068 id_slot=0 id_task=1706 p0=8
INFO [      log_server_request] request | tid="134451627532288" timestamp=1750925102 remote_addr="127.0.0.1" remote_port=33158 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925102 id_slot=0 id_task=1706 n_ctx=32768 n_past=1668 n_system_tokens=0 n_cache_tokens=1668 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925102
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925110 id_slot=0 id_task=1710
INFO [            update_slots] we have to evaluate at least 1 token to generate logits | tid="134454010454016" timestamp=1750925110 id_slot=0 id_task=1710
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925110 id_slot=0 id_task=1710 p0=1666
INFO [      log_server_request] request | tid="134451619139584" timestamp=1750925118 remote_addr="127.0.0.1" remote_port=44816 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925118 id_slot=0 id_task=1710 n_ctx=32768 n_past=1731 n_system_tokens=0 n_cache_tokens=1731 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925118
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925133 id_slot=0 id_task=1777
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925133 id_slot=0 id_task=1777 p0=1730
INFO [           print_timings] prompt eval time     =    1074.76 ms /    13 tokens (   82.67 ms per token,    12.10 tokens per second) | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 t_prompt_processing=1074.764 n_prompt_tokens_processed=13 t_token=82.67415384615384 n_tokens_second=12.095678679226324
INFO [           print_timings] generation eval time =   80388.76 ms /   662 runs   (  121.43 ms per token,     8.23 tokens per second) | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 t_token_generation=80388.76 n_decoded=662 t_token=121.43317220543805 n_tokens_second=8.234982104463361
INFO [           print_timings]           total time =   81463.52 ms | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 t_prompt_processing=1074.764 t_token_generation=80388.76 t_total=81463.52399999999
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=1777 n_ctx=32768 n_past=2404 n_system_tokens=0 n_cache_tokens=2404 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925215
INFO [      log_server_request] request | tid="134451610746880" timestamp=1750925215 remote_addr="127.0.0.1" remote_port=48836 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925215
INFO [   launch_slot_with_task] slot is processing task | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=2441
INFO [            update_slots] kv cache rm [p0, end) | tid="134454010454016" timestamp=1750925215 id_slot=0 id_task=2441 p0=1979
INFO [           print_timings] prompt eval time     =   10217.66 ms /   436 tokens (   23.43 ms per token,    42.67 tokens per second) | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 t_prompt_processing=10217.656 n_prompt_tokens_processed=436 t_token=23.434990825688075 n_tokens_second=42.67123496817665
INFO [           print_timings] generation eval time =   29813.12 ms /   244 runs   (  122.18 ms per token,     8.18 tokens per second) | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 t_token_generation=29813.119 n_decoded=244 t_token=122.18491393442622 n_tokens_second=8.18431644136261
INFO [           print_timings]           total time =   40030.78 ms | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 t_prompt_processing=10217.656 t_token_generation=29813.119 t_total=40030.775
INFO [            update_slots] slot released | tid="134454010454016" timestamp=1750925255 id_slot=0 id_task=2441 n_ctx=32768 n_past=2658 n_system_tokens=0 n_cache_tokens=2658 truncated=false
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925255
INFO [      log_server_request] request | tid="134451610746880" timestamp=1750925255 remote_addr="127.0.0.1" remote_port=48836 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925255
INFO [            update_slots] all slots are idle | tid="134454010454016" timestamp=1750925283

I also tried with 2x 3090, and at both 32k and 64k context the generation speed is similar to the 1x 3090 numbers above.

The PP feels too slow. I was hoping for at least 9.5 t/s generation and perhaps much better PP speeds. I am not sure if these PP speeds are expected or how to get better numbers, as I would like to see whether I can comfortably run DeepSeek given my high RAM and a supposedly good CPU. Is the 3090 the bottleneck, or is something missing from my launch arguments?

How can I further improve this?

@SFPLM

Glad you got it going and can now begin to optimize for your exact system. There is a lot of info out there already, so check the discussions on the ik_llama.cpp GitHub and other threads here.

First off, compile like this, especially when using 2x 3090 Ti. GGML_SCHED_MAX_COPIES=1 makes VRAM allocation easier to reason about and lets you increase batch sizes for more PP. GGML_CUDA_IQK_FORCE_BF16 may increase speed on 3090s and also prevents numerical issues specific to DeepSeek MLA.

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)

Next, how is your NUMA configured? I assume this is a single-socket system? Make sure the BIOS is configured to present only a single NUMA node, e.g. numactl --hardware shows only one. No llama.cpp fork has optimizations here, and you don't have enough RAM to try ktransformers with USE_NUMA=1, which doubles RAM usage (one copy per CPU socket/NUMA node).

If you can't get a single NUMA node, let me know and you can use some numactl tricks to glue them together and get the best you can.
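As a quick check, here is a small sketch that counts NUMA nodes via Linux sysfs (an assumption on my part; numactl --hardware reports the same information if it is installed):

```shell
# Count NUMA nodes exposed by the kernel; llama.cpp CPU inference
# runs best when only one node is present.
nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
if [ "$nodes" -le 1 ]; then
    echo "single NUMA node: good"
else
    echo "$nodes NUMA nodes: consider a single-node BIOS setting or numactl interleaving"
fi
```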

Regarding your command, try this. I will assume you are using 2x 3090 Ti, which gives more VRAM to offload additional layers:

./build/bin/llama-server \
    --model /mnt/.../ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
    -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -ub 2048 -b 2048 \
    --parallel 1 \
    --threads 56 \
    --host 127.0.0.1 \
    --port 8080

Adjust the number of additional ffn layers offloaded to CUDA0 and CUDA1 based on whether you OOM or not. Basically crank it up until you OOM, then dial back by one. Increasing batch sizes takes a little more VRAM but gives a lot more PP.
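You can also budget this on paper before launching. The sketch below uses illustrative figures (roughly 6.4 GiB per offloaded IQ4_K_R4 ffn block, plus the KV and compute buffer sizes a 32k q8_0 / -ub 2048 setup might report); read the real numbers from your own llm_load_tensors and llama_kv_cache_init log lines:

```shell
# Back-of-envelope VRAM budget for one 24 GiB 3090.
# All sizes below are illustrative assumptions, not measured values.
vram_mib=24576      # 3090 total VRAM
base_mib=9057       # attention/shexp/non-repeating layers already on GPU
kv_mib=593          # KV cache at 32k ctx, q8_0
compute_mib=4472    # compute buffer at -ub 2048
per_ffn_mib=6444    # one extra IQ4_K_R4 ffn block
free=$(( vram_mib - base_mib - kv_mib - compute_mib ))
echo "extra ffn blocks that fit: $(( free / per_ffn_mib ))"
```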

When you want to benchmark speeds, I advise using llama-sweep-bench. Simply replace llama-server with llama-sweep-bench and add --warmup-batch; it will ignore the alias/host/port arguments fine.

Finally, use the most recent version of ik_llama.cpp, as PP just got a boost for the _R4 quants running on CUDA like you are doing.

TG will be limited by your RAM bandwidth and, in my experience, can come in below the theoretical maximum on Intel rigs even when mlc (Intel Memory Latency Checker) reads higher bandwidth.

Cheers and keep us all posted!

@ubergarm

Glad to hear it, and I appreciate the advice. After a bit of tinkering, I can't seem to get it to load more than one expert layer onto each GPU. I have tried -ts 23,24 and the V cache at q8_0 (-ctv q8_0), and it just won't fit.

For example

Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA1
Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
... (same pattern for the remaining layers)
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size =  9218.28 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 31988.10 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 21945.34 MiB
llm_load_tensors:      CUDA1 buffer size = 21782.68 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   592.89 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   573.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4472.01 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4689240064
llama_new_context_with_model: failed to allocate compute buffers

The above also happens if I attempt to add two layers. I also tried 3 on CUDA0 and 4 + 5 on CUDA1; CUDA1 OOMs.

The fastest configuration that works so far is:

./build/bin/llama-sweep-bench \
    --model /mnt/.../models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    -ts 23,24 \
    --n-gpu-layers 63 \
    -ot "blk\.(3)\.ffn_.*=CUDA0" \
    -ot "blk\.(4)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    -ub 2048 -b 2048 \
    -ser 6,1 \
    --parallel 1 \
    --threads 56

gets me:

llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size = 22726.83 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 40525.67 MiB
llm_load_tensors:        CPU buffer size = 31988.10 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 15500.99 MiB
llm_load_tensors:      CUDA1 buffer size = 15235.03 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = 6, 1
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   592.89 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   573.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  4472.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  3560.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
llama_new_context_with_model: graph nodes  = 8245
llama_new_context_with_model: graph splits = 149

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 56, n_threads_batch = 56
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 | 0 | 40.892 | 50.08 | 49.042 | 10.44 |
| 2048 | 512 | 2048 | 40.985 | 49.97 | 48.505 | 10.56 |
| 2048 | 512 | 4096 | 41.221 | 49.68 | 47.514 | 10.78 |
| 2048 | 512 | 6144 | 42.020 | 48.74 | 50.148 | 10.21 |
| 2048 | 512 | 8192 | 42.275 | 48.44 | 51.781 | 9.89 |
| 2048 | 512 | 10240 | 42.688 | 47.98 | 51.415 | 9.96 |
| 2048 | 512 | 12288 | 43.135 | 47.48 | 52.508 | 9.75 |
| 2048 | 512 | 14336 | 49.251 | 41.58 | 49.770 | 10.29 |
| 2048 | 512 | 16384 | 43.725 | 46.84 | 54.991 | 9.31 |
| 2048 | 512 | 18432 | 44.590 | 45.93 | 55.337 | 9.25 |
| 2048 | 512 | 20480 | 44.006 | 46.54 | 60.230 | 8.50 |
| 2048 | 512 | 22528 | 44.505 | 46.02 | 58.256 | 8.79 |
| 2048 | 512 | 24576 | 45.575 | 44.94 | 55.449 | 9.23 |
| 2048 | 512 | 26624 | 45.942 | 44.58 | 59.890 | 8.55 |
| 2048 | 512 | 28672 | 46.320 | 44.21 | 56.559 | 9.05 |
| 2048 | 512 | 30720 | 75.212 | 27.23 | 66.060 | 7.75 |

Using 6 experts instead of 8 does bring the speed up.

With 8 experts I think it goes like:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|-----|-------|--------|----------|--------|----------|
| 2048 | 512 | 0 | 43.768 | 46.79 | 58.812 | 8.71 |
| 2048 | 512 | 2048 | 43.994 | 46.55 | 60.076 | 8.52 |
| 2048 | 512 | 4096 | 44.731 | 45.79 | 59.162 | 8.65 |
| 2048 | 512 | 6144 | 45.756 | 44.76 | 61.601 | 8.31 |
| 2048 | 512 | 8192 | 44.844 | 45.67 | 62.035 | 8.25 |
| 2048 | 512 | 10240 | 44.996 | 45.52 | 62.541 | 8.19 |
| 2048 | 512 | 12288 | 45.741 | 44.77 | 64.000 | 8.00 |
| 2048 | 512 | 14336 | 45.896 | 44.62 | 65.038 | 7.87 |
| 2048 | 512 | 16384 | 48.384 | 42.33 | 70.296 | 7.28 |
| 2048 | 512 | 18432 | 53.666 | 38.16 | 66.915 | 7.65 |
| 2048 | 512 | 20480 | 46.520 | 44.02 | 66.358 | 7.72 |
| 2048 | 512 | 22528 | 47.653 | 42.98 | 67.443 | 7.59 |
| 2048 | 512 | 24576 | 63.603 | 32.20 | 72.567 | 7.06 |
| 2048 | 512 | 26624 | 51.910 | 39.45 | 74.521 | 6.87 |
| 2048 | 512 | 28672 | 49.094 | 41.72 | 71.302 | 7.18 |
| 2048 | 512 | 30720 | 49.411 | 41.45 | 72.487 | 7.06 |

  1. Is it time to look into overclocking or squeezing more RAM speed?
  2. Or is it the 3090's fault (I am on an EVGA 3090 FTW3, not a 3090 Ti, if that makes any difference)? Or is it that I have the monitor plugged into it (my mobo is an Asus W790 Sage SE)?
  3. EDIT: Testing a 4096 batch with no experts on the GPU gets around 75-90 PP t/s and around 8-9.3 t/s generation.
     I may also look into a 1x 3090 setup, since as of this moment my own PC could use a GPU for my own stuff. My guess, though, is that for more PP and max response length I can't offload any experts to the 3090. Darn, I want to load more layers, but why, 3090, why?

main: n_kv_max = 32768, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 63, n_threads = 56, n_threads_batch = 56

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 8192 | 2048 | 0 | 52.480 | 156.10 | 206.261 | 9.93 |
| 8192 | 2048 | 8192 | 56.717 | 144.44 | 228.645 | 8.96 |
| 8192 | 2048 | 16384 | 62.132 | 131.85 | 223.730 | 9.15 |
| 8192 | 2048 | 24576 | 71.045 | 115.31 | 240.698 | 8.51 |

Also tried this config. When I use llama-server with a test prompt of my own, I get 46 PP / 10.55 gen t/s, and on a follow-up prompt 35 PP / 10 gen. Not sure if this is in line with the bench.
--ctx-size 32768
-ctk q8_0
-mla 3 -fa
-amb 512
-fmoe
--temp 0.3
--min-p 0.05
-ts 23,24
-ot "blk\.(3)\.ffn_.*=CUDA1"
--n-gpu-layers 63
-ot exps=CPU
-ub 8192 -b 8192
-ser 6,1
--parallel 1
--threads 56

Looks like you have a very similar config to mine, an Asus Pro WS W790; and judging by the 56-thread count, I think you have a QYFS?

@mtcl I saw some of your videos, which were a source of inspiration.

I think my build has some similarities with yours: Intel Xeon 8480 ES (QYFS, 56 cores), Asus W790 Sage SE, 512 GB DDR5 (rated 5600, but after some BIOS fiddling it still sets itself to 4800; do you know if we can push speeds higher?).

But I only have two 3090s on hand as of now (I will eventually try to get something better, but until my "situation improves" I will see if I can make it work with the 3090s).

I think that 4800 lock comes from the processor being an engineering sample, and no, I haven't been able to overclock it beyond 4800 either. But it's OK for now, I think.

I know every use case is different and I shouldn't generalize, but I've noticed that Qwen3-235B outperforms my expectations and works very well for my use case. I run it at 128k context length and it's able to do 400 PP t/s. Note that you have to use a special 128k variant of the model for that. If you're OK with a 40k token limit, then @ubergarm's model is the best with its IQ3 quant. Very fast and very stable.

2x 3090 is a very potent setup! I would stay with them, or buy a couple more 3090s instead of getting a 4090 or 5090.

And I'm glad you're coming over from the channel :) this ik_llama community is the best!

@mtcl

Have you been able to test DeepSeek using just one RTX Pro 6000? I want to know if PP and TG speeds would improve if I decided to "change the GPU situation"... I saw you getting around 12 to 15 t/s using either one 4090, a mix of 4090/5090, or two RTX Pros. What do you think made it that fast? I think 15 t/s is super nice.

For now I need to give a 3090 back to my own PC, so I will look into 1x 3090 performance.

OK, now I have another issue (should I take this thread somewhere else?); some more help would be appreciated.

I had to drop back down to 1x 3090. I also did some simple BIOS tweaking of the memory, plus some commands to tell it to "go faster", and I get 10.5 t/s at 0 context while still using all 8 experts.

CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/.../models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 56 \
    --host ... \
    --port ...

But when I use OpenWebUI with SearXNG web search or the stock RAG system, every time I change chat threads or send new input I have to wait extremely long at the KV removal step, as if rebuilding the KV cache takes forever.

INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280242 id_slot=0 id_task=480 p0=2251
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280267 id_slot=0 id_task=480 p0=4299
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280294 id_slot=0 id_task=480 p0=6347
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280326 id_slot=0 id_task=480 p0=8395
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280358 id_slot=0 id_task=480 p0=10443
INFO [            update_slots] kv cache rm [p0, end) | tid="127145341607936" timestamp=1751280392 id_slot=0 id_task=480 p0=12491

Each of these lines takes around 12-15 seconds to appear, which is far too slow. Is this the PP issue again?

I know exactly what you are talking about. Look up what a task model is in OpenWebUI, and switch the task model to a different model. Your task model has to be a model exposed by either llama.cpp or Ollama. I have Ollama running alongside llama.cpp, and because I have switched my task model to Llama 3.2, it barely uses any VRAM, and all the tag generation, chat-name generation, web search, and so on happens using that task model from Ollama. The task model setting is hidden inside OpenWebUI's interface settings, I think. I'm driving right now, but when I reach home in about an hour and a half I can let you know, unless you have figured it out yourself by then.

Ah okay. I believe I have added the task models now, but it still hits this slow KV rm when I add a document or use web search, i.e. when it's adding a lot to the context. Is that a different issue?

For example, my opening prompt with web search on is "Search what is the cost of Hamburgers in [name-burger-place-we-all-know]"; it fetches a few sites as results and then the dreaded series of kv rm lines happens.
If I reply with web search off (a normal prompt) it responds as fast as usual, but if I do another search-query prompt it slows down again, typically whenever I need to inject RAG/search context.

@SFPLM

p0=12491 means that you are sending a fairly large prompt and the server has processed 12491 tokens into the KV cache so far. You can enable prompt caching (I forget if it is on by default), but that only helps when you are growing the same conversation and appending new content to the end of it.
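If you drive the server directly rather than through OpenWebUI, the completion endpoint accepts a `cache_prompt` field (this comes from the upstream llama.cpp server API; I haven't checked whether OpenWebUI sets it) so the slot reuses the longest matching prompt prefix instead of reprocessing everything. A minimal sketch of the request body:

```python
import json

# Hypothetical request body for llama-server's /completion endpoint.
# "cache_prompt" asks the slot to keep its KV cache and only process
# the suffix that differs from the previous request.
payload = {
    "prompt": "Summarize the following search results: ...",
    "n_predict": 256,
    "cache_prompt": True,  # reuse matching prompt prefix in the KV cache
}

body = json.dumps(payload)
print(body)
# POST this to e.g. http://127.0.0.1:8080/completion with curl or requests
```

Note this only helps if the new prompt shares a prefix with the previous one; RAG frontends that inject fresh search results near the top of the prompt defeat it.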

It seems like your OpenWebUI is doing some kind of web search/scraping and then sending the result to the big LLM for summarizing. As @mctl suggests, you might want to have two different LLMs loaded for different purposes, e.g.:

  1. A very small, long-context (non-thinking) model for fast text summarizing, like maybe Phi4, Polaris-4B, Gemma?? (I dunno, do your research here).
  2. Your big DeepSeek-R1-0528 to take the processed text from above and give the final answer.

More complex "agentic" workflows will typically require a couple or more models/re-rankers/embedding models running simultaneously. You can simply start multiple copies of ik_llama.cpp, each on a different port, to provide multiple LLM endpoints, and configure OpenWebUI to use them.
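For example, two independent server processes on separate ports might look like this (model paths, ports, and flags below are illustrative placeholders; tune layers/threads for whatever small model you pick):

```shell
#!/usr/bin/env bash
# Sketch: two independent ik_llama.cpp endpoints for OpenWebUI.
# Paths and ports here are placeholders, not a tested config.

# Big model on port 8080 (the main chat model)
./build/bin/llama-server \
    --model /models/DeepSeek-R1-0528-IQ4_K_R4-00001-of-00010.gguf \
    --host 127.0.0.1 --port 8080 &

# Small task/summarizer model on port 8081
./build/bin/llama-server \
    --model /models/small-task-model.gguf \
    --host 127.0.0.1 --port 8081 &

wait
```

Then point OpenWebUI's main chat model at port 8080 and its task model at port 8081.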

I don't use OpenWebUI myself, but there are some other settings too, like disabling tag generation and autocomplete, which can be useful when running a big model; e.g. my old startup script below, which is probably out of date by now. No need for Ollama.

#!/usr/bin/env bash

source venv/bin/activate

# IT DOES NOT HONOR HOST and PORT ENV VAR SO PASS IT MANUALLY
# https://docs.openwebui.com/getting-started/env-configuration/#port

export DATA_DIR="$(pwd)/data"
export ENABLE_OLLAMA_API=False
export ENABLE_OPENAI_API=True
export OPENAI_API_KEY="none"
export OPENAI_API_BASE_URL="http://127.0.0.1:8080/v1"
#export DEFAULT_MODELS="openai/foo/bar"
export WEBUI_AUTH=True
export DEFAULT_USER_ROLE="admin"
export HOST=127.0.0.1
export PORT=3000

export ENABLE_TAGS_GENERATION=False
export ENABLE_AUTOCOMPLETE_GENERATION=False

open-webui serve \
  --host $HOST \
  --port $PORT

I am having great success with these quants on an EPYC 7282 + 256GB DDR4-3200 RAM (8 channels) + 4x RTX 3090 with the latest ik_llama.cpp (4622fad) - thank you for making these!
Getting a very usable ~7 t/s for individual non-batched requests, with some acceptable degradation as the KV cache grows:

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |   31.896 |    64.21 |   68.622 |     7.46 |
|  2048 |    512 |   2048 |   31.820 |    64.36 |   68.388 |     7.49 |
|  2048 |    512 |   4096 |   32.039 |    63.92 |   69.754 |     7.34 |
|  2048 |    512 |   6144 |   32.232 |    63.54 |   71.181 |     7.19 |
|  2048 |    512 |   8192 |   32.445 |    63.12 |   72.344 |     7.08 |
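The degradation is mild and easy to quantify from the table: TG only drops about 5% going from an empty cache to 8k tokens of KV.

```python
# TG slowdown across the sweep table above (N_KV -> S_TG t/s).
tg_speeds = {0: 7.46, 2048: 7.49, 4096: 7.34, 6144: 7.19, 8192: 7.08}

drop = (tg_speeds[0] - tg_speeds[8192]) / tg_speeds[0]
print(f"TG slowdown from empty to 8k KV: {drop:.1%}")
```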

Parameters that work well with my setup:

./llama-server --model model/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
--threads 32 \
--temp 0.3 \
--min_p 0.05 \
--ctx-size 32768 \
-ts 24,24,24,24 \
-ngl 63 \
-ctk q8_0 \
-fmoe \
-mla 3 \
-fa \
-amb 512 \
-ub 2048 \
-b 2048 \
-ot exps=CPU \
-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
-ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
-ot "blk\.(9|10|11)\.ffn_.*=CUDA2" \
-ot "blk\.(12|13|14)\.ffn_.*=CUDA3"
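The `-ot` overrides are regular expressions matched against tensor names, so you can sanity-check which tensors land where before loading a 300GB model. A small sketch (tensor names follow the usual GGUF `blk.N.ffn_*` convention; I'm assuming first-match-wins here, so check the actual `-ot` semantics in your build, as rule order may matter):

```python
import re

# Override rules as (pattern, target) pairs, first match wins --
# mirrors the -ot flags in the command above, specific rules first.
rules = [
    (r"blk\.(3|4|5)\.ffn_.*", "CUDA0"),
    (r"blk\.(6|7|8)\.ffn_.*", "CUDA1"),
    (r"blk\.(9|10|11)\.ffn_.*", "CUDA2"),
    (r"blk\.(12|13|14)\.ffn_.*", "CUDA3"),
    (r"exps", "CPU"),  # catch-all for the remaining expert tensors
]

def placement(tensor_name):
    """Return the backend the first matching rule assigns, else default."""
    for pattern, target in rules:
        if re.search(pattern, tensor_name):
            return target
    return "default"

print(placement("blk.4.ffn_gate_exps.weight"))   # CUDA0
print(placement("blk.30.ffn_down_exps.weight"))  # CPU (exps catch-all)
```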

Memory allocation:

llm_load_tensors:        CPU buffer size = 42930.78 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 43869.77 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 15721.59 MiB
llm_load_tensors:      CUDA1 buffer size = 15033.14 MiB
llm_load_tensors:      CUDA2 buffer size = 15291.42 MiB
llm_load_tensors:      CUDA3 buffer size = 15713.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   306.01 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   286.88 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   306.01 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   267.76 MiB
llama_new_context_with_model: KV self size  = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  3588.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  3560.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  3560.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  3560.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
llama_new_context_with_model: graph nodes  = 8245
llama_new_context_with_model: graph splits = 178

This is on Ubuntu 22.04, NVIDIA drivers 570.133.20, ik_llama.cpp compiled against CUDA Toolkit 12.8

Just finished uploading a new IQ3_KS-based quant to https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF

  • IQ3_KS 281.463 GiB (3.598 BPW)

I might add a smaller IQ2_KS as well, maybe a small IQ1_S for the 128GB RAM club, and if there is interest possibly a slightly larger one as well.

@ubergarm
If you do have spare time, I would be interested in an IQ4 of Chimera 2 myself. I personally have not tried Chimera 1 or 2 yet, but at a glance it may be a sidegrade for something that thinks 'just enough' (I personally prefer non-reasoning over reasoning).
