Testing IQ5_K
W790E Sage + QYFS + 512G + RTX5090
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40445.11 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 41384.10 MiB
llm_load_tensors: CPU buffer size = 938.98 MiB
llm_load_tensors: CUDA0 buffer size = 16707.02 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.1.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.2.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.3.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.4.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.5.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.6.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.7.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.8.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.9.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.10.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.11.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.12.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.13.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.14.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.15.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.16.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.17.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.18.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.19.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.20.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.21.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.22.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.23.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.24.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.25.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.26.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.27.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.28.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.29.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.30.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.31.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.32.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.33.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.34.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.35.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.36.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.37.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.38.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.39.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.40.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.41.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.42.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.43.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.44.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.45.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.51.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.52.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.53.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.54.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.55.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.56.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.57.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.58.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.59.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: split_mode_f16= 1
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 7347.71 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 369.02 MiB
llama_new_context_with_model: graph nodes = 24110
llama_new_context_with_model: graph splits = 118
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
main: n_kv_max = 80128, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 44.385 | 46.14 | 44.178 | 11.59 |
| 2048 | 512 | 2048 | 44.477 | 46.05 | 44.972 | 11.38 |
| 2048 | 512 | 4096 | 44.928 | 45.58 | 45.169 | 11.34 |
| 2048 | 512 | 6144 | 45.078 | 45.43 | 45.203 | 11.33 |
| 2048 | 512 | 8192 | 45.377 | 45.13 | 45.767 | 11.19 |
Yes, the Speciale version was trained for extra long thinking. Supposedly it scores better, if you are very patient haha...
As always, thanks for the report and demo!
Can I run the IQ5_K with regular llama.cpp? About how much token is it using to think through 1 solution? 32k, 64k?
mainline llama.cpp does not support the iq5_k quantized tensors, no. you can git clone and compile ik_llama.cpp pretty quick though which does support all the mainline quants plus ik's new ones (ik contributed the older q5_K to mainline a couple years ago before starting the new fork).
Check out the code block in this section with git clone and compiling instructions: https://huggingface.co/ubergarm/DeepSeek-V3.2-Speciale-GGUF#quick-start
Or for windows you can use Thireus' precompiled binaries (linked on model card as well).
I'm not sure how much context was consumed during @shewin 's testing though. probably varies, but i'd suggest a minimum of 32k context for any 1-shot vibe coding kind of project, preferably 64k or more for multi-turn stuff i'm guessing. but dunno the exact details!
