Testing IQ5_K

by shewin - opened Jan 10

Jan 10

W790E Sage + QYFS + 512G + RTX5090

Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40445.11 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 41384.10 MiB
llm_load_tensors: CPU buffer size = 938.98 MiB
llm_load_tensors: CUDA0 buffer size = 16707.02 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.1.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.2.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.3.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.4.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.5.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.6.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.7.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.8.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.9.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.10.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.11.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.12.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.13.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.14.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.15.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.16.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.17.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.18.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.19.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.20.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.21.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.22.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.23.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.24.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.25.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.26.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.27.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.28.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.29.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.30.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.31.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.32.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.33.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.34.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.35.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.36.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.37.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.38.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.39.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.40.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.41.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.42.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.43.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.44.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.45.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.46.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.47.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.48.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.49.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.50.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.51.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.52.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.53.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.54.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.55.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.56.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.57.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.58.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.59.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 1
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: split_mode_graph_scheduling = 0
llama_new_context_with_model: split_mode_f16= 1
llama_new_context_with_model: sched_async = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 7347.71 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 369.02 MiB
llama_new_context_with_model: graph nodes = 24110
llama_new_context_with_model: graph splits = 118
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload

main: n_kv_max = 80128, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
2048	512	0	44.385	46.14	44.178	11.59
2048	512	2048	44.477	46.05	44.972	11.38
2048	512	4096	44.928	45.58	45.169	11.34
2048	512	6144	45.078	45.43	45.203	11.33
2048	512	8192	45.377	45.13	45.767	11.19

shewin

Jan 10

too much talker, doesn't stop thinking.
It doesn't seem to fit on the local machine.

ubergarm

Owner Jan 10

Yes, the Speciale version was trained for extra long thinking. Supposedly it scores better, if you are very patient haha...

As always, thanks for the report and demo!

segmond

Feb 8

•

edited Feb 8

Can I run the IQ5_K with regular llama.cpp? About how much token is it using to think through 1 solution? 32k, 64k?

ubergarm

Owner Feb 8

@segmond

mainline llama.cpp does not support the iq5_k quantized tensors, no. you can git clone and compile ik_llama.cpp pretty quick though which does support all the mainline quants plus ik's new ones (ik contributed the older q5_K to mainline a couple years ago before starting the new fork).

Check out the code block in this section with git clone and compiling instructions: https://huggingface.co/ubergarm/DeepSeek-V3.2-Speciale-GGUF#quick-start

Or for windows you can use Thireus' precompiled binaries (linked on model card as well).

I'm not sure how much context was consumed during @shewin 's testing though. probably varies, but i'd suggest a minimum of 32k context for any 1-shot vibe coding kind of project, preferably 64k or more for multi-turn stuff i'm guessing. but dunno the exact details!

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment