imatrix coming soon? 🫣

#1
by Thireus - opened

Hey @ubergarm , thank you for looking into this model. I just saw in your other post that you may be AFK, and wanted to ask about the imatrix ETA for this model. I might be able to produce one as well in a few days on Q8_0, but if you're also going to release one then I'll wait (I just released the BF16 a few hours ago).

Heya @Thireus , yeah I managed to check on the remote rig while I was away. Just got back to my desk and the imatrix is done!

[810]3.3940,[811]3.3950,[812]3.3967,
save_imatrix: stored collected data after 812 chunks in /mnt/data/models/ubergarm/DeepSeek-V3.1-Terminus-GGUF/imatrix-DeepSeek-V3.1-Terminus-Q8_0.dat

llama_print_timings:        load time =  271554.50 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 8854734.92 ms / 415744 tokens (   21.30 ms per token,    46.95 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 9138093.84 ms / 415745 tokens

Final estimate: PPL = 3.3967 +/- 0.01656
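For reference, the running PPL numbers and the +/- figure work out (as I understand llama.cpp's perplexity code) to the exponential of the mean per-token negative log-likelihood, with a delta-method standard error; a minimal sketch:

```python
import math

def ppl_with_error(nlls):
    # Perplexity over a token stream: exp of the mean per-token
    # negative log-likelihood (NLL).
    n = len(nlls)
    mean = sum(nlls) / n
    ppl = math.exp(mean)
    # Delta-method standard error: the standard error of the mean NLL
    # scaled by d/dx exp(x) at the mean, which is just ppl itself.
    var = sum((x - mean) ** 2 for x in nlls) / n
    err = ppl * math.sqrt(var / n)
    return ppl, err
```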

I'll upload that now, then get the model card and some quants up in the next few hours!

Thank you so much!

@Thireus okay, it's up. I haven't tested it yet; gonna do that quickly to see if any tensors are missing this time or what's going on. I had to omit --layer-similarity again due to this error:

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 663.766 ms
compute_imatrix: computing over 812 chunks with batch_size 512
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
======================================= HAVE_FANCY_SIMD is defined
Oops, inconsistent ffn vs last_input size

I'm re-cooking the imatrix now with -no-fug, otherwise it would be missing importance data for the dense/shexp (up|gate) tensors... another 3 hours or so and the corrected one will be uploaded. Sorry for the confusion!

@Thireus

Oh no, I spoke too soon: using this new imatrix for quantizing, it reports missing imatrix weights for the first 3 dense layers' ffn_(gate|up) and for the shared expert ffn_(gate|up)_shexp. However, it does seem to have the down tensors and routed experts, so not all is lost.

Not sure why this started happening recently; I'll have to bisect the most recent PRs from a couple weeks ago and see what changed.

$ grep 'did not find' quantize-DeepSeek-V3.1-Terminus-IQ2_KS-main.log
====== llama_model_quantize_internal: did not find weights for token_embd.weight
====== llama_model_quantize_internal: did not find weights for blk.0.ffn_gate.weight
====== llama_model_quantize_internal: did not find weights for blk.0.ffn_up.weight
====== llama_model_quantize_internal: did not find weights for blk.1.ffn_gate.weight
====== llama_model_quantize_internal: did not find weights for blk.1.ffn_up.weight
====== llama_model_quantize_internal: did not find weights for blk.2.ffn_gate.weight
====== llama_model_quantize_internal: did not find weights for blk.2.ffn_up.weight
====== llama_model_quantize_internal: did not find weights for blk.3.ffn_gate_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.3.ffn_up_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.4.ffn_gate_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.4.ffn_up_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.5.ffn_gate_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.5.ffn_up_shexp.weight
...

So for now, if you use this imatrix you probably want to leave these specific tensors unquantized at q8_0, unfortunately.

fwiw I used the fp8 cast with triton-cpu method, then used mainline lcpp to convert to bf16, then run this for imatrix. you can use -ub 4096 -b 4096 or leave it off, shouldn't matter. I've tried both ways and still missing above tensors. I'll fool with it some more and maybe leave of -mla 1 but I'm definitely not using -fmoe unless something changed and missed it.

#ik_llama.cpp main@6d2e7ca4

numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
    -m "$model" \
    -fa -mla 1 \
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/data/models/ubergarm/DeepSeek-V3.1-Terminus-GGUF/imatrix-DeepSeek-V3.1-Terminus-Q8_0.dat \
    --verbosity 1 \
    --ctx-size 512 \
    --numa numactl \
    --threads 128 \
    --threads-batch 192 \
    --no-mmap

@Thireus

I belive this is the culprit: https://github.com/ikawrakow/ik_llama.cpp/pull/741

I must explicitly disable fug now pretty sure, going to re-run this imatrix ASAP sorry for the hassle!

Just uploaded corrected imatrix dat file, sha256sum 41833a7e9e58acaf65662f7fb250f47d577d15913eea44fec4eecf80519e7d27.

Confirmed properly now that it has all the imatrix data for all the ffn tensors etc.

Now back to quantizing again!

Thank you! It's working.

Thireus changed discussion status to closed

Sign up or log in to comment