Say Whattt?!

#1
by mtcl - opened

I was not expecting this! I just went to bed, and now I have to go back to my system and queue the download... 💤 Some of us want to sleep, sir!

(Thank you so much 🙏😊)

Oh wow! Thank you John! I've been waiting for this, too.

Owner

Great, glad there is some interest in this experiment! In limited testing it seems pretty good for doing difficult one-shot programming questions that do not require any tool calling / MCP / agentic stuff. You could always dump all the required information into the prompt yourself anyway, or use other models to do that first before using this one for the final result.

Honestly credit goes to @sszymczyk for creating the patches and @bibproj for informing me: https://huggingface.co/sszymczyk/DeepSeek-V3.2-nolight-GGUF/discussions/1

Super curious to hear more about the rumored DS V4 model: https://www.reddit.com/r/LocalLLaMA/comments/1q88hdc/the_information_deepseek_to_release_next_flagship/

With luck it will run similarly to this V3.2-Speciale even without the sparse attention features.

It runs well on my machine: an Intel Xeon with 8-channel memory + a single RTX 5090. I get 12 t/s running IQ5_K. The answers I get are succinct, which is a good thing. I just need to be a smarter and better user of this AI.

@ubergarm

May I ask if you plan to do DeepSeek V3.2 Regular? I am interested in running the more general, all-rounder variant, though it looks like I might as well take an IQ5 for this.

I also wonder: if the next time you did an IQ4 quant, what would happen if the embed and output were q8 and the experts were as usual, down iq5 / gate/up iq4, or down/gate/up all iq4?

@SFPLM

May I ask if you plan to do Deepseek V3.2 Regular?

I don't have plans for that given V4 is rumored to be coming out possibly soon™️ and you can find some mainline versions of it e.g.

I also wonder: if the next time you did an IQ4 quant, what would happen if the embed and output were q8 and the experts were as usual, down iq5 / gate/up iq4, or down/gate/up all iq4?

If you look at my IQ5_K quant recipe, I am setting token embed and output to q8_0:

## Token embedding and output tensors (GPU)
token_embd\.weight=q8_0
output\.weight=q8_0

This is mainly to appease the folks who were asking for full bf16 token embedding and output "head". They claim that anecdotally the responses seem better, but it is very difficult to measure, especially given it doesn't affect overall perplexity/KLD very much. But who knows, I haven't done a full benchmark study to prove that theory wrong, so I decided to split the difference on the IQ5_K, as it is the biggest I usually bother to release haha...

In the future, I could consider extending this logic down to IQ4_K, but the tradition is to use ~4.5bpw for token embedding and a higher ~6bpw for output.
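For concreteness, the traditional assignment would look something like this in the same --custom-q style as the recipe above (a hypothetical fragment; double-check the exact type-name spellings against your quantize build):

```
## Traditional token embedding (~4.5bpw) and output "head" (~6.5bpw)
token_embd\.weight=iq4_k
output\.weight=q6_K
```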

Have you collected any new data on this? Just curious about your experience lately.

@ubergarm

In the future, I could consider extending this logic down to IQ4_K, but the tradition is to use ~4.5bpw for token embedding and high ~6bpw for output.

I have not. I was just curious as to why the high-4-bit IQ4_K quants usually don't have q8 for outputs/embeds, and also why ffn down is quantized to higher bits than ffn gate and ffn up. I have yet to learn how to quantize via llama.cpp or ik_llama.cpp, or the significance of each type of tensor.

I am not sure what the hardware requirements are to quantize, and whether my 512GB DDR5 Xeon Sapphire Rapids + 1x RTX Pro 6000 would be enough to quantize it. I have mainly used V3.1 Terminus IQ4_K ever since you released that.
I don't know much about this stuff, but (looking at your secret recipes for the higher V3.1T quants) I wonder what would happen if I took your recipe and made some modifications like:

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Three Dense Layers [0-2] (GPU)
blk\.(0|1|2)\.ffn_down\.weight=q8_0
blk\.(0|1|2)\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [3-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [3-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq5_k OR iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

## Token embedding and output tensors (GPU)
token_embd\.weight=q8_0
output\.weight=q8_0
"

What would this be? Would it be some sort of "IQ4_K+" or "IQ4_KL", hopefully in the <420GB range, for some weird mongrel quant?

Although I see the quant gguf Doctor posted, as you linked: Q8 + Q4:FFN_UP/FFN_GATE + Q5:FFN_DOWN, so I might go try that. I was waiting for a 3.2 GGUF to upgrade over 3.1; I heard quite good things about 3.2 regular, so maybe that was what I was looking for... or I'll try KTransformers when I get time to explore. Lots on the to-do list, but IMHO it feels kind of miserable when the LLM inference server isn't running because I'm doing other stuff.

I have not. I was just curious as to why the high-4-bit IQ4_K quants usually don't have q8 for outputs/embeds, and also why ffn down is quantized to higher bits than ffn gate and ffn up. I have yet to learn how to quantize via llama.cpp or ik_llama.cpp, or the significance of each type of tensor.

A lot of the history is in two-year-old closed PRs on mainline llama.cpp, typically written by ik, and then in discussions on quant types, imatrix corpora, and the original "mixtures" or recipes like Q4_K_M, which under the hood uses a few different quantization types for different tensors. So there are some traditions that come from that.

You can see them too in exllamav3, where turbo has allocation.py, which does things like making attn_(k|v) higher bpw than attn_(q|output) and making ffn_down_exp one size bigger than ffn_(gate|up)_exp, etc. Typically gate|up are kept the same size to allow for fused operations on ik, etc. There is a lot going on under the hood, with implications based on the exact kernel used on what hardware (CPU/CUDA/Vulkan, etc...)
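Not exllamav3's actual allocation.py, but a toy sketch of the kind of heuristic being described (the base size, step size, and substring matching are made up purely for illustration):

```python
# Toy per-tensor bit-width heuristic in the spirit described above:
# attn_k/attn_v get one step more than attn_q/attn_output, and
# ffn_down gets one step more than ffn_gate/ffn_up, which stay equal
# to each other so fused gate+up kernels remain possible.
def allocate_bpw(tensor_name: str, base_bpw: float = 4.5, step: float = 1.0) -> float:
    if "attn_k" in tensor_name or "attn_v" in tensor_name:
        return base_bpw + step      # K/V projections: one size up
    if "attn_q" in tensor_name or "attn_output" in tensor_name:
        return base_bpw
    if "ffn_down" in tensor_name:
        return base_bpw + step      # down one size bigger than gate/up
    if "ffn_gate" in tensor_name or "ffn_up" in tensor_name:
        return base_bpw             # gate/up kept the same size
    return base_bpw

for name in ["blk.3.attn_k_b.weight", "blk.3.attn_q_b.weight",
             "blk.3.ffn_down_exps.weight", "blk.3.ffn_gate_exps.weight"]:
    print(name, allocate_bpw(name))
```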

If you're interested, I have a talk covering some of this here: https://blog.aifoundry.org/p/adventures-in-model-quantization with some specific discussions of tensors and quantization types (minute 18ish has a lot of quant talk).

I am not sure what the hardware requirements are to quantize, and whether my 512GB DDR5 Xeon Sapphire Rapids + 1x RTX Pro 6000 would be enough to quantize it.

Yes, quanting takes surprisingly little CPU/RAM and zero GPU. I make all my quants on a CPU-only rig (with a ton of RAM, hah), but you probably only need maybe 128 GB of RAM to quant many big models. And patience and disk space.

The only part of quantizing that takes a lot of resources is generating the imatrix as it requires you to run inference on the largest size model you can (frequently the full size bf16). But if you find an imatrix.dat (or imatrix.gguf from mainline converted to .dat for ik) you can just use that. Myself, bartowski, and others often make these available for our models.
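Roughly, the two steps look something like this (an illustrative sketch from memory; the binary names, flags, and file paths here are placeholders and may have drifted between versions, so check --help for the current syntax):

```shell
# 1) Generate the importance matrix: run inference over a calibration
#    corpus with the largest model you can fit (often the full bf16).
#    (Illustrative paths and flags only.)
./build/bin/llama-imatrix \
    -m /models/DeepSeek-bf16.gguf \
    -f calibration_data.txt \
    -o imatrix.dat

# 2) Quantize with that imatrix plus a custom per-tensor recipe
#    (the $custom variable as in the recipes above).
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "$custom" \
    /models/DeepSeek-bf16.gguf \
    /models/DeepSeek-IQ4_K.gguf \
    IQ4_K
```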

Here is a rough guide to the process, though it is kinda out of date on specific syntax at this point: https://github.com/ikawrakow/ik_llama.cpp/discussions/434 My most recent repos have some log files that might give you some more clues.

What would this be? Would it be some sort of "IQ4_K+" or "IQ4_KL", hopefully in the <420GB range, for some weird mongrel quant?

My naming convention is to name the quant mix itself after the quantization used for the ffn_(gate|up)_exps tensors (MoEs). Then I prefix it with "smol" if down is the same size, or use no prefix if down is the usual one size larger. AesSedai, Madison, and DrShotgun have a more explicit naming convention that includes the tensor types for attn/down/gate|up, I believe, similar to how you listed it there, yes.
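As a toy sketch of that convention (a hypothetical helper, not part of any actual tooling):

```python
# Toy sketch of the naming convention described above: name the mix
# after the ffn_(gate|up)_exps quant type, and prefix "smol-" when
# ffn_down_exps is the *same* size instead of the usual one size larger.
def mix_name(gate_up_type: str, down_type: str) -> str:
    base = gate_up_type.upper()
    return f"smol-{base}" if down_type == gate_up_type else base

print(mix_name("iq4_k", "iq5_k"))  # usual case: down one size larger
print(mix_name("iq4_k", "iq4_k"))  # down same size -> "smol" prefix
```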

or try Ktransformers when I get time to explore.

I have a whole guide on ktransformers from over a year ago, haha... I only recommend it if you have a two-socket Intel Xeon and want to load the entire model weights twice (once into each socket's RAM) and attempt to run data parallel; otherwise I don't bother with it anymore after finding ik.

Lots on the to-do list, but IMHO it feels kind of miserable when the LLM inference server isn't running because I'm doing other stuff.

"Potential" is overrated, the present moment is all we have, if you're not enjoying the journey then take a breath and refocus your attention away from thoughts of could be/should/future and all shall be well! Have fun and be gentle on yourself!

Cheers!
