IQ2_XS?

#6
by tarruda - opened

I see you've re-uploaded to get prompt processing speed improvements.

I've been using @ubergarm 's smol-IQ2_XS with great success locally and it is a perfect fit for 128G mac studio (completely fills my RAM but I get no swapping). Here are some benchmarks I ran: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/8

Any chance you could upload a similar quant for 128G?

In any case, thanks for all your quants so far!

I had a guy @Garf looking for more AesSedai quants here too: https://huggingface.co/ubergarm/Qwen3.5-122B-A10B-GGUF/discussions/3#69b33b98979c7cb090ac4c1b (specifically a Qwen3.5-122B IQ2_XS psure)

No presh I'm just saying ppl are loving your quants! ;p

@tarruda I'll look into a ~128GiB sized quant, will let you know :)

also thank you for the link @ubergarm ! <3

Owner
•
edited Mar 15

Got a mix that comes in precisely at 126GiB (it uses the fused up+gate FFN tensors):

MIX=IQ2_XS
TYPE_FFN_UP_EXPS=IQ2_XS
TYPE_FFN_GATE_EXPS=IQ2_XS
TYPE_FFN_DOWN_EXPS=IQ3_XXS
TYPE_DEFAULT=Q6_K

llama_model_quantize_impl: quant size  = 126172.44 MiB (2.67 BPW)

I'll finish quanting, KLD testing, and get it uploaded in a few hours.

@tarruda IQ2_XS has been uploaded.

Thanks @AesSedai !

I'm downloading and will give it a shot, but I feel that at 132GB/2.67BPW your quant might be too big for 128G.

In contrast ubergarm's smol-IQ2_XS is 122GB/2.34BPW and almost fills the available RAM, though I can load the full 256k context.

In any case I will try your version, maybe I can load with a lower amount of context. Thanks a lot!

Oh, shoot I read that as "it needs to fit into 128GiB" and I was like "Jackpot!", didn't realize you needed to fit it into 128GB oops!

I just checked and quantizing to IQ2_XXS for the Up / Gate reduces it to 117GiB (or 125GB) and doing that plus moving from Q6_K to Q4_K for the default type lowers it to 116.6GiB (122GB).
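For anyone tripped up by the unit dance above (the file listings show GiB, HF shows GB), the conversion is just 1 GiB = 1024³ bytes vs 1 GB = 10⁹ bytes:

```shell
# GiB -> GB: GiB uses 1024^3 bytes per unit, GB uses 1e9.
awk 'BEGIN { printf "117 GiB = %.1f GB\n", 117 * 1073741824 / 1e9 }'
```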

So I'll get an IQ2_XXS quant up tomorrow which should fit better :)

Looking forward to it!

I also have a few questions if you don't mind:

  • Do you publish your quantization command/script anywhere? I saw the comment above where you set some env vars and call llama-quantize, but what about the other quants/repos you upload?
  • If I understood the information I read online correctly, the imatrix captures which layers are more important by running a certain dataset over the full bf16 weights and saving the most-activated layers. In that case, if two devs (say you or unsloth) use the same dataset to generate the imatrix, would the generated files be similar?
  • I recently created a heretic quant of the 397B using ubergarm's imatrix computed for the original model, and it magically seemed to work! Is that a pattern I can generally use? For example, if there's a new/better heretic bf16 gguf of the 397B, could I use your imatrix on it, or would it be better to compute another imatrix for the heretic version?

@tarruda

I can take a crack at some of these, tho I'm not speaking for Aes of course!

  1. The syntax for mainline llama-quantize is slightly different than for ik, but the basic idea is to either set some env vars (kinda like shown here) and have a script, or just call it straight like so:
./build/bin/llama-quantize \
    --tensor-type ffn_down_exps=q4_K \
    --tensor-type ffn_gate_exps=q3_K \
    --tensor-type ffn_up_exps=q3_K \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    --imatrix /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/imatrix-Qwen3.5-397B-A17B-BF16-mainline.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-BF16-00001-of-00017.gguf \
    /mnt/data/models/ubergarm/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-Q3_K.gguf \
    Q8_0 \
    128

Replace any regex and type you want if you want to keep the ssm guys at bf16 or F32 or whatever. That final Q8_0 is what Aes is referring to as the "Default". The final number is $(nproc) for your rig.

  2. Yes, if two devs use the same bf16, the same build of llama.cpp, and the same backend hardware, they should generate the same imatrix data, as there is no sampling involved. Refer to the original PR for more in the Details section: https://github.com/ggml-org/llama.cpp/pull/4861

  3. In theory you want to compute the imatrix against the bf16 you are using to quantize. It does work to use mine in this case, as the heretic version has all the same tensors and dimensions etc. But given that the weights inside the heretic version are different, it would probably have a different imatrix. So while it does work, it may not be as optimal as computing and using the correct one. Now, is it better than using nothing? Maybe? Probably? You could quant two (one with and one without) and measure KLD/PPL to see if it makes a difference. Also, some folks who can't inference the bf16 (because it takes ~800GB RAM) will do a --pure Q8_0 or even Q4_0, use that to make the imatrix, and then go back and use that imatrix to make the "real" quants haha... It's almost an art at this point, or a ton of benchmarking and best guesses.

Cheers!

Owner
•
edited Mar 16

Just to pitch in as well since Ubergarm covered the majority of it:

  1. I don't publish my script just because it's something I tweak as needed. It is semi-automated: I usually run it to make the BF16.gguf, then comment that out, make an imatrix in the terminal directly, then use the script again to produce the quants. I can't upload the script as an attachment so here it is:
Janky quantize.sh script
#!/usr/bin/env bash
set -euo pipefail

source venv/bin/activate

base_path_in="/mnt/srv/snowdrift"
base_path_out="/mnt/srv/snowdrift"

fp16_indir="${base_path_in}/fp16/${1}"
ggml_outdir="${base_path_out}/ggml/${1}"
ggml_outfile="${ggml_outdir}/${1}-BF16.gguf"

# mmproj conversions
# python3 convert_hf_to_gguf.py $fp16_indir --outfile $base_path_out/ggml/$1/mmproj-$1-Q8_0.gguf --outtype q8_0 --mmproj
# python3 convert_hf_to_gguf.py $fp16_indir --outfile $base_path_out/ggml/$1/mmproj-$1-F16.gguf --outtype f16 --mmproj
# python3 convert_hf_to_gguf.py $fp16_indir --outfile $base_path_out/ggml/$1/mmproj-$1-BF16.gguf --outtype bf16 --mmproj
# python3 convert_hf_to_gguf.py $fp16_indir --outfile $base_path_out/ggml/$1/mmproj-$1-F32.gguf --outtype f32 --mmproj

# exit 0

# --fuse-gate-up-exp
# if [ ! -f "$ggml_outfile" ]; then
#     echo "Starting conversion for: ${ggml_outdir}"
#     mkdir -p $ggml_outdir
#     python3 convert_hf_to_gguf.py $fp16_indir --outfile $ggml_outfile --outtype bf16 
# else
#     echo "Output file ${ggml_outfile} already exists, skipping conversion."
# fi

# exit 0

gguf=$(find $ggml_outdir -maxdepth 1 -type f -iname "*BF16*" | head -1)
imatrix=$(find $ggml_outdir -maxdepth 1 -type f -iname "*imatrix*" | head -1)

# Auto-detect or generate imatrix
# imatrix=$(find $ggml_outdir -maxdepth 1 -type f -iname "*imatrix*" | head -1)
# if [ -z "$imatrix" ]; then
#     imatrix="$ggml_outdir/imatrix.gguf"
#     echo "Generating imatrix at: ${imatrix}"
#    ./build/bin/llama-imatrix -m $gguf -f /mnt/srv/host/calibration_datav3_gguf.txt -o $imatrix
# fi

echo "Found BF16: ${gguf}"
echo "Found imatrix: ${imatrix}"

# Ensure the specific subdirectories exist
gguf_outdir="/mnt/srv/snowdrift/gguf/${1}-GGUF/aes_sedai"
mkdir -p $gguf_outdir

# --- Define Recipes Here ---
# Each recipe is a multi-line string of KEY=VALUE pairs inside the array.
# Add new recipes by adding new quoted blocks to the `recipes` array.
recipes=(
    "
    MIX=Q5_K_M
    TYPE_FFN_UP_EXPS=Q5_K
    TYPE_FFN_GATE_EXPS=Q5_K
    TYPE_FFN_DOWN_EXPS=Q6_K
    TYPE_DEFAULT=Q8_0
    "

    "
    MIX=Q4_K_M
    TYPE_FFN_UP_EXPS=Q4_K
    TYPE_FFN_GATE_EXPS=Q4_K
    TYPE_FFN_DOWN_EXPS=Q5_K
    TYPE_DEFAULT=Q8_0
    "

    "
    MIX=IQ4_XS
    TYPE_FFN_UP_EXPS=IQ3_S
    TYPE_FFN_GATE_EXPS=IQ3_S
    TYPE_FFN_DOWN_EXPS=IQ4_XS
    TYPE_DEFAULT=Q8_0
    "

    "
    MIX=IQ3_S
    TYPE_FFN_UP_EXPS=IQ2_S
    TYPE_FFN_GATE_EXPS=IQ2_S
    TYPE_FFN_DOWN_EXPS=IQ3_S
    TYPE_DEFAULT=Q6_K
    "
)

# for GLM-4.6 low BPW quants, add this to the TOP of the tensor stack
# --tensor-type blk\.92=q3_K \

# Loop through each defined recipe
for recipe in "${recipes[@]}"; do
    # Use eval to set the variables for the current recipe
    eval "$recipe"

    echo "--- Starting new quantization recipe ---"
    echo "Default quantization level: ${TYPE_DEFAULT}"

    output_filename="$gguf_outdir/$1-$MIX.gguf"

    echo "Outputting to: $output_filename"

    ./build/bin/llama-quantize \
        --override-kv "MoE_Quantization.ffn_up_exps=str:$TYPE_FFN_UP_EXPS" \
        --override-kv "MoE_Quantization.ffn_gate_exps=str:$TYPE_FFN_GATE_EXPS" \
        --override-kv "MoE_Quantization.ffn_down_exps=str:$TYPE_FFN_DOWN_EXPS" \
        --override-kv "MoE_Quantization.type_default=str:$TYPE_DEFAULT" \
        --tensor-type "ssm_(a|d|conv|dt|norm)=F32" \
        --tensor-type ffn_up_exps=$TYPE_FFN_UP_EXPS \
        --tensor-type ffn_gate_exps=$TYPE_FFN_GATE_EXPS \
        --tensor-type ffn_gate_up_exps=$TYPE_FFN_GATE_EXPS \
        --tensor-type ffn_down_exps=$TYPE_FFN_DOWN_EXPS \
        --imatrix $imatrix $gguf $output_filename $TYPE_DEFAULT
done

# ./build/bin/llama-quantize $gguf $gguf_outdir/$1-Q8_0.gguf Q8_0
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q8_0.gguf Q8_0
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q6_K.gguf Q6_K
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q5_K_M.gguf Q5_K_M
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q5_K_S.gguf Q5_K_S
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q4_K_M.gguf Q4_K_M
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q4_K_S.gguf Q4_K_S
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ4_XS.gguf IQ4_XS
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q3_K_M.gguf Q3_K_M
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q3_K_S.gguf Q3_K_S
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ3_M.gguf IQ3_M
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ3_XS.gguf IQ3_XS
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ3_XXS.gguf IQ3_XXS
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-Q2_K.gguf Q2_K
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ2_S.gguf IQ2_S
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ2_XS.gguf IQ2_XS
# ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-IQ2_XXS.gguf IQ2_XXS

# for making a loop of regular ftype quantizations
# levels=("IQ3_M" "Q4_K_M" "Q5_K_M" "Q6_K")
# levels=("Q8_0")
# for i in "${levels[@]}"
# do
#     echo "Starting conversion for: ${i}"
#     ./build/bin/llama-quantize \
#         $gguf $gguf_outdir/$1-$i.gguf $i
#     # ./build/bin/llama-quantize --imatrix $imatrix $gguf $gguf_outdir/$1-$i.gguf $i
# done
  2. Should be similar or the same, yes. I use bartowski's calibration_datav5 for my imatrix.

  3. Uber's right that it will work because it has the same architecture, but because I'd expect your activations and weights to differ, you'd want to produce a new imatrix. This is especially relevant because these are sparsely-activated MoE models and the heretic abliteration might have changed the expert activations (though I don't think the router should be directly affected by heretic). Either way, if your weights aren't the same, I'd recommend making a new imatrix.
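The recipe trick in the script above is worth calling out: each recipe is just a multi-line string of KEY=VALUE pairs, and `eval` turns it into shell variables before the llama-quantize call. A minimal sketch:

```shell
# Each line of the recipe string is executed as a plain shell assignment.
recipe="
MIX=IQ2_XS
TYPE_FFN_DOWN_EXPS=IQ3_XXS
TYPE_DEFAULT=Q6_K
"
eval "$recipe"
echo "building $MIX (down=$TYPE_FFN_DOWN_EXPS, default=$TYPE_DEFAULT)"
```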

@tarruda IQ2_XXS has been uploaded now.

Started downloading, thanks for it and all the useful information @AesSedai and @ubergarm !

tarruda changed discussion status to closed

Just downloaded and tested the IQ2_XXS and it looks good! I might run some evals on it later.

One thing I noticed is that the IQ2_XXS is about 10% faster than the smol-IQ2_XS from ubergarm. I guess it is because of this thing mentioned in the README.md: "I've uploaded new quants using the new fused Up + Gate conversion, this offers up to a +10% boost in prompt processing speed from my testing."

If I wanted to reproduce this quant locally, with this speed boost, could I use the BF16 published by unsloth (which has weights generated more than 10 days ago) or would I need to re-convert using the original safetensors and a new version of llama.cpp?

You would need to re-convert because the fused Up + Gate is done in the convert_hf_to_gguf.py step to produce the BF16.gguf.

@tarruda

fused Up + Gate conversion

your other option is to use ik_llama.cpp, which does the fusion at run time with --merge-qkv -muge, but is otherwise about the same as the mainline llama-server command. 😅

@ubergarm I tried ik_llama.cpp in the past but found it to be significantly slower on Apple Silicon. It should be more optimized for mixed CPU inference, correct?

i believe ik owns an M2 mac and recently did some ARM NEON optimizations of the gated delta net implementation:

my understanding is macs have pretty good memory bandwidth but less raw compute, but honestly i don't benchmark them myself.

anyway, the pre-fused ggufs folks are now releasing for mainline will give you performance boost on mainline so that is all good too!

I downloaded the original safetensors and converted to BF16 (trying to reproduce quants from scratch) using the new --fuse-gate-up-exps. Seems to have worked, but I have no way to verify!

Then I ran the same llama-quantize command that I ran on heretic (passing in ubergarm's imatrix) on the new BF16, but the resulting file was about 300G! Do I need to modify the llama-quantize command for the fused gate-up?

I took a shot at adding --tensor-type ffn_gate_up_exps=iq2_xs to @ubergarm 's original script, let's see what comes out!

Also just learned about --dry-run which allows me to see the result size before spending 2 hours running the quantization 🤣

Ahh failed with this error:

[  11/1038] blk.0.ffn_gate_shexp.weight          - [  4096,   1024,      1,      1], type =   bf16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  12/1038] blk.0.ffn_gate_up_exps.weight        - [  4096,   2048,    512,      1], type =   bf16, 
====== llama_model_quantize_impl: did not find weights for blk.0.ffn_gate_up_exps.weight


============================================================
Missing importance matrix for tensor blk.0.ffn_gate_up_exps.weight in a very low-bit quantization
The result will be garbage, so bailing out
============================================================

llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_up_exps.weight in a very low-bit quantization
main: failed to quantize model from '/Users/thiago/ml-models/huggingface/tarruda/Qwen3.5-397B-A17B-GGUF/BF16/Qwen3.5-397B-A17B-BF16-00001-of-00019.gguf'

Seems like the importance matrix needs to be recomputed with the fused bf16?

Using AesSedai's new imatrix seems to have fixed it, so my quant will be using ubergarm's recipe and AesSedai's imatrix. Frankenstein's monster!

Apparently everything is working, the final quant looks as good as ubergarm's smol-IQ2_XS but with the extra speed bump on mainline!

This was the command I used:

llama-quantize \
    --tensor-type ffn_down_exps=iq2_xs \
    --tensor-type ffn_up_exps=iq2_xs \
    --tensor-type ffn_gate_exps=iq2_xs \
    --tensor-type ffn_gate_up_exps=iq2_xs \
    --token-embedding-type q4_k \
    --output-tensor-type q6_k \
    --imatrix ~/ml-models/huggingface/AesSedai/Qwen3.5-397B-A17B-GGUF/imatrix.gguf \
    ~/ml-models/huggingface/tarruda/Qwen3.5-397B-A17B-GGUF/BF16/Qwen3.5-397B-A17B-BF16-00001-of-00019.gguf \
    Qwen3.5-397B-A17B-IQ2_XXS.gguf \
    Q8_0 \
    128

I did a few experiments with --dry-run and it seems I can change both --output-tensor-type and --token-embedding-type to q8_0 with no effect on the size (still 2.46 BPW). I also tried bf16 for both types and it only caused a minor increase in the final size (2.51 BPW).
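A back-of-envelope sketch of why bumping those two tensors barely moves overall BPW: they are a tiny fraction of ~397B total params. The vocab and hidden dims below are hypothetical placeholders, not the real Qwen3.5 numbers:

```shell
# Going from ~8.5 bpw (q8_0) to 16 bpw (bf16) on token_embd + output,
# with a hypothetical 150k vocab x 4096 hidden per tensor:
awk 'BEGIN {
    per_tensor = 150000 * 4096            # hypothetical param count per tensor
    delta_bits = 2 * per_tensor * (16 - 8.5)
    printf "+%.3f BPW overall\n", delta_bits / 397e9
}'
```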

I'm thinking of trying bf16 for those types since I could probably still load with 128k context. Any thoughts about this?

Another question: are --tensor-type ffn_up_exps=iq2_xs and --tensor-type ffn_gate_exps=iq2_xs redundant with ffn_gate_up_exps? Apparently removing those two has no effect on the size.

BTW I'm planning to upload all qwen3.5 BF16 with gate + up fused tensors in case anyone wants to rebuild their quants (seems like the free tier would allow for this!)

@tarruda

I'm thinking of trying bf16 for those types since I could probably still load with 128k context. Any thoughts about this?

There is a contingent of folks who enjoy bf16 token_embd.weight and output.weight non-repeating layers. Nothing wrong with it, and they can even be left on CPU psure to save some space. Personally I don't go over q8_0 for those tensors, as they are still fairly large and potentially always active, which will slow down TG a bit due to needing more memory access during autoregressive decode.

And yes on mainline you need the imatrix to match if it is the fused up|gate exps. On ik he handles it all under the hood automagically now which is nice. You can quantize with ik_llama.cpp using mainline types and it will run fine on mainline. ik can be faster for quantization and has a few subtle tweaks as well.

I'd probably go with q8_0 for both token_embd and output, but you can do whatever you like! It usually doesn't affect PPL/KLD too much. You can even play around with a couple versions and do a vibe check (the most official of all benchmarks in the local LLM world haha)

BTW I'm planning to upload all qwen3.5 BF16 with gate + up fused tensors

Nice! link it here when ur done and I'll give u a <3 !! Thanks for doing all the mac testing of all these quants!

Haha lots of movement here while I was sleeping!

  1. Yes, you need to specify --tensor-type ffn_gate_up_exps=<target> since the fused Up + Gate has a differently named tensor.
  2. --dry-run is amazing and thanks to @ddh0 for adding it when I suggested it :3
  3. Since that tensor name changed, you need a compatible imatrix that was run against the fused BF16; and yes, I did upload my updated imatrices across my quants
  4. Since it's regex-match based, having --tensor-type ffn_up_exps=iq2_xs --tensor-type ffn_gate_exps=iq2_xs left over as redundant rules won't hurt anything
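You can see why those leftover rules are no-ops: the fused tensor's name doesn't contain the old names as substrings, so (assuming substring/regex-style matching like mainline llama-quantize uses) the old patterns simply never fire:

```shell
# Hypothetical check: do the old per-tensor patterns match the fused tensor name?
name="blk.0.ffn_gate_up_exps.weight"
case "$name" in
    *ffn_up_exps*)      echo "ffn_up_exps rule would fire" ;;
    *)                  echo "ffn_up_exps rule never fires" ;;
esac
case "$name" in
    *ffn_gate_up_exps*) echo "ffn_gate_up_exps rule fires" ;;
esac
```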

The free tier is generous certainly, but when talking about multiple 200B+ models and quants it doesn't feel as spacious :P

I've been deep in the quant crafting rabbit hole, it really is addicting @ubergarm !

@AesSedai about this:

I just checked and quantizing to IQ2_XXS for the Up / Gate reduces it to 117GiB (or 125GB) and doing that plus moving from Q6_K to Q4_K for the default type lowers it to 116.6GiB (122GB).

What made you prefer changing the Up/Gate to IQ2_XXS vs downgrading the Down to IQ2_XS (from IQ3_XXS, which you used in your XS recipe)? Is there any advantage to keeping the Down tensor at higher precision?

@ubergarm just answered my question here

So keeping Down higher than Gate/Up is the way to go, got it. That explains why one of my experiments failed miserably:

    TYPE_FFN_GATE_UP_EXPS=IQ3_XXS
    TYPE_FFN_DOWN_EXPS=IQ1_S
    TYPE_DEFAULT=Q4_K

@ubergarm 's smol-IQ2_XS is awesome and allows up to 256k context, but I don't need that much context, so I'm basically tweaking these types to see if I can find a sweet spot between ubergarm's smol-IQ2_XS and AesSedai's IQ2_XS (extracting as much performance as I can from my hardware).

@tarruda

You're learning fast! Also thanks for all the good PR with all the 2 bit HN haters: https://news.ycombinator.com/item?id=47476422 lolol

@tarruda

What made you prefer changing the Up/Gate to IQ2_XXS vs downgrading the Down to IQ2_XS (from IQ3_XXS, which you used in your XS recipe)? Is there any advantage to keeping the Down tensor at higher precision?

The Down tensor is more sensitive to quantization than the Up / Gate tensors from my previous testing, so usually I keep Down one quant level higher. I've tested leaving Down two quant levels higher in a few situations and that didn't help the KLD / PPL, just increased the file size, but that was also testing like Q4 / Q4 / Q6, so I'm not sure if the same finding holds for smaller bpw ranges like IQ2 / IQ3. Also, specifically for that objective, I was trying to minimize the file size, and dropping the Up / Gate quant level shaves size off of two tensors instead of one.

I created my own quant based on your recipes/scripts: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF

Copied @ubergarm 's smol-IQ2_XS but changed DOWN to iq3_xxs and UP/GATE to iq2_xxs, and used @ubergarm 's imatrix, which I found to be a little better in the tests I ran locally (it was always scoring a bit higher on MMLU).

Still running benchmarks on it (right now ifeval, takes about 48h), but so far the results have been good:

  • MMLU is still within the margin of error, jumped 0.1% to 87.96% from 87.86%
  • GPQA diamond with cot enabled jumped to 86.36% from 82.32%. I couldn't find official bf16 Qwen numbers for GPQA diamond, but according to this, the FP8 version scores 87.1%

Also noticed about a 5% improvement in token generation speed, from 20 tps to 21 tps (I assume that is because of the iq3_xxs)

Thank you two for the guidance!
