4.4 bpw request

#1
by BahamutRU - opened

16 GB VRAM + 128 GB RAM

or

128 GB shared RAM

Maybe 125-126 GB max size? 120-121?

At the moment I use the Q4_K_S from unsloth, but the quality… I wish it were better.

Q4_K_M is too big. =(
IQ4_XS is small enough, but its PPL is too big. =(

This is the set I'm uploading currently (minus the Q8_0, that's just for reference):

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
|---|---|---|---|---|---|
| Q8_0 | 226.43 GiB (8.51 BPW) | Q8_0 | 7.880138 ± 0.060034 | +0.2412% | 0.029715 ± 0.000649 |
| Q5_K_M | 157.23 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 7.871878 ± 0.059897 | +0.1361% | 0.038926 ± 0.000692 |
| Q4_K_M | 130.52 GiB (4.90 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 0.000000 ± 0.000000 | +0.0000% | 0.000000 ± 0.000000 |
| IQ4_XS | 101.10 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 8.290674 ± 0.063543 | +5.4635% | 0.128807 ± 0.001070 |
| IQ3_S | 77.86 GiB (2.92 BPW) | Q8_0 / IQ2_S / IQ2_S / IQ3_S | 8.815764 ± 0.067859 | +12.1430% | 0.282740 ± 0.001687 |
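For anyone wondering how the percentage column relates to the PPL column: it reads as the relative PPL increase of each quant over the unquantized baseline. The baseline PPL isn't stated in the thread, but back-solving from the Q8_0 row it comes out to roughly 7.8612 (an inferred value, so treat it as an assumption); a quick sanity check:

```python
# Reproduce the relative-PPL column of the table above.
# base_ppl is back-solved from the Q8_0 row: 7.880138 / 1.002412 ~= 7.8612.
# This baseline is an assumption -- the thread never states it directly.
base_ppl = 7.880138 / 1.002412

def ppl_delta_pct(ppl_quant: float, ppl_base: float) -> float:
    """Relative perplexity increase of a quant over the baseline, in percent."""
    return (ppl_quant / ppl_base - 1.0) * 100.0

for name, ppl in [("Q8_0", 7.880138), ("Q5_K_M", 7.871878),
                  ("IQ4_XS", 8.290674), ("IQ3_S", 8.815764)]:
    print(f"{name}: {ppl_delta_pct(ppl, base_ppl):+.4f}%")
```

The computed values land on the table's +0.2412%, +0.1361%, +5.4635%, and +12.1430% to within rounding.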

And now that I'm looking at that… hmm, the Q4_K_M has some NaNs in it, so I think I'm going to have to yank that one and redo it.

There is a bit of a gap between the IQ4_XS at 101 GiB and the Q4_K_M at 130 GiB, so I can tinker a bit with something in between, maybe close to 115-120 GiB?
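For a rough sense of where that lands in bits per weight: the total parameter count can be back-solved from the Q8_0 row (226.43 GiB at 8.51 BPW), and from there each bpw target implies a file size. This is back-of-the-envelope arithmetic, not an official parameter count, and real GGUF sizes shift slightly with the per-tensor mixture:

```python
# Back-solve the parameter count from the Q8_0 row (226.43 GiB at 8.51 BPW),
# then estimate file sizes for candidate bpw targets. Rough arithmetic only;
# actual GGUF sizes vary a bit with the quant mixture.
GIB = 2**30

params = 226.43 * GIB * 8 / 8.51   # roughly 229 billion parameters

def size_gib(bpw: float) -> float:
    """Approximate GGUF file size in GiB for a given average bits per weight."""
    return params * bpw / 8 / GIB

for bpw in (4.2, 4.4, 4.6):
    print(f"{bpw} bpw -> ~{size_gib(bpw):.1f} GiB")
```

A 4.4 bpw target comes out around 117 GiB, squarely inside the requested 115-120 GiB window.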

Concerning the gaps: M2.5 and M2.7 should be similar in size, right? I don't get why the M2.5 Q4K_XL (from Unsloth) is considerably smaller than their M2.7 Q4K_XL. Are we dealing with another case of sub-optimal GGUFs being uploaded to HF recently? It's getting hard to keep track of what's wrong and what's right.

> I can tinker a bit with something in-between, maybe close to 115-120GiB?

I don't know about others (Mac users, DGX Spark users); I use 16 + 128, so 120~126 GiB (3 GiB on VRAM and 123 on RAM) is the best choice for me. =)
Sure, smaller is faster, but quality is more important.

This will be great.

I'm running a 7900XT (20GB) + 128GB DDR4 so a slightly higher bpw would be great.

Even if you choose not to, I'm fine with the IQ4_XS quant, although I've noticed weird spelling mistakes and missing think tags with IQ quants (not limited to your quants; in general).

In any case, thank you for quantizing these models. Your MM2.5 IQ4_XS has a special place in my setup (it's amazing).

Just some info for anyone wondering: I tested the IQ4_XS a bit on a headless Strix Halo Max+ 395 with 128 GB.

MiniMax M2.7, Strix Halo Max+ 395, 128 GB

Quantization: IQ4_XS, AesSedai
KV Technique: TurboQuant 3 (tq3 / turbo3)

Model & Output Buffers
- ROCm model: 102,903 MiB
- Host model: 623 MiB
- Host output: 3 MiB
- Total: 107,529 MiB (105.01GiB)

96K Context (98,304 cells)

Buffers
- K (turbo3): 2,325 MiB
- V (turbo3): 2,325 MiB
- ROCm compute: 397 MiB
- Host compute: 204 MiB
- Total: 5,251 MiB

prompt eval time = 1630.50 ms / 48 tokens ( 33.97 ms per token, **29.44 tokens per second**)
eval time = 62583.59 ms / 1324 tokens ( 47.27 ms per token, **21.16 tokens per second**)
total time = 64214.09 ms / 1372 tokens
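(The tokens-per-second figures in these llama.cpp timing lines are just the reciprocal of the ms-per-token figures; converting one to the other:)

```python
# tokens/second is 1000 divided by ms/token, as in the timing lines above.
def tok_per_sec(ms_per_token: float) -> float:
    return 1000.0 / ms_per_token

print(f"{tok_per_sec(33.97):.2f} t/s prompt eval")  # ~29.44 t/s, as reported
print(f"{tok_per_sec(47.27):.2f} t/s generation")   # ~21.16 t/s, as reported
```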

192K Context (196,608 cells) (default)

Buffers
- K (turbo3): 4,650 MiB
- V (turbo3): 4,650 MiB
- ROCm compute: 594 MiB
- Host compute: 396 MiB
- Total: 10,290 MiB

prompt eval time = 1439.18 ms / 48 tokens ( 29.98 ms per token, 33.35 tokens per second)
eval time = 51124.55 ms / 1121 tokens ( 45.61 ms per token, 21.93 tokens per second)
total time = 52563.73 ms / 1169 tokens

prompt eval time = 15563.93 ms / 884 tokens ( 17.61 ms per token, 56.80 tokens per second)
eval time = 19406.78 ms / 373 tokens ( 52.03 ms per token, 19.22 tokens per second)
total time = 34970.71 ms / 1257 tokens

prompt eval time = 15362.10 ms / 884 tokens ( 17.38 ms per token, 57.54 tokens per second)
eval time = 19956.13 ms / 415 tokens ( 48.09 ms per token, 20.80 tokens per second)
total time = 35318.23 ms / 1299 tokens
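One thing the two runs above make easy to verify: the K and V cache buffers scale linearly with context length (2,325 MiB at 96K cells vs 4,650 MiB at 192K), which pins down the per-cell cache cost. The internals of the turbo3 KV format are opaque to me; this is just arithmetic on the reported numbers:

```python
# Sanity-check the KV buffer numbers reported above: the K (and V) buffer
# should scale linearly with the number of context cells.
MIB = 2**20

k_96k, cells_96k = 2325 * MIB, 98_304
k_192k, cells_192k = 4650 * MIB, 196_608

per_cell_96k = k_96k / cells_96k      # bytes of K cache per context cell
per_cell_192k = k_192k / cells_192k

print(f"K cache per cell: {per_cell_96k / 1024:.1f} KiB (96K run)")
print(f"K cache per cell: {per_cell_192k / 1024:.1f} KiB (192K run)")
# Identical per-cell cost -> the buffer grows linearly with context length.
```

At roughly 24 KiB of K cache per cell (and the same again for V), each extra 1K of context costs about 48 MiB on this setup.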


How is the intelligence of the model, though?

Okay, I've got my rig available again; I fixed the Q4_K_M quant and uploaded it. I'll check on a 4.4 bpw quant later today!

BahamutRU changed discussion status to closed
BahamutRU changed discussion status to open

Nailed it :)

| Quant | Size | Mixture | PPL | 1-(Mean PPL(Q)/PPL(base)) | KLD |
|---|---|---|---|---|---|
| Q8_0 | 226.43 GiB (8.51 BPW) | Q8_0 | 7.880138 ± 0.060034 | +0.2412% | 0.029715 ± 0.000649 |
| Q5_K_M | 157.23 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 7.871878 ± 0.059897 | +0.1361% | 0.038926 ± 0.000692 |
| Q4_K_M | 130.67 GiB (4.91 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 7.951215 ± 0.060706 | +1.1453% | 0.059323 ± 0.000771 |
| Q4_K_S | 117.74 GiB (4.42 BPW) | Q8_0 / IQ4_XS / IQ4_XS / Q4_K | 7.968221 ± 0.060797 | +1.3616% | 0.071012 ± 0.000774 |
| IQ4_XS | 101.10 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 8.290674 ± 0.063543 | +5.4635% | 0.128807 ± 0.001070 |
| IQ3_S | 77.86 GiB (2.92 BPW) | Q6_K / IQ2_S / IQ2_S / IQ3_S | 8.815764 ± 0.067859 | +12.1430% | 0.282740 ± 0.001687 |

Q4_K_S is 4.42bpw and should be available in a bit.

Wow! This is amazing, I think!

> Q4_K_S is 4.42bpw and should be available in a bit.

Purrrfect, I tried it. The best quant of all I've tried in the last week. Thank you!

BahamutRU changed discussion status to closed

I appreciate the kind words :)
