4.4 bpw request
16 GB VRAM + 128 GB RAM
or
128 GB shared RAM
Maybe 125-126 GB max size? 120-121?
At the moment I use the Q4_K_S from Unsloth, but I wish the quality were better.
Q4_K_M is too big. =(
IQ4_XS is small enough to fit, but its PPL is too high. =(
This is the set I'm uploading currently (minus the Q8_0, that's just for reference):
| Quant | Size | Mixture | PPL | Mean PPL(Q)/PPL(base) - 1 | KLD |
|---|---|---|---|---|---|
| Q8_0 | 226.43 GiB (8.51 BPW) | Q8_0 | 7.880138 ± 0.060034 | +0.2412% | 0.029715 ± 0.000649 |
| Q5_K_M | 157.23 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 7.871878 ± 0.059897 | +0.1361% | 0.038926 ± 0.000692 |
| Q4_K_M | 130.52 GiB (4.90 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 0.000000 ± 0.000000 | +0.0000% | 0.000000 ± 0.000000 |
| IQ4_XS | 101.10 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 8.290674 ± 0.063543 | +5.4635% | 0.128807 ± 0.001070 |
| IQ3_S | 77.86 GiB (2.92 BPW) | Q8_0 / IQ2_S / IQ2_S / IQ3_S | 8.815764 ± 0.067859 | +12.1430% | 0.282740 ± 0.001687 |
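The percentage column can be sanity-checked by hand: treating it as the relative PPL increase over an (unlisted) base model, the Q8_0 row implies a base PPL, and the other rows should be consistent with it. A quick back-of-envelope check (my own arithmetic, not from the uploader's tooling):

```python
# Sanity check of the PPL-delta column: assume it is PPL(Q)/PPL(base) - 1,
# and back out the implied base PPL from the Q8_0 row.
ppl = {
    "Q8_0":   (7.880138, 0.2412),
    "Q5_K_M": (7.871878, 0.1361),
    "IQ4_XS": (8.290674, 5.4635),
    "IQ3_S":  (8.815764, 12.1430),
}
base = ppl["Q8_0"][0] / (1 + ppl["Q8_0"][1] / 100)  # implied base PPL (~7.861)
for name, (p, pct) in ppl.items():
    delta = (p / base - 1) * 100
    print(f"{name}: {delta:+.4f}% (table says {pct:+.4f}%)")
```

The recomputed deltas match the table to within rounding, which suggests the column really is a relative PPL increase rather than `1 - ratio` (which would come out negative here).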
And now that I'm looking at it, hmm, the Q4_K_M has some NaNs in it, so I think I'm going to have to yank that one and redo it.
There is a bit of a gap between the IQ4_XS at 101 GiB and the Q4_K_M at 130 GiB, so I can tinker a bit with something in between, maybe close to 115-120 GiB?
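Since total size scales almost linearly with bits per weight, the table itself tells you roughly where an in-between mix would land. A minimal sketch of that estimate (my own arithmetic, assuming size ≈ params × bpw / 8, with the parameter count backed out of the Q8_0 row):

```python
# Rough quant-size estimator: back out the implied parameter count from the
# Q8_0 row (226.43 GiB at 8.51 BPW), then scale by a target bits-per-weight.
GIB = 1024**3
params = 226.43 * GIB * 8 / 8.51   # ~228.5B weights implied by the Q8_0 row

def size_gib(bpw):
    return params * bpw / 8 / GIB

print(f"{size_gib(4.4):.1f} GiB")  # a ~4.4 bpw mix lands near 117 GiB
```

That puts a ~4.4 bpw mixture right inside the 115-120 GiB window being discussed.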
Concerning the gaps: M2.5 and M2.7 should be about the same size, right? I don't get why the M2.5 Q4K_XL (from Unsloth) is considerably smaller than their M2.7 Q4K_XL. Are we dealing with another case of sub-optimal GGUFs being uploaded to HF recently? It's getting hard to keep track of what's wrong and what's right.
> I can tinker a bit with something in-between, maybe close to 115-120GiB?
I don't know about others (Mac users, DGX Spark users); I use 16 + 128, so 120-126 GiB (3 GiB in VRAM and 123 in RAM) is the best choice for me. =)
Sure, smaller means faster, but quality is more important.
This will be great.
I'm running a 7900XT (20GB) + 128GB DDR4 so a slightly higher bpw would be great.
Even if you choose not to, I'm fine with the IQ4_XS quant, although I've noticed weird spelling mistakes and missing think tags with IQ quants (not limited to yours, in general).
In any case, thank you for quantizing these models. Your MM2.5 IQ4_XS has a special place on my setup (it's amazing).
Just some info for anyone wondering; I tested out the IQ4_XS a bit on a headless Strix Halo Max+ 395 with 128GB.
MiniMax M2.7, Strix Halo Max+ 395, 128GB
Quantization: IQ4_XS, AesSedai
KV Technique: TurboQuant 3 (tq3 / turbo3)
Model & Output Buffers
- ROCm model: 102,903 MiB
- Host model: 623 MiB
- Host output: 3 MiB
- Total: 107,529 MiB (105.01GiB)
96K Context (98,304 cells)
Buffers
- K (turbo3): 2,325 MiB
- V (turbo3): 2,325 MiB
- ROCm compute: 397 MiB
- Host compute: 204 MiB
- Total: 5,251 MiB
prompt eval time = 1630.50 ms / 48 tokens ( 33.97 ms per token, 29.44 tokens per second)
eval time = 62583.59 ms / 1324 tokens ( 47.27 ms per token, 21.16 tokens per second)
total time = 64214.09 ms / 1372 tokens
192K Context (196,608 cells) (default)
Buffers
- K (turbo3): 4,650 MiB
- V (turbo3): 4,650 MiB
- ROCm compute: 594 MiB
- Host compute: 396 MiB
- Total: 10,290 MiB
prompt eval time = 1439.18 ms / 48 tokens ( 29.98 ms per token, 33.35 tokens per second)
eval time = 51124.55 ms / 1121 tokens ( 45.61 ms per token, 21.93 tokens per second)
total time = 52563.73 ms / 1169 tokens
prompt eval time = 15563.93 ms / 884 tokens ( 17.61 ms per token, 56.80 tokens per second)
eval time = 19406.78 ms / 373 tokens ( 52.03 ms per token, 19.22 tokens per second)
total time = 34970.71 ms / 1257 tokens
prompt eval time = 15362.10 ms / 884 tokens ( 17.38 ms per token, 57.54 tokens per second)
eval time = 19956.13 ms / 415 tokens ( 48.09 ms per token, 20.80 tokens per second)
total time = 35318.23 ms / 1299 tokens
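The tokens-per-second figures in llama.cpp timing lines follow directly from the elapsed time and token count: rate = tokens / (elapsed_ms / 1000). Checking the 96K-context run above:

```python
# Verify llama.cpp's reported throughput from its raw timing numbers.
def tok_per_s(ms, tokens):
    return tokens / (ms / 1000)

print(f"{tok_per_s(1630.50, 48):.2f}")     # prompt eval -> 29.44
print(f"{tok_per_s(62583.59, 1324):.2f}")  # eval -> 21.16
```

The same arithmetic reproduces the 192K-context numbers too, so the figures are internally consistent.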
How is the intelligence of the model, though?
Okay, I've got my rig available again; I fixed the Q4_K_M quant and uploaded it. I'll check on a 4.4bpw quant later today!
Thx!
Nailed it :)
| Quant | Size | Mixture | PPL | Mean PPL(Q)/PPL(base) - 1 | KLD |
|---|---|---|---|---|---|
| Q8_0 | 226.43 GiB (8.51 BPW) | Q8_0 | 7.880138 ± 0.060034 | +0.2412% | 0.029715 ± 0.000649 |
| Q5_K_M | 157.23 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 7.871878 ± 0.059897 | +0.1361% | 0.038926 ± 0.000692 |
| Q4_K_M | 130.67 GiB (4.91 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 7.951215 ± 0.060706 | +1.1453% | 0.059323 ± 0.000771 |
| Q4_K_S | 117.74 GiB (4.42 BPW) | Q8_0 / IQ4_XS / IQ4_XS / Q4_K | 7.968221 ± 0.060797 | +1.3616% | 0.071012 ± 0.000774 |
| IQ4_XS | 101.10 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 8.290674 ± 0.063543 | +5.4635% | 0.128807 ± 0.001070 |
| IQ3_S | 77.86 GiB (2.92 BPW) | Q6_K / IQ2_S / IQ2_S / IQ3_S | 8.815764 ± 0.067859 | +12.1430% | 0.282740 ± 0.001687 |
Q4_K_S is 4.42bpw and should be available in a bit.
Wow! This is amazing, I think!
> Q4_K_S is 4.42bpw and should be available in a bit.
Purrrfect. I tried it, and it's the best quant of all I've tried in the last week. Thank you!
I appreciate the kind words :)