## Description
This repo contains specialized MoE quants of Nemotron-Cascade-2-30B-A3B. Because the FFN expert tensors are huge compared to the rest of the tensors in the model, quantizing them more aggressively while keeping everything else at a high-quality type should yield better quality at a smaller overall size than a comparable naive quantization. To that end, the default quantization type is kept high quality, and the FFN up and FFN gate tensors are quantized down along with the FFN down tensors.
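As a back-of-envelope illustration of why this works: a two-way mixture's overall bits-per-weight is just the size-weighted average of the per-group types. The 85% FFN share and the per-type bpw values below are illustrative assumptions, not measurements of this model:

```python
def mixed_bpw(ffn_fraction: float, ffn_bpw: float, rest_bpw: float) -> float:
    """Overall bits-per-weight of a two-way quant mixture:
    size-weighted average of the FFN and non-FFN tensor types."""
    return ffn_fraction * ffn_bpw + (1.0 - ffn_fraction) * rest_bpw

# Hypothetical split: ~85% of the weights in the FFN expert tensors at
# ~4.5 bpw (IQ4_NL-like), the remaining ~15% kept at Q8_0 (~8.5 bpw).
print(round(mixed_bpw(0.85, 4.5, 8.5), 2))  # → 5.1
```

Even though most tensors stay at Q8_0, the overall size lands close to a pure 4-bit quant because the FFN expert tensors dominate the weight count.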
## Notes
This model is a little unusual, like the other recent Nemotrons: it has no ffn_gate_exps tensor, and the ffn_up_exps and ffn_down_exps tensors have a row size of 2688 elements. Since 2688 is not a multiple of the 256-element super-block used by the Q*_K types, these tensors are incompatible with most Q*_K quantizations.
Most of the quants therefore fall back to IQ4_NL, Q4_0/Q4_1, and Q5_0/Q5_1 (all 32-element block types) for the FFN tensors.
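The block-size constraint can be checked directly: a tensor's row length must be divisible by the quant type's block size. A quick sketch:

```python
ROW_SIZE = 2688  # row length of the ffn_up_exps / ffn_down_exps tensors

# K-quants (Q4_K, Q5_K, Q6_K, ...) pack weights into 256-element
# super-blocks; IQ4_NL and Q4_0/Q4_1/Q5_0/Q5_1 use 32-element blocks.
for name, block in [("Q*_K", 256), ("IQ4_NL / Q4_0 / Q5_0", 32)]:
    print(f"{name}: block={block}, compatible={ROW_SIZE % block == 0}")

# 2688 % 256 == 128, so the K-quants are out; 2688 % 32 == 0, so the
# 32-element block types fit.
```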
| Quant | Size | Mixture (default / ffn_up / ffn_gate / ffn_down; X = no gate tensor) | PPL | Mean PPL(Q)/PPL(base) − 1 | KLD |
|---|---|---|---|---|---|
| Q8_0 | 31.27 GiB (8.51 BPW) | Q8_0 (reference) | 9.743360 ± 0.072693 | +0.1278% | 0.003439 ± 0.000025 |
| Q5_K_M | 27.00 GiB (7.34 BPW) | Q8_0 / Q5_1 / X / Q8_0 | 9.752863 ± 0.072779 | +0.2255% | 0.004316 ± 0.000033 |
| Q4_K_M | 21.87 GiB (5.95 BPW) | Q8_0 / Q5_0 / X / Q5_1 | 9.760517 ± 0.072841 | +0.3041% | 0.005375 ± 0.000036 |
| Q4_0 | 19.30 GiB (5.25 BPW) | Q8_0 / Q4_0 / X / Q5_0 | 9.775306 ± 0.072933 | +0.4561% | 0.008387 ± 0.000053 |
| IQ4_XS | 17.59 GiB (4.79 BPW) | Q8_0 / IQ4_NL / X / IQ4_NL | 9.802367 ± 0.073142 | +0.7342% | 0.009969 ± 0.000062 |
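The relative-PPL column can be sanity-checked from the table itself: backing the unquantized baseline PPL out of the Q8_0 row reproduces the listed percentages for the other rows. A verification sketch using the values above:

```python
# Implied baseline: the Q8_0 row lists PPL 9.743360 at +0.1278% vs base.
base_ppl = 9.743360 / 1.001278

rows = [("Q8_0", 9.743360), ("Q5_K_M", 9.752863), ("Q4_K_M", 9.760517),
        ("Q4_0", 9.775306), ("IQ4_XS", 9.802367)]
for name, ppl in rows:
    print(f"{name}: {(ppl / base_ppl - 1) * 100:+.4f}%")
```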
Base model: nvidia/Nemotron-Cascade-2-30B-A3B
