Description

This repo contains specialized MoE quants of Nemotron-Cascade-2-30B-A3B. The idea: since the expert FFN tensors are huge compared to the rest of the tensors in the model, quantizing them more aggressively should achieve better quality at a smaller overall model size than a comparable naive quantization. To that end, the default quantization type is kept at high quality while the FFN up and FFN gate tensors are quantized down, along with the FFN down tensors.
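A rough back-of-the-envelope sketch of why this works. The 90/10 parameter split below is an illustrative assumption, not this model's exact tensor breakdown; the BPW values are the actual llama.cpp block sizes for these types:

```python
# Bits per weight for some llama.cpp quant types (block bytes / block size).
BPW = {"Q8_0": 8.5, "Q5_0": 5.5, "Q4_0": 4.5}

def model_bpw(split):
    """Average bits per weight for a list of (param_fraction, quant_type)."""
    return sum(frac * BPW[qtype] for frac, qtype in split)

# Naive quantization: the whole model at Q5_0.
naive = model_bpw([(1.0, "Q5_0")])

# MoE-aware mix: assume ~90% of parameters sit in the expert FFN tensors
# (illustrative figure) and quantize those to Q4_0, keeping the remaining
# ~10% (attention, embeddings, ...) at Q8_0.
mixed = model_bpw([(0.9, "Q4_0"), (0.1, "Q8_0")])

print(f"naive Q5_0: {naive:.2f} BPW, mixed Q4_0/Q8_0: {mixed:.2f} BPW")
# → naive Q5_0: 5.50 BPW, mixed Q4_0/Q8_0: 4.90 BPW
```

The mixed model comes out smaller than the uniform Q5_0 one even though the quality-critical non-expert tensors stay at Q8_0.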

Notes

This model is a little weird, similar to the other recent Nemotrons. There is no ffn_gate_exps tensor, and the ffn_up_exps and ffn_down_exps tensors have a row size of 2688 elements, which is not divisible by 256 and therefore incompatible with most Q*_K quantizations.

Therefore, most of these quants use IQ4_NL, Q4_0/Q4_1, and Q5_0/Q5_1 for the FFN tensors.
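The incompatibility comes down to block sizes: the K-quants pack weights into 256-element super-blocks, while the legacy quants (Q4_0/Q4_1/Q5_0/Q5_1) and IQ4_NL use 32-element blocks, so a 2688-element row only tiles cleanly with the latter. A quick check:

```python
QK_K = 256   # super-block size used by Q*_K quants in llama.cpp
QK = 32      # block size used by Q4_0/Q4_1/Q5_0/Q5_1 and IQ4_NL

row_size = 2688  # per-row element count of ffn_up_exps / ffn_down_exps

print(row_size % QK_K)  # 128 -> rows don't divide into 256-element super-blocks
print(row_size % QK)    # 0   -> 32-element blocks fit exactly
```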

| Quant | Size | Mixture (default / FFN up / FFN gate / FFN down) | PPL | Mean PPL(Q)/PPL(base) − 1 | KLD |
| --- | --- | --- | --- | --- | --- |
| Q8_0 | 31.27 GiB (8.51 BPW) | Q8_0 (reference) | 9.743360 ± 0.072693 | +0.1278% | 0.003439 ± 0.000025 |
| Q5_K_M | 27.00 GiB (7.34 BPW) | Q8_0 / Q5_1 / X / Q8_0 | 9.752863 ± 0.072779 | +0.2255% | 0.004316 ± 0.000033 |
| Q4_K_M | 21.87 GiB (5.95 BPW) | Q8_0 / Q5_0 / X / Q5_1 | 9.760517 ± 0.072841 | +0.3041% | 0.005375 ± 0.000036 |
| Q4_0 | 19.30 GiB (5.25 BPW) | Q8_0 / Q4_0 / X / Q5_0 | 9.775306 ± 0.072933 | +0.4561% | 0.008387 ± 0.000053 |
| IQ4_XS | 17.59 GiB (4.79 BPW) | Q8_0 / IQ4_NL / X / IQ4_NL | 9.802367 ± 0.073142 | +0.7342% | 0.009969 ± 0.000062 |

(X marks the absent ffn_gate_exps tensor.)

(Figures: KLD graph, PPL graph)

GGUF · 32B params · architecture: nemotron_h_moe

Model tree for AesSedai/Nemotron-Cascade-2-30B-A3B-GGUF

Quantized
(31)
this model