Description

This repo contains specialized MoE quants of Nemotron-Cascade-2-30B-A3B. The idea: since the expert FFN tensors are huge compared to the rest of the tensors in the model, quantizing them more aggressively should achieve better quality at a smaller overall model size than a comparable naive quantization. To that end, the default quantization type is kept at high quality while the FFN up and FFN gate tensors are quantized down, along with the FFN down tensors.
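A rough back-of-the-envelope sketch of why this works. The 90/10 parameter split below is an illustrative assumption, not this model's exact tensor breakdown; the BPW values are the actual llama.cpp block sizes for these types:

```python
# Bits per weight for some llama.cpp quant types (block bytes / block size).
BPW = {"Q8_0": 8.5, "Q5_0": 5.5, "Q4_0": 4.5}

def model_bpw(split):
    """Average bits per weight for a list of (param_fraction, quant_type)."""
    return sum(frac * BPW[qtype] for frac, qtype in split)

# Naive quantization: the whole model at Q5_0.
naive = model_bpw([(1.0, "Q5_0")])

# MoE-aware mix: assume ~90% of parameters sit in the expert FFN tensors
# (illustrative figure) and quantize those to Q4_0, keeping the remaining
# ~10% (attention, embeddings, ...) at Q8_0.
mixed = model_bpw([(0.9, "Q4_0"), (0.1, "Q8_0")])

print(f"naive Q5_0: {naive:.2f} BPW, mixed Q4_0/Q8_0: {mixed:.2f} BPW")
# → naive Q5_0: 5.50 BPW, mixed Q4_0/Q8_0: 4.90 BPW
```

The mixed model comes out smaller than the uniform Q5_0 one even though the quality-critical non-expert tensors stay at Q8_0.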

Notes

This model is a little weird, similar to the other recent Nemotrons. There is no ffn_gate_exps tensor, and the ffn_up_exps and ffn_down_exps tensors have a row size of 2688 elements, which is not divisible by 256 and therefore incompatible with most Q*_K quantizations.

Therefore, most of these quants use IQ4_NL, Q4_0/Q4_1, and Q5_0/Q5_1 for the FFN tensors.
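The incompatibility comes down to block sizes: the K-quants pack weights into 256-element super-blocks, while the legacy quants (Q4_0/Q4_1/Q5_0/Q5_1) and IQ4_NL use 32-element blocks, so a 2688-element row only tiles cleanly with the latter. A quick check:

```python
QK_K = 256   # super-block size used by Q*_K quants in llama.cpp
QK = 32      # block size used by Q4_0/Q4_1/Q5_0/Q5_1 and IQ4_NL

row_size = 2688  # per-row element count of ffn_up_exps / ffn_down_exps

print(row_size % QK_K)  # 128 -> rows don't divide into 256-element super-blocks
print(row_size % QK)    # 0   -> 32-element blocks fit exactly
```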

| Quant | Size | Mixture (default / FFN up / FFN gate / FFN down) | PPL | Mean PPL(Q)/PPL(base) − 1 | KLD |
| --- | --- | --- | --- | --- | --- |
| Q8_0 | 31.27 GiB (8.51 BPW) | Q8_0 (reference) | 9.743360 ± 0.072693 | +0.1278% | 0.003439 ± 0.000025 |
| Q5_K_M | 27.00 GiB (7.34 BPW) | Q8_0 / Q5_1 / X / Q8_0 | 9.752863 ± 0.072779 | +0.2255% | 0.004316 ± 0.000033 |
| Q4_K_M | 21.87 GiB (5.95 BPW) | Q8_0 / Q5_0 / X / Q5_1 | 9.760517 ± 0.072841 | +0.3041% | 0.005375 ± 0.000036 |
| Q4_0 | 19.30 GiB (5.25 BPW) | Q8_0 / Q4_0 / X / Q5_0 | 9.775306 ± 0.072933 | +0.4561% | 0.008387 ± 0.000053 |
| IQ4_XS | 17.59 GiB (4.79 BPW) | Q8_0 / IQ4_NL / X / IQ4_NL | 9.802367 ± 0.073142 | +0.7342% | 0.009969 ± 0.000062 |

(X marks the absent ffn_gate_exps tensor.)

(Figures: KLD graph, PPL graph)

GGUF · 32B params · architecture: nemotron_h_moe

Model tree for AesSedai/Nemotron-Cascade-2-30B-A3B-GGUF

Quantized
(31)
this model