Quant request.

#1
by Ohjaaja - opened

Hey! I found that your model QuietImpostor/Nemotron-3-Nano-80experts-REAP runs great on low VRAM after Q4 quantization. Would it be possible for you to make an MXFP4 quant of it? I tried MXFP4 quantization with mlx_lm on a MacBook Air M4 (16 GB), but it ran out of buffer.

Glad to hear someone is actually using this! Not quite sure how you managed to get it to work, since I had trouble converting it to GGUF. But yes, as soon as I get a chance (and figure out how to quant it with 16 GB of VRAM), I can certainly try.

Edit: it just occurred to me that I don't have a Mac and can't convert it to MLX. However, I can still try to convert it to MXFP4, which might make it easier to convert to MLX afterwards.

I got it converted to GGUF and quantized to MXFP4_MOE with https://github.com/ggml-org/llama.cpp:

```shell
python convert_hf_to_gguf.py \
    ~/.cache/huggingface/hub/models--QuietImpostor--Nemotron-3-Nano-80experts-REAP/snapshots/ddbef50dd4e62e661bfc649e7952bef01a19c68a \
    --outfile ~/nemotron-reap-f16.gguf \
    --outtype f16

llama-quantize \
    ~/nemotron-reap-f16.gguf \
    ~/nemotron-reap-mxfp4.gguf \
    MXFP4_MOE
```

It's still too tight for this tiny Mac. I'll try the quant tomorrow on my 12 GB RTX 3060 with some offloading. What did you prune out of it?
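For the 3060 run, partial offload in llama.cpp is controlled by `-ngl` (number of layers kept on the GPU). A minimal sketch, with the layer count and file path as placeholders to tune, not values tested on this model:

```shell
# Keep roughly 24 of the model's layers on the 12 GB GPU, rest on CPU.
# Lower -ngl if you hit out-of-memory; raise it if VRAM is left free.
llama-cli -m ~/nemotron-reap-mxfp4.gguf -ngl 24 -p "Hello"
```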

Out of 128 experts, I pruned 48. Odd, though: I tried that exact thing, but llama.cpp was complaining about some expert tensors being missing.

```
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model /.../.../Nemotron-Prune-80x1.4B-BF16.gguf
llama_model_load: error loading model: missing tensor 'blk.0.attn_norm.weight'
```
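For future debugging, errors like this can be narrowed down by listing the tensor names in the GGUF before quantizing. A hedged sketch: the `gguf` Python package ships with llama.cpp, but `find_missing_norms`, the example names, and the block count here are my own illustration, not from this thread:

```python
# Check a list of GGUF tensor names for the per-block attention norms
# that llama.cpp expects (e.g. 'blk.0.attn_norm.weight').
# The names could be read from a file with something like:
#   from gguf import GGUFReader
#   names = [t.name for t in GGUFReader("model.gguf").tensors]

def find_missing_norms(tensor_names, n_blocks):
    """Return expected attn_norm tensors absent from tensor_names."""
    expected = {f"blk.{i}.attn_norm.weight" for i in range(n_blocks)}
    missing = expected - set(tensor_names)
    # Sort by block index so the report reads top to bottom.
    return sorted(missing, key=lambda n: int(n.split(".")[1]))

# Example: block 0's norm is missing, matching the error above.
names = ["blk.0.ffn_gate_exps.weight", "blk.1.attn_norm.weight"]
print(find_missing_norms(names, n_blocks=2))  # -> ['blk.0.attn_norm.weight']
```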

And I do have an RTX 5080... I could attempt some weight surgery to get it to load, but that might make the model worse unless I know exactly WHICH experts were pruned. Or I could just try FP16; it could be a problem with BF16.

Edit: upon reviewing the file sizes, my failed attempt was only 38 GB vs. 41 GB for a successful BF16 conversion. I'm quantizing to MXFP4 now!
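For a rough sense of whether the MXFP4 quant fits in 12 GB: MXFP4 packs blocks of 32 FP4 values (16 bytes) plus one shared 8-bit scale, i.e. about 4.25 bits per weight vs. 16 for BF16. A back-of-the-envelope sketch using the ~41 GB BF16 size from this thread; the real file will be somewhat larger since llama.cpp keeps some tensors (embeddings, norms) in higher precision:

```python
# Rough MXFP4 size estimate from the BF16 GGUF size.
bf16_gb = 41                      # successful BF16 conversion, per this thread
bits_per_weight = 17 * 8 / 32     # 16 data bytes + 1 scale byte per 32 weights = 4.25
est_gb = bf16_gb * bits_per_weight / 16
print(f"~{est_gb:.1f} GB")        # roughly 10.9 GB, so plausible on a 12 GB card
```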

QuietImpostor changed discussion status to closed
