pruned version

#16
by pirola - opened

Hi. As the owner of a small giant, the RTX 5080, I need a pruned, high-quality DAQ NVFP4 version of this model to be able to run NemoClaw locally. It looks like all of NVIDIA's small-model releases focus on the RTX 5090 only... please also support the 16 GB VRAM boards.

16GB is going to be impossible for this model without REAP or similar removal/merging of experts. The best NVFP4 quant available (which uses the official recipe) is close to 20GB: https://huggingface.co/chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4. You could try that or a 4-bit GGUF with CPU/RAM offloading, though, and may be pleasantly surprised.
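
As a rough sanity check on why 16GB is out of reach, here is a back-of-the-envelope sketch of the weight footprint alone. It assumes about 30B total parameters (taken from the model name) and about 4.5 bits per weight for NVFP4 (4-bit values plus one 8-bit scale per 16-element block). The 2 GB reserve for KV cache and activations is a made-up placeholder, and the function names are hypothetical. Real checkpoints keep some layers in higher precision, which is presumably why the quant linked above lands closer to 20GB.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, reserve_gb: float = 2.0) -> bool:
    """True if the weights plus a reserve (hypothetical value) fit in VRAM."""
    return weight_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    # Assumptions, not figures from the model card:
    #   30.0 -> total parameters in billions, read off the model name
    #   4.5  -> NVFP4: 4-bit values + one 8-bit scale per block of 16
    size = weight_gb(30.0, 4.5)
    print(f"estimated weights: {size:.1f} GB")
    print("fits in 16 GB:", fits_in_vram(30.0, 4.5, vram_gb=16.0))
    print("fits in 24 GB:", fits_in_vram(30.0, 4.5, vram_gb=24.0))
```

Because every expert of an MoE model has to be resident (or offloaded) even though only about 3B parameters are active per token, the total parameter count is what drives this estimate.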

For something that runs entirely on your card, the best model in this series is likely the dense Mamba hybrid 4B Nemotron-3 at FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8. For NemoClaw-style work, also consider FP8 or 4-bit quants of Qwen3.5 9B.

NVIDIA org

Check out this setup from Sudo su:

"i pointed hermes agent at nvidia's nemotron cascade 2 30B-A3B on a single RTX 3090 24GB. IQ4_XS quant by bartowski, 187 tok/s, 625K context. had it discover its own hardware, create an identity file, then build a full GPU marketplace UI from a single prompt."

https://x.com/sudoingX/status/2037512256599306578?s=20
