NVFP4 Quantizer
Hi Michael. On which GPU are you running this model? What decoding speed are you achieving? Would you share your quantization tool?
Hello Pirola
I am using an RTX 5090. The first generic NVFP4 GPU kernel was merged into llama.cpp, so you can now run it without any experimental PR.
On the most recent build I am getting these results:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 | 19.28 GiB | 31.58 B | CUDA | 99 | pp512 | 7662.02 ± 29.70 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 19.28 GiB | 31.58 B | CUDA | 99 | tg128 | 221.35 ± 1.31 |
I am still working on the Blackwell kernel, which I will post as a PR soon. This is an early result: prefill is faster, but I still need to get token generation speed back up without giving up the prefill gain.
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 | 19.28 GiB | 31.58 B | CUDA | 99 | pp512 | 10865.83 ± 40.35 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 19.28 GiB | 31.58 B | CUDA | 99 | tg128 | 188.51 ± 4.75 |
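For anyone who wants the two runs side by side, here is a quick sketch of the arithmetic. It assumes the first table is the merged generic kernel and the second is the in-progress Blackwell kernel. The dictionary names and the `relative_change` helper are made up for illustration, and only the mean t/s values are used (the ± spread is ignored).

```python
# Mean t/s values copied from the two llama-bench tables above.
# Assumption: first table = merged generic NVFP4 kernel,
# second table = work-in-progress Blackwell kernel.
generic = {"pp512": 7662.02, "tg128": 221.35}
blackwell = {"pp512": 10865.83, "tg128": 188.51}


def relative_change(before: float, after: float) -> float:
    """Return the percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Percentage change per test, positive meaning the second run is faster.
changes = {test: relative_change(generic[test], blackwell[test]) for test in generic}

for test, pct in changes.items():
    print(f"{test}: {generic[test]:.2f} -> {blackwell[test]:.2f} t/s ({pct:+.1f}%)")
```

On these numbers prefill (pp512) comes out roughly 42% faster and token generation (tg128) roughly 15% slower, which is the trade-off described above.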
The quantization tool is too rough to post for now, but if you have a model in mind that you'd like me to put up, I'll be happy to.
These are very nice results.
Maybe we could "join forces": I am also targeting Cascade 2, but on the smaller sibling, an RTX 5080 with only 16 GB of VRAM. To get there, I will try REAM today ( https://bknyaz.github.io/blog/2026/moe/ ). They haven't released their code yet, so I am using Claude to implement it. If it ever works, it would be nice to integrate your tool.
How does your quantization compare to chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4?
And what KV cache format are you using?