NVFP4 Quantizer

#1
by pirola - opened

Hi Michael. On which GPU are you running this model? What decoding speed are you achieving? And would you share your quantization tool?

Hello Pirola,
I am using an RTX 5090. The first generic NVFP4 GPU kernel was merged into llama.cpp, so you can now run it without any experimental PR.
On the most recent build I am getting these results:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           pp512 |      7662.02 ± 29.70 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           tg128 |        221.35 ± 1.31 |
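For anyone wanting to reproduce numbers in this format: tables like the above are what llama.cpp's `llama-bench` tool prints. A minimal sketch of the invocation follows; the model filename is a placeholder, not the exact file used here, and `pp512`/`tg128` are llama-bench's default prompt-processing and token-generation tests.

```shell
# Hypothetical llama-bench run (model path is a placeholder).
# -ngl 99 offloads all layers to the GPU, matching the "ngl" column
# in the table; the default tests are pp512 and tg128.
./build/bin/llama-bench \
  -m models/nemotron_h_moe-31B-A3.5B-NVFP4.gguf \
  -ngl 99
```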

I am still working on the Blackwell-specific kernel, which I will post as a PR soon. This is an early result; I still need to get token-generation speed back up without losing prefill throughput.

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           pp512 |     10865.83 ± 40.35 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           tg128 |        188.51 ± 4.75 |

The quantization tool is too rough to post for now, but if there is a model you'd like me to put up, I'll be happy to.

These are very nice results.
Maybe we could "join forces": I am also targeting Cascade 2, but on its smaller brother, the RTX 5080 with only 16 GB of VRAM. To get there I will try REAM today (https://bknyaz.github.io/blog/2026/moe/). They haven't released their code yet, so I am using Claude to implement it. If it ever works, it would be nice to integrate your tool.

How does your quantization compare to chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4?

And what KV cache format are you using?
