NVFP4 Quantizer

#1
by pirola - opened

Hi Michael. On which GPU are you running this model? What decoding speed are you achieving? And would you share your quantization tool?

Hello Pirola,
I am using an RTX 5090. The first generic NVFP4 GPU kernel was merged into llama.cpp, so you can now run it without any experimental PR.
On the most recent build I am getting these results:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           pp512 |      7662.02 ± 29.70 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           tg128 |        221.35 ± 1.31 |
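For anyone wanting to reproduce numbers in this format: tables like the above are what llama.cpp's `llama-bench` tool prints. A minimal sketch of the invocation follows; the model filename is a placeholder, not the exact file used here, and `pp512`/`tg128` are llama-bench's default prompt-processing and token-generation tests.

```shell
# Hypothetical llama-bench run (model path is a placeholder).
# -ngl 99 offloads all layers to the GPU, matching the "ngl" column
# in the table; the default tests are pp512 and tg128.
./build/bin/llama-bench \
  -m models/nemotron_h_moe-31B-A3.5B-NVFP4.gguf \
  -ngl 99
```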

I am still working on the Blackwell-specific kernel, which I will post as a PR soon. This is an early result; I still need to get token-generation speed back up without losing prefill throughput.

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           pp512 |     10865.83 ± 40.35 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  19.28 GiB |    31.58 B | CUDA       |  99 |           tg128 |        188.51 ± 4.75 |

The quantization tool is too rough to post for now, but if there is a model you'd like me to put up, I'll be happy to.

These are very nice results.
Maybe we could "join forces": I am also targeting Cascade 2, but on its smaller brother, the RTX 5080 with only 16 GB of VRAM. To get there I will try REAM today (https://bknyaz.github.io/blog/2026/moe/). They haven't released their code yet, so I am using Claude to implement it. If it ever works, it would be nice to integrate your tool.

How does your quantization compare to chankhavu/Nemotron-Cascade-2-30B-A3B-NVFP4?

And what KV cache format are you using?
