I think the claim that NVFP4 is superior to INT4 still needs more evidence.
I've run some tests on Qwen3 4B comparing the two at the same scale format used on Blackwell (one FP8 E4M3 scale per 16 elements): block-scaled INT4 achieved better KL divergence, while NVFP4 did slightly better on benchmarks like PIQA and HellaSwag.
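For concreteness, here's roughly what the block-scaled INT4 scheme I tested looks like (a minimal NumPy sketch; the E4M3 rounding is approximate, ignoring subnormals and NaN encoding, and all names are mine, not from any library):

```python
import numpy as np

def round_to_e4m3(x):
    # Approximate round-to-nearest FP8 E4M3 (3 stored mantissa bits, max 448).
    # Not bit-exact: subnormals and saturation are handled only roughly.
    m, e = np.frexp(x)               # x = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0    # keep 4 significant bits (1 implicit + 3 stored)
    return np.clip(np.ldexp(m, e), -448.0, 448.0)

def quantize_int4_blockscaled(w, block=16):
    # Symmetric INT4 (codes in [-7, 7]) with one FP8 E4M3 scale per `block`
    # elements, mirroring the Blackwell-style scale format described above.
    w = w.reshape(-1, block)
    scale = round_to_e4m3(np.abs(w).max(axis=1, keepdims=True) / 7.0)
    scale = np.where(scale == 0, 1.0, scale)   # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -7, 7)    # INT4 codes
    return (q * scale).reshape(-1)             # dequantized, for measuring error / KL

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
wq = quantize_int4_blockscaled(w)
print("max abs error:", np.abs(w - wq).max())
```

KL divergence between the original and quantized model's output distributions is then measured on the dequantized weights; the per-16 scale granularity is what makes this a fair comparison against NVFP4, which uses the same block size.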
To benefit from NVFP4 hardware, both multiplicands have to be in NVFP4, which can be achieved by quantizing activations on the fly. So in addition to the QAT you mentioned for the weights, the model also has to be adapted to accept NVFP4 activations.
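On-the-fly activation quantization amounts to snapping each per-16 block onto the FP4 E2M1 grid. A minimal sketch (my own illustration; the scale is kept in FP32 here for brevity, where real NVFP4 would store it as FP8 E4M3):

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the NVFP4 element format.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(x, block=16):
    # One scale per `block` elements, chosen so the block max maps to 6.0
    # (the E2M1 maximum), then each element rounded to the nearest grid value.
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0, 1.0, scale)   # all-zero blocks stay zero
    v = x / scale
    idx = np.abs(np.abs(v)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(v) * E2M1_GRID[idx]            # signed nearest E2M1 value
    return (q * scale).reshape(-1)
```

Since this runs in the forward pass on every activation tensor, the quantization error it introduces is exactly what the model needs to be adapted (or trained) to tolerate.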