All this talk about NVFP4 - why is it dog slow?
Running this model on a Blackwell-class DGX Spark yields a sleep-inducing 16 t/s. The similarly sized Qwen3.5 122b-a10b runs about twice as fast with an AutoRound quant, and it benchmarks better in most cases.
Where are those NVFP4 optimizations for Blackwell we have been hearing about? Less marketing, more engineering please.
In some published evaluations, Nemotron-3 Super was several times faster than Qwen3.5-122B. In reality, Qwen3.5-122B A10B reaches about 16 tokens/s under NVFP4, while Nemotron-3 Super only manages 15-17 tokens/s.
Here is my command for Nemotron-3 Super:
docker run --gpus all -itd --rm \
  -e FLASHINFER_FUSED_MOE_DISABLE_CUTLASS=1 \
  -e TORCHINDUCTOR_MAX_AUTOTUNE=0 \
  -e VLLM_DISABLE_PYNCCL=1 \
  -e NCCL_IB_DISABLE=1 \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e OMP_NUM_THREADS=12 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_NVFP4_GEMM_BACKEND=cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  -v /home/hsyue/models/llm/$model:/model/$model \
  -p 8080:8080 \
  vllm/vllm-openai:cu130-nightly \
  --model /model/$model \
  --port 8080 \
  --host 0.0.0.0 \
  --async-scheduling \
  --served-model-name nvidia/nemotron-3-super \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 1 \
  --swap-space 0 \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill \
  --max-num-seqs 512 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}' \
  --reasoning-parser-plugin "/model/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/super_v3_reasoning_parser.py" \
  --reasoning-parser super_v3
Is there anything wrong with my setup?
NVIDIA has continued to offer no answer.
You are disabling a lot of potential optimizations. It could well be that they don't work on DGX Spark yet, but falling back will still be slower. I would try building the latest vLLM from source and see. Except that the latest vLLM doesn't even load the model for me (Qwen 3.5 works fine), so I don't know :-)
It runs, but none of the alleged NVFP4 optimizations exist.
Avarok's custom docker container makes FP4 faster than INT4 by a reasonable margin on GB10. Worth a try, though it's still experimental.
I don't know if anyone has yet done in-depth testing of whether there's quality degradation specific to the Avarok container; this would be an ideal opportunity for someone to do that, since Nvidia itself made the NVFP4 quant, so it's not an unknown-quality community version. If nobody else has done it in a few days, I might have time.
I'd like to add that I run an RTX Pro 6000 as well, which in theory has NVFP4 support; I've yet to come close to Qwen 122B numbers, let alone GPT-Oss-120b numbers. Maybe it requires the MTP pieces, which aren't working anywhere as far as I can tell. It does feel a bit bait-and-switch as it currently stands.
Apparently the focus is on $300K DGX B200 systems and not the huge number of folks like us trying to run on Sparks and RTX Pros. The fact that they are boasting the speed advantage everywhere, when it really doesn't show up where it can make a difference at the edge is super frustrating. So far the company seems to be ignoring the disappointing state of NVFP4 on "prosumer" Blackwell hardware.
@josephbreda deeply frustrating, isn't it. I dropped $13k on a rig to use these models and here we are, left high and dry.
For Spark, have you tried Eugr's container with his specific "nemotron-3-super-nvfp4" recipe? They're quoting 16.55 tps with 1 spark and 27.23 tps on two stacked. I'm not sure how old those tests are. There has been a lot of rapid advancement even in the past week.
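For what it's worth, the quoted Eugr numbers imply a decent but sublinear scaling from one Spark to two; a quick arithmetic check (just restating the figures above, nothing assumed beyond them):

```python
# Scaling check on the quoted Eugr recipe numbers:
# 16.55 t/s on one Spark, 27.23 t/s on two stacked.
single_tps = 16.55
dual_tps = 27.23

speedup = dual_tps / single_tps            # gain from adding a second node
efficiency = dual_tps / (2 * single_tps)   # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

So the second Spark buys roughly a 1.65x speedup, about 82% of ideal linear scaling, which is not bad for a two-node setup over a network link.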
Also, again, I would recommend people give the Avarok experimental container a try. It is specifically optimized for nvfp4 and delivers actual FP4 performance improvements above INT4.
Not excusing Nvidia for not having all this sorted out months ago, I am myself also extremely disappointed in the state of Spark's backend.
Yes. I have tried both of those. This model runs at ~ 16 t/s, which is far below the theoretical bandwidth and compute capabilities of the DGX Spark. This is a model with 12B activated parameters per token. It should generate at least 30 t/s.
To put it in perspective, Qwen 3.5 122b runs at 3X the speed with 10b active parameters, and the much larger Qwen 3.5 397B runs at about 2X the speed with 17B activated parameters per token (both with high-quality Intel AutoRound quants).
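The "at least 30 t/s" expectation checks out on a napkin, assuming decode is memory-bandwidth-bound on the active weights and taking the published ~273 GB/s figure for the Spark's LPDDR5X (both assumptions; real ceilings are lower once KV cache reads, scales, and overhead are counted):

```python
# Back-of-envelope decode ceiling for Nemotron-3 Super on DGX Spark,
# assuming memory-bandwidth-bound decode over the active weights only.
bandwidth_gb_s = 273          # published DGX Spark LPDDR5X spec (assumption)
active_params = 12e9          # ~12B activated parameters per token
bytes_per_param = 0.5         # NVFP4 ~ 4 bits/weight, ignoring scale factors

gb_read_per_token = active_params * bytes_per_param / 1e9  # ~6 GB/token
ceiling_tps = bandwidth_gb_s / gb_read_per_token

print(f"theoretical ceiling ~{ceiling_tps:.1f} t/s vs ~16 t/s observed")
```

Even at half that ceiling you'd expect well over 20 t/s, so ~16 t/s suggests the FP4 kernels aren't actually doing their job here.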
I think the community has gone above and beyond to support the state of inference on this hardware. It is time for NVIDIA to step up and either announce material support for these "inferior" Blackwell products, or fess up to the fact that they will never deliver the promised performance. Right now NVFP4 runs slower than quants not specifically targeted for NVIDIA chips!
If it's any consolation while we wait for this situation to improve: tests show it's not actually a very good model. I suspect it's intended more as a tool for others to develop targeted models with, and to demonstrate how to properly use NVFP4 to minimize quantization loss, rather than something they expect people will actually want to use 24/7.
If you're using this to code, Qwen3-Coder-Next in FP8 is a far better choice, it runs very well, comfortably fits in 128GB with lots of context, and there's a preset recipe for it in Eugr's repo. If you want a general assistant, Qwen3.5-122B Autoround is quite competent but you'll want to use launch parameters to limit or disable thinking. And on RTX 6000 Pro 96GB you're likely best off with the excellent Qwen3.5 27B FP8 or 122B in NVFP4 if FP4 actually works properly on that hardware.
Thanks -- I'm running Qwen3.5 122b AutoRound on a two-Spark cluster and it works reasonably well apart from some documented tool calling issues between vLLM and OpenCode. I know I can disable thinking -- not aware it could be limited. How does that work?
@josephbreda if your intended use is coding, Eugr's container for qwen3-coder-next-FP8 is something you should definitely try out. I just got it running on my dual spark cluster like yours and I'm getting 64 tokens per second! This is the kind of performance I was hoping for when I invested in this kit. Hopefully something for folks to look forward to if NVFP4 ever gets fixed as they should be able to expect even higher tokens per second on such a setup. I haven't tried it single-node yet but it should be very performant as well. And even with 256K context there's plenty of room for concurrency.
If you decide to try it out I recommend adding this to your launch command:
--kv-cache-dtype auto \
By default it appears to be using FP8 and I got a lot of looping behavior.
To answer your question about limiting thinking: you can adjust it persistently via your launch/sampling parameters - increase "presence penalty" to tweak how much thinking it does (0 for lots of thinking, up to 2.0 for less). You should also be able to just pass "chat_template_kwargs": {"enable_thinking": false} in your frontend of choice if you want to disable it on demand.
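For anyone wiring this up by hand, here's a sketch of what that request body looks like against vLLM's OpenAI-compatible /v1/chat/completions endpoint (the model name, host, and penalty value are placeholders; whether enable_thinking is honored depends on the model's chat template):

```python
import json

# Sketch of a per-request payload for vLLM's OpenAI-compatible server.
# "model" must match whatever --served-model-name you launched with.
payload = {
    "model": "qwen3.5-122b",  # placeholder served-model-name
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one line."}],
    "presence_penalty": 1.5,  # per the tip above: higher = less thinking
    "chat_template_kwargs": {"enable_thinking": False},  # disable on demand
}

print(json.dumps(payload, indent=2))
# POST this body to http://<spark-host>:8080/v1/chat/completions
```

Most frontends that let you set "extra body" parameters can pass chat_template_kwargs through unchanged.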
I run Qwen3.5-122B-A10B-Q4_K_M.gguf with llama.cpp at ~20 t/s and wanted to try NVFP4 with vLLM, which is claimed to be faster on Spark. And now I'm confused.