How are you running this?

#1
by infernix - opened

SGLang seems incompatible with this quantisation. Same for the BF16. Any hints? Thx!

Yeah, NVFP4 is super specific and NVIDIA's kernel support isn't fully baked yet, not even on all the Blackwells. But the full BF16 is solid if you have the hardware for it. What hardware are you trying to run it on? I would try the BF16 and specify `-q fp8` with vLLM, and it will compress on the fly.
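A minimal sketch of that on-the-fly route (the BF16 repo name here is a placeholder, not the actual repo; the port is arbitrary):

```shell
# Serve the full BF16 checkpoint and let vLLM quantize the weights to
# FP8 at load time, so no pre-quantized checkpoint is needed.
# SOME_ORG/SOME-BF16-MODEL is a placeholder for the real BF16 repo.
vllm serve SOME_ORG/SOME-BF16-MODEL \
    -q fp8 \
    --port 8100
```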

Dual Sparks, i.e. clustered, works.

Dual RTX 6000 Blackwell, e.g. sm120. Will try vLLM FP8, thanks.

That is a perfect setup. Just pass `-q fp8` and vLLM will compress on the fly from BF16.

So vllm won't run this quant using any of its fp4 kernels?

Not sure. Will need to try with vLLM nightly. Back in a bit.

OK, I just verified that this model works perfectly on Blackwells. I checked all three options under vLLM: vision+text, MTA, and text only. I did have problems at first, but for me that was because I didn't have the right nvcc when it went to compile on the first run. To make sure you don't hit the same problem, run `nvcc --version` and `nvidia-smi`: if you load CUDA toolkit 12.8, make sure it is at or below the CUDA version your driver currently supports. The latest driver supports 13.1, but stick with CUDA 12.8 for sm120.
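The check above can be sketched as a one-off shell snippet; the two version strings are illustrative and would come from `nvcc --version` and the header line of `nvidia-smi`:

```shell
# Compare the installed CUDA toolkit version against the maximum CUDA
# version the driver supports. The toolkit must be at or below it.
toolkit="12.8"      # from: nvcc --version
driver_max="13.1"   # from: nvidia-smi header ("CUDA Version: 13.1")

# sort -V does a version-aware sort; if the toolkit sorts first (or
# equal), it is at or below what the driver supports.
if [ "$(printf '%s\n' "$toolkit" "$driver_max" | sort -V | head -n1)" = "$toolkit" ]; then
  echo "toolkit OK (at or below driver support)"
else
  echo "toolkit too new for driver"
fi
```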

Do you have a cmdline? I'm getting all exclamation marks with the nvfp4; the on-the-fly FP8 down from BF16 works well FWIW.

vllm serve trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4 --port 8100 --reasoning-parser qwen3 ... just this alone works. You can drop the vision transformer with --language-model-only.

BUT make sure you go in this order:

  1. uv pip install -U vllm --torch-backend=auto --extra-index-url vllm/vllm-openai:cu128-nightly
  2. uv pip install git+https://github.com/huggingface/transformers
  3. vllm serve trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4 --reasoning-parser qwen3
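Once the server from step 3 is up, a quick smoke test against the OpenAI-compatible endpoint would look something like this (assuming the default port 8000, since step 3 doesn't pass `--port`):

```shell
# Send one chat request to the OpenAI-compatible completions endpoint
# and print the raw JSON response.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```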

Does multimodal work? I have tried a number of things but submitting images just results in !!!!! in the output. Text is fine. VLLM 0.17.1 (basically git main from $now).

edit: never mind. VLLM_CUTLASS is broken; with --moe-backend flashinfer_cutlass everything works fine.
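For anyone landing here later, the invocation with that fix folded in would presumably be (flags as reported in this thread; the port is arbitrary):

```shell
# Serve the NVFP4 checkpoint, forcing the FlashInfer CUTLASS MoE
# backend to work around the broken CUTLASS default.
vllm serve trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4 \
    --port 8100 \
    --reasoning-parser qwen3 \
    --moe-backend flashinfer_cutlass
```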

There are many moving parts with local inference; it keeps me busy with each model released. Let me know if you find anything else.
