How are you running this?

#1
by infernix - opened

SGLang seems incompatible with this quantisation. Same for the BF16. Any hints? Thx!

Yeah, NVFP4 is super specific and NVIDIA's kernel support isn't fully baked yet, not even on all the Blackwells. But the full BF16 is solid if you have the hardware for it. What hardware are you trying to run it on? I would try the BF16 and specify `-q fp8` with vLLM, and it will compress on the fly.
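A minimal sketch of that on-the-fly route (the BF16 repo name here is a placeholder, not the actual repo; the port is arbitrary):

```shell
# Serve the full BF16 checkpoint and let vLLM quantize the weights to
# FP8 at load time, so no pre-quantized checkpoint is needed.
# SOME_ORG/SOME-BF16-MODEL is a placeholder for the real BF16 repo.
vllm serve SOME_ORG/SOME-BF16-MODEL \
    -q fp8 \
    --port 8100
```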

Dual Sparks, i.e. clustered, works.

Dual RTX 6000 Blackwell, e.g. sm120. Will try vLLM FP8, thanks.

That is a perfect setup. Just pass `-q fp8` and vLLM will compress on the fly from BF16.

So vllm won't run this quant using any of its fp4 kernels?

Not sure. Will need to try with vLLM nightly. Back in a bit.

OK, I just verified that this model works perfectly on Blackwells. I checked all three options under vLLM: vision+text, MTA, and text only. I did have problems at first, but for me that was because I didn't have the right nvcc when it went to compile on the first run. To make sure you don't hit the same problem, run `nvcc --version` and `nvidia-smi`: if you load CUDA toolkit 12.8, make sure it is at or below the CUDA version your driver currently supports. The latest driver supports 13.1, but stick with CUDA 12.8 for sm120.
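The check above can be sketched as a one-off shell snippet; the two version strings are illustrative and would come from `nvcc --version` and the header line of `nvidia-smi`:

```shell
# Compare the installed CUDA toolkit version against the maximum CUDA
# version the driver supports. The toolkit must be at or below it.
toolkit="12.8"      # from: nvcc --version
driver_max="13.1"   # from: nvidia-smi header ("CUDA Version: 13.1")

# sort -V does a version-aware sort; if the toolkit sorts first (or
# equal), it is at or below what the driver supports.
if [ "$(printf '%s\n' "$toolkit" "$driver_max" | sort -V | head -n1)" = "$toolkit" ]; then
  echo "toolkit OK (at or below driver support)"
else
  echo "toolkit too new for driver"
fi
```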

Do you have a cmdline? I'm getting all exclamation marks with the nvfp4; the on-the-fly FP8 down from BF16 works well FWIW.

vllm serve trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4 --port 8100 --reasoning-parser qwen3 ... just this alone works. You can drop the vision transformer with --language-model-only.

BUT make sure you go in this order:

  1. uv pip install -U vllm --torch-backend=auto --extra-index-url vllm/vllm-openai:cu128-nightly
  2. uv pip install git+https://github.com/huggingface/transformers
  3. vllm serve trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4 --reasoning-parser qwen3
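Once the server from step 3 is up, a quick smoke test against the OpenAI-compatible endpoint would look something like this (assuming the default port 8000, since step 3 doesn't pass `--port`):

```shell
# Send one chat request to the OpenAI-compatible completions endpoint
# and print the raw JSON response.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```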

Does multimodal work? I have tried a number of things but submitting images just results in !!!!! in the output. Text is fine. VLLM 0.17.1 (basically git main from $now).

edit: never mind. VLLM_CUTLASS is broken; with --moe-backend flashinfer_cutlass everything works fine.
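For anyone landing here later, the invocation with that fix folded in would presumably be (flags as reported in this thread; the port is arbitrary):

```shell
# Serve the NVFP4 checkpoint, forcing the FlashInfer CUTLASS MoE
# backend to work around the broken CUTLASS default.
vllm serve trohrbaugh/Qwen3.5-122B-A10B-heretic-nvfp4 \
    --port 8100 \
    --reasoning-parser qwen3 \
    --moe-backend flashinfer_cutlass
```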

There are many moving parts with local inference; it keeps me busy with each model released. Let me know if you find anything else.
