Model does not load on vLLM

#1
by mdierolf - opened

The model is not loading on vLLM:

(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/models/utils.py", line 265, in _load_module
(EngineCore_DP0 pid=248480)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=248480)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/models/qwen3_5.py", line 729, in load_weights
(EngineCore_DP0 pid=248480)     return loader.load_weights(weights)
(EngineCore_DP0 pid=248480)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/model_loader/reload/torchao_decorator.py", line 50, in patched_model_load_weights
(EngineCore_DP0 pid=248480)     return original_load_weights(self, weights, *args, **kwargs)
(EngineCore_DP0 pid=248480)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/models/utils.py", line 344, in load_weights
(EngineCore_DP0 pid=248480)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=248480)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/models/utils.py", line 292, in _load_module
(EngineCore_DP0 pid=248480)     yield from self._load_module(
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/models/utils.py", line 265, in _load_module
(EngineCore_DP0 pid=248480)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=248480)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/models/qwen3_5.py", line 642, in load_weights
(EngineCore_DP0 pid=248480)     weight_loader(param, loaded_weight)
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/layers/linear.py", line 594, in weight_loader_v2
(EngineCore_DP0 pid=248480)     param.load_column_parallel_weight(loaded_weight=loaded_weight)
(EngineCore_DP0 pid=248480)   File "/home/mdierolf/gitprojects/vllm/vllm/model_executor/parameter.py", line 153, in load_column_parallel_weight
(EngineCore_DP0 pid=248480)     assert self.data.shape == loaded_weight.shape
Owner

@mdierolf this is only tested for correctness against SGLang, but I think it should work with vLLM on or after this commit: https://github.com/vllm-project/vllm/pull/35156.

I'd need a fuller stack trace anyway, plus repro steps and the vLLM version you tested.
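For a source checkout like the one in the traceback, updating past that PR would look roughly like this (a sketch: the path comes from the traceback above, and the editable-install flags are assumptions about the setup):

```shell
# Refresh a source checkout of vLLM and rebuild the editable install
# so the new model code (e.g. qwen3_5.py) is actually picked up.
cd ~/gitprojects/vllm
git pull origin main
pip install -e . --no-build-isolation
```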

Whoops, my vLLM didn't update properly.

FWIW, after updating vLLM to the latest git commit, I see complete gibberish from the model when CUDA graphs are enabled on an RTX 6000. If I add --enforce-eager, I get proper output but very slow speeds. So there is some sort of bug in vLLM preventing this model from working in NVFP4.
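For reference, the eager-mode workaround as a full launch line would look something like this (a sketch: the model path is this repo, but whether vLLM needs the quantization flag spelled out or auto-detects it from the checkpoint config is an assumption):

```shell
# Correct but slow output on RTX 6000 until the CUDA-graph path is fixed.
vllm serve txn545/Qwen3.5-122B-A10B-NVFP4 \
  --quantization modelopt_fp4 \
  --enforce-eager
```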

I also tried the latest sglang git commit, and it fails to load the model even though the logs show that the Triton backend was selected. If I manually select the Triton backend, I get this:

python -m sglang.launch_server \
  --model-path txn545/Qwen3.5-122B-A10B-NVFP4 \
  --port 11434 \
  --context-length 262144 \
  --served-model-name qwen/qwen3.5-122B \
  --quantization modelopt_fp4 \
  --attention-backend triton

[2026-02-25 11:30:54] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/managers/scheduler.py", line 3116, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/managers/scheduler.py", line 363, in __init__
    self.init_model_worker()
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/managers/scheduler.py", line 559, in init_model_worker
    self.init_tp_model_worker()
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/managers/scheduler.py", line 517, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/managers/tp_worker.py", line 247, in __init__
    self._init_model_runner()
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/managers/tp_worker.py", line 330, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/model_executor/model_runner.py", line 415, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/model_executor/model_runner.py", line 611, in initialize
    self.init_device_graphs()
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/model_executor/model_runner.py", line 2149, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 492, in __init__
    self.capture_bs, self.compile_bs = get_batch_sizes_to_capture(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mdierolf/gitprojects/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 412, in get_batch_sizes_to_capture
    assert len(capture_bs) > 0 and capture_bs[0] > 0, f"{capture_bs=}"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: capture_bs=[0]
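One way to sidestep that capture assertion, at the cost of CUDA-graph performance, would be sglang's --disable-cuda-graph flag (untested here on this model, so treat it as a guess):

```shell
python -m sglang.launch_server \
  --model-path txn545/Qwen3.5-122B-A10B-NVFP4 \
  --quantization modelopt_fp4 \
  --attention-backend triton \
  --disable-cuda-graph  # skips get_batch_sizes_to_capture entirely
```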

All in all, I have not been able to make this work on an RTX 6000

@mdierolf If it helps, this deployment works for me with sglang. The important pieces are --moe-runner-backend and --fp4-gemm-backend (it works with and without the FP8 KV cache, so --kv-cache-dtype is up to you):

      --host 0.0.0.0
      --port 30000
      --context-length 64000
      --tp-size 1
      --dp-size 1
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --served-model-name qwen3.5-122b
      --json-model-override-args '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
      --quantization modelopt_fp4
      --trust-remote-code
      --mem-fraction-static 0.9
      --chunked-prefill-size 16384
      --enable-metrics
      --attention-backend triton
      --moe-runner-backend flashinfer_cutlass
      --fp4-gemm-backend flashinfer_cudnn
      --kv-cache-dtype fp8_e4m3

I am getting around 80 tok/s on an RTX 6000 Pro with the above settings and an empty context. Isn't that on the low side for NVFP4?

@Cossale Yes, I had a similar experience; MXFP4 produced much better results for me. I assume it's not optimized for SM120 yet.

77 tok/s generation on an RTX 6000 Blackwell for me (power limit 300 W).

@adunna Which MXFP4 checkpoint, and what speeds?

Owner

Just a note that I’ve re-uploaded the NVFP4 quants. Accuracy and overall model behavior should be improved compared to the previous version.

I can’t really comment on SM120 optimization, as my primary system is SM100.

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 20.0
Max request concurrency: not set
Successful requests: 20
Benchmark duration (s): 246.14
Total input tokens: 268338
Total input text tokens: 268338
Total generated tokens: 93037
Total generated tokens (retokenized): 92460
Request throughput (req/s): 0.08
Input token throughput (tok/s): 1090.16
Output token throughput (tok/s): 377.98
Peak output token throughput (tok/s): 658.00
Peak concurrent requests: 20
Total token throughput (tok/s): 1468.14
Concurrency: 10.62
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 130687.32
Median E2E Latency (ms): 155580.79
P90 E2E Latency (ms): 193048.91
P99 E2E Latency (ms): 238439.68
---------------Time to First Token----------------
Mean TTFT (ms): 29376.32
Median TTFT (ms): 10920.07
P99 TTFT (ms): 97865.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 23.43
Median TPOT (ms): 22.39
P99 TPOT (ms): 36.67
---------------Inter-Token Latency----------------
Mean ITL (ms): 21.78
Median ITL (ms): 19.68
P95 ITL (ms): 21.73
P99 ITL (ms): 21.92
Max ITL (ms): 15030.64

A test with 20 concurrent threads on 4x RTX 5090, with good results. With single requests, the average generation speed is 105-115 tok/s. sglang version 0.5.9.
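As a quick sanity check, the headline output throughput is consistent with the raw counts in the report:

```shell
# 93037 generated tokens over a 246.14 s benchmark run.
python3 -c "print(round(93037 / 246.14, 2))"  # 377.98 tok/s, matching the report
```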

Does this latest upload include the MTP hidden layers? Does MTP speculative decoding work?
