Not working on 0.16.1rc1.dev141+g792a74b97

#2
by ineersa - opened

Command:

CUDA_VISIBLE_DEVICES=1,2 \
vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --quantization modelopt \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Getting an error:

(Worker pid=125572) (Worker_TP0 pid=125572) INFO 03-02 12:11:11 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.                                                                                                                      
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800] WorkerProc failed to start.                                                                                                                                                         
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800] Traceback (most recent call last):                                                                                                                                                  
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 771, in worker_main                                                     
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     worker = WorkerProc(*args, **kwargs)                                                                                                                                            
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                            
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper                                                                      
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     return func(*args, **kwargs)                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]            ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 597, in __init__                                                        
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.worker.load_model()                                                                                                                                                        
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 336, in load_model                                                                
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.model_runner.load_model(load_dummy_weights=dummy_weights)                                                                                                                  
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper                                                                      
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     return func(*args, **kwargs)                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]            ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4218, in load_model                                                         
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.model = model_loader.load_model(                                                                                                                                           
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]                  ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                           
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper                                                                      
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     return func(*args, **kwargs)                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]            ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model                                              
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     model = initialize_model(                                                                                                                                                       
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]             ^^^^^^^^^^^^^^^^^                                                                                                                                                       
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper                                                                      
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     return func(*args, **kwargs)                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]            ^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                    
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model                                              
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     model = model_class(vllm_config=vllm_config, prefix=prefix)                                                                                                                     
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                     
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_5.py", line 652, in __init__                                                         
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.visual = Qwen3_VisionTransformer(                                                                                                                                          
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]                   ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                          
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 401, in __init__                                                        
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     Qwen3_VisionBlock(                                                                                                                                                              
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 233, in __init__                                                        
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.mlp = Qwen3_VisionMLP(                                                                                                                                                     
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]                ^^^^^^^^^^^^^^^^                                                                                                                                                     
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 194, in __init__                                                        
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.linear_fc2 = RowParallelLinear(                                                                                                                                            
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]                       ^^^^^^^^^^^^^^^^^^
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1407, in __init__
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     self.quant_method.create_weights(
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]   File "/home/ineersa/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/modelopt.py", line 1125, in create_weights
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800]     raise ValueError(
(Worker pid=125573) (Worker_TP1 pid=125573) ERROR 03-02 12:11:12 [multiproc_executor.py:800] ValueError: Unsupported model when in features size is not multiple of 16

Looks like it's trying to init vision even with --language-model-only and fails.
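
A possible reading of the traceback: the error comes from the RowParallelLinear inside Qwen3_VisionMLP, and a row-parallel layer splits its input features across tensor-parallel ranks, so the size being checked is probably the per-rank slice, not the full width. Here is a minimal sketch of that arithmetic. The width 4304 is a placeholder chosen for illustration, not read from this model's config, and the helper names are hypothetical, not vLLM functions.

```python
def per_rank_in_features(in_features: int, tp_size: int) -> int:
    """Input features each rank holds when a row-parallel layer is sharded."""
    if in_features % tp_size != 0:
        raise ValueError("in_features must divide evenly across TP ranks")
    return in_features // tp_size


def passes_multiple_of_16_check(in_features: int, tp_size: int) -> bool:
    """Mirror the condition named in the error message, applied per rank."""
    return per_rank_in_features(in_features, tp_size) % 16 == 0


# Placeholder value (assumption): check the model's config.json for the real one.
HYPOTHETICAL_VISION_MLP_WIDTH = 4304

for tp_size in (1, 2):
    shard = per_rank_in_features(HYPOTHETICAL_VISION_MLP_WIDTH, tp_size)
    ok = passes_multiple_of_16_check(HYPOTHETICAL_VISION_MLP_WIDTH, tp_size)
    print(f"tp={tp_size}: per-rank in_features={shard}, multiple of 16: {ok}")
```

If the real width behaves like the placeholder, the check would pass without tensor parallelism and fail with --tensor-parallel-size 2. I have not verified that against this checkpoint.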

Is there a way to add this quant with vision support, or does MTP not work there?

Owner

Not sure about version compatibility there, but I'm making the full vision quants and should have them up today.

vllm 0.16.1rc1.dev160+g6521ccf28.cu130, transformers 5.2.0

vllm serve osoleve/Qwen3.5-27B-NVFP4-MTP \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
  --language-model-only \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs=1

It runs okay, with an acceptance rate of around 80%.
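
For anyone scripting this launch, here is a small sketch that builds the --speculative-config value with json.dumps instead of hand-quoting the JSON in the shell. It uses only flags that appear in the command above; the variable names are mine, and it only prints the command rather than running it.

```python
import json
import shlex

# Same speculative settings as in the command above.
spec_config = {"method": "qwen3_next_mtp", "num_speculative_tokens": 1}

argv = [
    "vllm", "serve", "osoleve/Qwen3.5-27B-NVFP4-MTP",
    "--quantization", "modelopt",
    "--trust-remote-code",
    "--language-model-only",
    "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
]

# shlex.join adds the shell quoting the JSON argument needs.
command = shlex.join(argv)
print(command)
```

Swapping "qwen3_next_mtp" for "mtp" in spec_config gives the speculative setting from the first command in this thread.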
