Incorrect output when running vLLM
I am able to run the model, but the output is not correct. I wonder whether something is wrong with the input tensor.
log
(EngineCore_DP0 pid=35123) /workspace/user/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=35123) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=35123) /workspace/user/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=35123) return fn(*contiguous_args, **contiguous_kwargs)
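The warning text itself describes a simple shape heuristic: it fires whenever the sequence length of a [B, T, H, ...] tensor is smaller than the number of heads. Here is a minimal sketch of that comparison, reconstructed only from the message above and not taken from vLLM's source. The function name `looks_head_first` and the head dimension of 128 are hypothetical, chosen for illustration.

```python
def looks_head_first(shape):
    """Return True when a [B, T, H, ...] shape would trip the heuristic.

    Assumption: the check is simply seq_len < num_heads, as the warning
    message states. This is not vLLM's actual code.
    """
    _batch, seq_len, num_heads = shape[:3]
    return seq_len < num_heads


# Numbers from the log: seq_len 16, num_heads 64 (head dim 128 is assumed).
print(looks_head_first((1, 16, 64, 128)))   # short prompt
print(looks_head_first((1, 512, 64, 128)))  # longer prompt
```

If that reading is right, the warning alone does not prove the tensors are laid out incorrectly: any prompt shorter than the head count would trigger it, even with inputs in the expected [B, T, H, ...] format.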
command
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 --port 8070 --gpu-memory-utilization 0.95 --served-model-name qwen --mm-encoder-tp-mode data --allowed-local-media-path /workspace --attention-backend FLASH_ATTN
I have followed all the instructions you gave.
I am running the vLLM nightly build:
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] version 0.16.0rc2.dev496+g4a9c07a0a
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] model Qwen/Qwen3.5-122B-A10B-FP8
output
root@9e309957d443:/workspace/user# curl http://localhost:8070/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [{"role": "user", "content": "Say hello in one word."}],
"max_tokens": 20,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20
}'
{"id":"chatcmpl-b8456ddecf863d54","object":"chat.completion","created":1772092311,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"do\n\n\n\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":20,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
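To make the bad output easier to compare across runs, the response can be reduced to the few fields that matter for debugging. This is a sketch using only the Python standard library. `summarize` is a hypothetical helper, and `sample` copies a subset of the fields from the response above.

```python
import json


def summarize(response_text):
    """Pull the debugging-relevant fields out of a chat completion body."""
    body = json.loads(response_text)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


# Subset of the response shown above, re-serialized for the example.
sample = json.dumps({
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "do\n\n\n\n"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 16, "total_tokens": 20, "completion_tokens": 4},
})

summary = summarize(sample)
# repr() makes the trailing newlines in the content visible.
print(repr(summary["content"]), summary["finish_reason"])
print(summary["prompt_tokens"], summary["completion_tokens"])
```

Note that `prompt_tokens` is 16, the same number the warning reports as `seq_len`.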
Looking forward to your help.
(Worker_TP1 pid=451) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=451) return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=450) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=450) return fn(*contiguous_args, **contiguous_kwargs)
(APIServer pid=1) DEBUG 02-27 05:56:12 loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Me too.
I have to wait several minutes to get an answer from the model.
For me, though, it returns numbers instead of words, and sometimes it does not respond at all.
SGLang is better than vLLM. Bye, vLLM.
SGLang is slower than vLLM, and I fixed that issue myself.
Please share how you fixed this issue.
Yep, same issue.
How did you fix it? I'm having the same issue.
SGLang is slower than vLLM, and I fixed that issue myself.
Can you describe how you fixed it? Thanks