Issue running the model with vLLM

by aathi1324 - opened Feb 26

I was able to run the model, but the output is not correct. I wonder whether something is wrong with the input tensor format.

log
(EngineCore_DP0 pid=35123) /workspace/user/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=35123) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=35123) /workspace/user/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=35123) return fn(*contiguous_args, **contiguous_kwargs)
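
For context, here is a minimal sketch of the kind of shape check the warning text describes. It is an assumption based only on the wording of the message, not the actual vLLM code: `looks_head_first` is a hypothetical helper, and the head dimension of 128 is a placeholder. The reported seq_len (16) matches the `prompt_tokens` value in the response further down, so a check like this would fire for any prompt shorter than the number of heads and may not, by itself, prove the layout is wrong.

```python
def looks_head_first(shape, head_first=False):
    """Return True if a [B, T, H, ...] shape looks like [B, H, T, ...].

    Hypothetical helper: mirrors the condition stated in the warning text
    (seq_len < num_heads while head_first=False), not vLLM's real code.
    """
    if head_first or len(shape) < 3:
        return False
    seq_len, num_heads = shape[1], shape[2]
    return seq_len < num_heads


# Values from the log: 16 prompt tokens, 64 heads (head dim 128 is a placeholder).
short_prompt = (1, 16, 64, 128)
long_prompt = (1, 256, 64, 128)

print(looks_head_first(short_prompt))  # True: 16 < 64, warning would fire
print(looks_head_first(long_prompt))   # False: 256 >= 64, no warning
```

If this reading is right, sending a prompt longer than 64 tokens should make the warning disappear, which would help separate it from the bad output.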

command
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 --port 8070 --gpu-memory-utilization 0.95 --served-model-name qwen --mm-encoder-tp-mode data --allowed-local-media-path /workspace --attention-backend FLASH_ATTN

I've followed all the instructions you provided.

I'm running the vLLM nightly build:

(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] version 0.16.0rc2.dev496+g4a9c07a0a
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] model Qwen/Qwen3.5-122B-A10B-FP8

output

root@9e309957d443:/workspace/user# curl http://localhost:8070/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 20,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20
  }'
{"id":"chatcmpl-b8456ddecf863d54","object":"chat.completion","created":1772092311,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"do\n\n\n\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":20,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Any help would be appreciated.

yushang

Feb 27

(Worker_TP1 pid=451) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=451) return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=450) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=450) return fn(*contiguous_args, **contiguous_kwargs)
(APIServer pid=1) DEBUG 02-27 05:56:12 loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Me too, I'm seeing the same warning.