Issue running the model with vLLM

#1
by aathi1324 - opened

I was able to run the model, but the output is not correct. I suspect something is wrong with the input tensor.

log
(EngineCore_DP0 pid=35123) /workspace/user/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=35123) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=35123) /workspace/user/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=35123) return fn(*contiguous_args, **contiguous_kwargs)
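For anyone trying to interpret this warning: judging only by its wording, it looks like a shape heuristic that compares the second and third dimensions of a tensor expected in [B, T, H, ...] layout. The sketch below is an illustration of that kind of check, not vLLM's actual code; `looks_head_first` and the example shapes are hypothetical. It shows that such a check would also fire for a correctly laid out tensor whenever the prompt is shorter than the number of heads, so the warning alone may not prove the inputs are wrong.

```python
def looks_head_first(shape, head_first=False):
    """Hypothetical helper mirroring the wording of the warning.

    Assumes `shape` is meant to be [B, T, H, ...] when head_first is False.
    Returns True when seq_len < num_heads, the condition the warning reports.
    """
    if head_first or len(shape) < 3:
        return False
    seq_len, num_heads = shape[1], shape[2]
    return seq_len < num_heads


# A 16-token prompt with 64 heads trips the check even in [B, T, H, D] layout.
print(looks_head_first((1, 16, 64, 128)))
# A longer sequence with the same layout does not.
print(looks_head_first((1, 4096, 64, 128)))
```

Note that the request shown further down reports `prompt_tokens: 16`, which matches the `seq_len (16)` in the warning, so a short prompt is one possible explanation for the message.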

command
vllm serve Qwen/Qwen3.5-122B-A10B-FP8 --port 8070 --gpu-memory-utilization 0.95 --served-model-name qwen --mm-encoder-tp-mode data --allowed-local-media-path /workspace --attention-backend FLASH_ATTN

I have followed all the instructions you provided.

Running on the vLLM nightly build:

(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293]
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] █ █ █▄ ▄█
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.0rc2.dev496+g4a9c07a0a
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] █▄█▀ █ █ █ █ model Qwen/Qwen3.5-122B-A10B-FP8
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=34901) INFO 02-26 07:50:11 [utils.py:293]

output

root@9e309957d443:/workspace/user# curl http://localhost:8070/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [{"role": "user", "content": "Say hello in one word."}],
"max_tokens": 20,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20
}'
{"id":"chatcmpl-b8456ddecf863d54","object":"chat.completion","created":1772092311,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"do\n\n\n\n","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":16,"total_tokens":20,"completion_tokens":4,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
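For anyone reproducing this from Python instead of curl, here is a minimal sketch that sends the same request body using only the standard library. The function names `build_request` and `send_request` are made up for illustration; the URL, served model name, and sampling parameters are copied from the command and request above.

```python
import json
import urllib.request


def build_request(base_url="http://localhost:8070", model="qwen"):
    """Build the same chat completion request as the curl call above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 20,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request, timeout=600):
    """Send the request and return the assistant message content."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `print(repr(send_request(build_request())))` against the running server prints the raw content string, which makes stray newlines such as the ones in the response above easy to spot.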

Looking forward to your help.

(Worker_TP1 pid=451) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=451) return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=450) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (11) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=450) return fn(*contiguous_args, **contiguous_kwargs)
(APIServer pid=1) DEBUG 02-27 05:56:12 loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Me too.

It makes me wait several minutes to get an answer from the model.

For me, it outputs numbers instead of words, and sometimes it does not respond at all.

SGLang is better than vLLM. Bye, vLLM.

SGLang is slower than vLLM, and I fixed that issue myself.

Please share how you fixed this issue.

Yep, same issue.

How did you fix it? I'm having the same issue.

SGLang is slower than vLLM, and I fixed that issue myself.

Can you describe how you fixed it? Thanks
