Findings after vLLM v0.20

by GabrielaCats - opened

Since version 0.20 of vLLM there is a change that lets you configure separate attention backends for the main/verifier model and the speculative/draft model (https://github.com/vllm-project/vllm/pull/39930). However, I couldn't make it work (on DGX Spark). Here are some things I tried and the results:
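
Roughly, the kind of invocation I mean (model names are placeholders; the draft-side attn_backend key inside --speculative-config is my assumption about how the new PR exposes this, not a confirmed flag):

# Main/verifier backend via the existing VLLM_ATTENTION_BACKEND env var;
# the draft-side "attn_backend" key is only a guess at the PR's API.
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <main-model> \
    --speculative-config '{"model": "<draft-model>", "num_speculative_tokens": 3, "attn_backend": "FLASH_ATTN"}'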

  1. Setting the attention backend to FLASH_ATTN for both the main and the speculative model results in ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['partial multimodal token full attention not supported']

  2. Setting the attention backend to FLASH_ATTN for just the speculative/draft model (the main backend is auto-selected as TRITON_ATTN) fails with the same error message, i.e. ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['partial multimodal token full attention not supported']

  3. Not setting the attention backend explicitly (resulting in auto-selection, i.e. main=TRITON_ATTN and draft=FLEX_ATTENTION) and setting --max-num-batched-tokens and --disable-chunked-mm-input works, BUT the results are gibberish (see the sketch after this list):
    3.1. Increasing the value of --max-num-batched-tokens seems to make the gibberish start later... but the output is still not usable at all
    3.2. Changing the config.json of the speculative model so that its rope_parameters match the rope_parameters of the main/verifier model helps a bit, but again ends in gibberish
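
For anyone trying to reproduce attempt 3, the invocation looked roughly like this (model names and numeric values are placeholders; --max-num-batched-tokens and --disable-chunked-mm-input are used as-is):

# Backends left to auto-selection (main=TRITON_ATTN, draft=FLEX_ATTENTION here);
# model names and numbers are placeholders, adjust for your setup.
vllm serve <main-model> \
    --max-num-batched-tokens 8192 \
    --disable-chunked-mm-input \
    --speculative-config '{"model": "<draft-model>", "num_speculative_tokens": 3}'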

Hope this helps someone, or maybe someone can point out what else to try...

Red Hat AI org

Hi, https://github.com/vllm-project/vllm/pull/39930 was only merged 2 days ago, after the 0.20 release. Could you please try running with a nightly build?

e.g.

uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
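
and then confirm which build you ended up with, e.g.:

python -c "import vllm; print(vllm.__version__)"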

@fynnsu Thank you. Yes, I did use a nightly build, and I also checked the installed code to confirm that the changes from that PR are present - so that part does not seem to be the issue... Not sure if flash attention has some problems on DGX Spark? Google/Gemini says it does...

Red Hat AI org

Yes, I was able to reproduce the gibberish issue, and I've opened a vLLM issue for it: https://github.com/vllm-project/vllm/issues/41262.

Thank you! I left a comment with some details about the hardware here.
