Findings after vllm v0.20
Since version 0.20 of vllm there is a change that lets you configure a separate attention backend for the main/verifier model and the speculative/draft model (https://github.com/vllm-project/vllm/pull/39930). However, I couldn't make it work on DGX Spark; here are some things I tried and the results (for context, a rough sketch of the serve invocation comes first).
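A rough sketch of the kind of invocation I mean; the model names and num_speculative_tokens are placeholders rather than my exact setup, and the attention-backend settings and extra flags varied per attempt as listed below:
vllm serve <verifier-model> \
  --speculative-config '{"model": "<draft-model>", "num_speculative_tokens": 3}'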
1. Setting the attention backend to FLASH_ATTN for both the main and the speculative model results in:
ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['partial multimodal token full attention not supported']
2. Setting the attention backend to FLASH_ATTN for just the speculative/draft model (the main model is auto-selected and set to TRITON_ATTN) fails with the same error message, i.e.
ValueError: Selected backend AttentionBackendEnum.FLASH_ATTN is not valid for this configuration. Reason: ['partial multimodal token full attention not supported']
3. Not setting the attention backend explicitly (resulting in auto-selection, i.e. main=TRITON_ATTN and draft=FLEX_ATTENTION) and setting --max-num-batched-tokens and --disable-chunked-mm-input works! BUT the results are gibberish:
3.1. Increasing the value of --max-num-batched-tokens seems to make the gibberish start later... but the output is still not usable at all.
3.2. Changing the config.json of the speculative model so that its rope_parameters match the rope_parameters of the main/verifier model (sketched below) helps a bit but again ends up in gibberish.
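The config edit from 3.2 is essentially copying the verifier's rope_parameters into the draft model's config.json. A minimal sketch using jq; the paths are placeholders and this assumes both configs actually carry a rope_parameters key:
# copy rope_parameters from the verifier config into the draft model config
jq --slurpfile v /path/to/verifier/config.json \
  '.rope_parameters = $v[0].rope_parameters' \
  /path/to/draft/config.json > config.json.tmp \
&& mv config.json.tmp /path/to/draft/config.json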
Hope this helps someone, or maybe someone can point out what else to try...
Hi, https://github.com/vllm-project/vllm/pull/39930 was just merged 2 days ago, after the 0.20 release. Could you please try running with a nightly build?
e.g.
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
@fynnsu Thank you. Yes, I did use a nightly build, and to be sure I checked the code to confirm that the changes from that PR are there - it looks like that part is not the issue... Not sure if FlashAttention has some problems with DGX Spark? Google/Gemini says it does...
Yes, I was able to reproduce the gibberish issue and I've opened an issue for it on vllm: https://github.com/vllm-project/vllm/issues/41262.