Amazing

#2 by ehartford - opened

Thanks for this!
How can I do this myself?
Is it in LLM compressor?

Red Hat AI org

We created this using the Speculators repository: https://github.com/vllm-project/speculators

There are a few small changes we had to make to support Gemma 4, but we are looking to land those very soon so you can try it out yourself!
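
In the meantime, here is a rough sketch of what serving a converted speculator looks like on the vLLM side. The DFlash repo id, the "dflash" method name, and the token count below are placeholders on my part, not confirmed values; check the Speculators docs and the model card for the exact config:

```python
# Rough sketch of serving a speculator with vLLM's offline API.
# The DFlash repo id and "dflash" method name are placeholders;
# consult the Speculators docs / model card for the real values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/gemma-4-31B-it-NVFP4",  # verifier (base) model
    speculative_config={
        "model": "RedHatAI/gemma-4-31B-it-speculator.dflash",  # placeholder id
        "method": "dflash",           # placeholder method name
        "num_speculative_tokens": 3,  # illustrative value
    },
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```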

Exciting project!

> There are a few small changes we had to make to support Gemma 4, but we are looking to land those very soon so you can try it out yourself!

Does this also mean "small changes" are needed to make it run using vLLM? I tried it with different verifier models and different attention backends, but with no success...

EDIT:

  • Also, I don't see any changes regarding Gemma 4 and DFlash in the vLLM repo (I do see the changes for eagle3, and eagle3 works).
  • I see the redhat-h100-testing branch, but no commits ahead of the main branch at vllm. Could it be that the vllm-openai:cu130-nightly container image doesn't contain all the latest changes?
Red Hat AI org

@GabrielaCats It is working on Hopper-architecture GPUs (e.g. H100s), but there are some attention backend conflicts on Ampere.

The issue comes from DFlash requiring non-causal attention support (which currently very few backends can handle). This PR (https://github.com/vllm-project/vllm/pull/39930) fixes a bug in DFlash where we currently require the DFlash drafter and the verifier to use the same (non-causal) backend. The PR also makes it easier to manually specify a separate attention backend for the drafter. Once this is merged, it should clear up a lot of the attention backend conflicts you've been seeing.
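
As a side note for anyone unfamiliar with the causal/non-causal distinction: a causal backend only lets position i attend to positions ≤ i, which is all most serving backends implement. A toy illustration in plain PyTorch (not vLLM code):

```python
# Toy illustration of causal vs. non-causal attention (plain PyTorch, not vLLM).
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim)

# Causal: position i can only attend to positions <= i.
causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Non-causal: every position attends to the full sequence; this is the
# mode the DFlash drafter needs and that few serving backends implement.
non_causal = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(torch.allclose(causal, non_causal))  # False: the mask changes the output
```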

@fynnsu Thank you! I really appreciate the explanation and it's exactly the exception I was getting when struggling to run it :)

Do you happen to know whether that PR fixes it for all GPU architectures, or only some?

Red Hat AI org

Could you share some more details on the hardware you are running on? And the error you see?

Unfortunately Gemma itself has some complex attention backend requirements, which will limit the choices available on some hardware. Then the DFlash head adds the additional condition that non-causal attention be supported. The linked PR should separate these two requirements a bit, allowing distinct backends for the two parts of the model, but it won't solve the problem if the backends are still incompatible with your hardware.

Could you also verify that you can run the base Gemma model and maybe this Eagle-3 model: https://huggingface.co/RedHatAI/gemma-4-31B-it-speculator.eagle3? That will help narrow down the source of the issue. The Eagle-3 model is also a good option to use until the DFlash issue is resolved.
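
For reference, a minimal sketch of those two checks with vLLM's offline API (the num_speculative_tokens value is just illustrative; run the two checks in separate processes so they don't contend for GPU memory):

```python
from vllm import LLM, SamplingParams

# Check 1: base (verifier) model alone.
base = LLM(model="RedHatAI/gemma-4-31B-it-NVFP4")
print(base.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)

# Check 2 (run in a fresh process): base model plus the Eagle-3 speculator.
spec = LLM(
    model="RedHatAI/gemma-4-31B-it-NVFP4",
    speculative_config={
        "model": "RedHatAI/gemma-4-31B-it-speculator.eagle3",
        "method": "eagle3",
        "num_speculative_tokens": 3,  # illustrative value
    },
)
print(spec.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```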

Red Hat AI org

Actually I see you added more details in another thread, I will also respond there.

@fynnsu Yes, I'm running the "normal" model RedHatAI/gemma-4-31B-it-NVFP4 without any issues (with the auto-selected attention backend, i.e. TRITON_ATTN). I experimented with the eagle3 speculator some time ago and it worked (no gibberish), using RedHatAI/gemma-4-31B-it-speculator.eagle3. I'll try eagle3 again, but this time with the latest version of vLLM. As for the hardware, it's a DGX Spark, which is advertised as Blackwell architecture, but it obviously has some differences from the server-class Blackwell parts. My assumption was that, since the model card of this DFlash model suggests using FLASH_ATTN as the attention backend (and it was advertised to work at least on Hopper), the issue might be with my hardware and FlashAttention, but I'm not an expert...

Thank you for your explanation and help!

EDIT: Just tried it again with the eagle3 speculator and vLLM 0.20.1rc1.dev91+ga749a33d8, and it just works (with more or less default vLLM args).

Red Hat AI org

Yeah, unfortunately the FLASH_ATTN backend worked initially on vLLM main but now doesn't seem to work. I think it doesn't support multi-modal inputs, and the initial Gemma vLLM implementation didn't have multi-modal support implemented. When that was added, I believe a check was also added blocking Gemma from using FLASH_ATTN. There may be a way to still use it with multi-modal disabled, but I'm not sure.
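
If someone wants to experiment, something along these lines might be worth a try (untested; whether disabling image inputs actually bypasses the FLASH_ATTN check for Gemma is an assumption on my part):

```python
# Untested sketch: force the FLASH_ATTN backend and disable image inputs.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # set before vLLM initializes

from vllm import LLM

llm = LLM(
    model="RedHatAI/gemma-4-31B-it-NVFP4",
    limit_mm_per_prompt={"image": 0},  # disallow image inputs entirely
)
```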
