Qwen3.5-4B/9B DFlash supports VL mode
Good work! I now see the speedup on my GPU in text mode, but I'm not sure whether it also works in VL mode for the Qwen3.5-4B/9B models. BTW, does it work with int4 quantization? Thanks.
Yes, we tested Qwen3.5-9B-DFlash on various VL tasks and it works well.
Note that the speedup reported here is end-to-end, which includes the prefill time, so DFlash's decode-time speedup is amortized by the long prefill of VL inputs. The acceptance length is still generally good, though.
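To make the amortization effect concrete, here is a small illustrative sketch (not from the DFlash codebase; all timing numbers are hypothetical) of how a fixed decode-only speedup translates into end-to-end speedup once prefill time is included:

```python
# Illustrative sketch: why a long VL prefill amortizes a decode-only speedup.
# All timing numbers below are hypothetical, for demonstration only.

def end_to_end_speedup(prefill_s: float, decode_s: float, decode_speedup: float) -> float:
    """End-to-end speedup when only the decode phase is accelerated."""
    baseline = prefill_s + decode_s
    accelerated = prefill_s + decode_s / decode_speedup
    return baseline / accelerated

# Text-only prompt: short prefill, so most of the decode speedup survives.
text_mode = end_to_end_speedup(prefill_s=0.2, decode_s=4.0, decode_speedup=2.5)

# VL input: the long image prefill dilutes the same decode speedup end-to-end.
vl_mode = end_to_end_speedup(prefill_s=3.0, decode_s=4.0, decode_speedup=2.5)

print(f"text mode: {text_mode:.2f}x, VL mode: {vl_mode:.2f}x")
```

The acceptance length (and hence the per-token decode speedup) can be similar in both modes; only the share of total time spent decoding differs.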
As for quantized target models: yes, DFlash should work with an int4 target model, achieving an acceptance length similar to the BF16 one.
VL works well on our int4 model too. One interesting thing I found: with longer context, the acceptance rate drops dramatically, even on a test bench that initially has a very high acceptance rate. Have you run into a similar issue? Any way to solve it? Thanks.