Qwen3.5-4B/9B DFlash supports VL mode
Good work! I now see the speedup on my GPU in text mode, but I'm not sure whether it also works in VL mode for the Qwen3.5-4B/9B models. BTW, does it work with int4 quantization? Thanks.
Yes, we tested Qwen3.5-9B-DFlash on various VL tasks and it works well.
Note that the speedup reported here is end-to-end, which includes the prefill time, so DFlash's decode-time speedup is amortized by the long prefill of VL inputs. The acceptance length is still generally good, though.
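To make the amortization effect concrete, here is a small illustrative sketch (not from the DFlash codebase; all timing numbers are hypothetical) of how a fixed decode-only speedup translates into end-to-end speedup once prefill time is included:

```python
# Illustrative sketch: why a long VL prefill amortizes a decode-only speedup.
# All timing numbers below are hypothetical, for demonstration only.

def end_to_end_speedup(prefill_s: float, decode_s: float, decode_speedup: float) -> float:
    """End-to-end speedup when only the decode phase is accelerated."""
    baseline = prefill_s + decode_s
    accelerated = prefill_s + decode_s / decode_speedup
    return baseline / accelerated

# Text-only prompt: short prefill, so most of the decode speedup survives.
text_mode = end_to_end_speedup(prefill_s=0.2, decode_s=4.0, decode_speedup=2.5)

# VL input: the long image prefill dilutes the same decode speedup end-to-end.
vl_mode = end_to_end_speedup(prefill_s=3.0, decode_s=4.0, decode_speedup=2.5)

print(f"text mode: {text_mode:.2f}x, VL mode: {vl_mode:.2f}x")
```

The acceptance length (and hence the per-token decode speedup) can be similar in both modes; only the share of total time spent decoding differs.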
As for quantized target models: yes, DFlash should work with an int4 target model, achieving an acceptance length similar to the BF16 one.
VL works well on our int4 model too. One interesting thing I found: with longer context, the acceptance rate drops dramatically, even on a test bench that initially has a very high acceptance rate. Have you run into a similar issue? Any way to solve it? Thanks.