Does this model support multimodality?
First of all, thank you very much for contributing these excellent models.
I'd like to ask whether this model supports multimodality. I tried using it with OpenClaw via OMLX, but it says it can't recognize image information, while the original Qwen 3.5 works fine. I checked and found that chat_template.jinja differs from the original model's, lacking the image-related fields. However, even after adjusting it to match the original model's template, it still doesn't work correctly, although I noticed that the model's tags include `image-text-to-text`.
Therefore, I'd like to confirm whether this model supports multimodality, and if so, how to configure it so that it takes effect.
Thanks
Hi, thank you so much for your kind words and support! 😊
Regarding multimodality: the GGUF version does support multimodal inference. To enable it properly, you'll need to use the corresponding mmproj.gguf projection file.
You can find the mmproj.gguf file in the GGUF repository. Once downloaded, place it in the same directory as the main model file. This projection file is required for handling image inputs correctly.
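As an illustration, with a llama.cpp-based runtime the projection file is usually passed explicitly alongside the main model (the file names below are placeholders; the exact flags in OMLX may differ):

```shell
# Sketch, assuming a llama.cpp-style server; substitute your actual file names.
# Both the main GGUF model and the mmproj projection file must be supplied --
# the mmproj file is what enables image inputs.
llama-server \
  --model  ./model-Q4_K_M.gguf \
  --mmproj ./mmproj-model-f16.gguf \
  --port 8080
```

If the runtime is started without the projection file, it typically falls back to text-only inference, which would match the "can't recognize image information" behavior you're seeing.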
Thanks
Thanks for the reply; I'll try it again.
Is multimodality supported only in the GGUF version? Does the model in this repository not support it?