Multimodal support

#2
by Nasa1423 - opened

I understand Gemini was built to be natively multimodal. Could you elaborate on the current capabilities, especially regarding real-time processing of combined audio and video inputs? Furthermore, what does the development roadmap look like for expanding these core multimodal features?

Unsloth AI org

Currently this GGUF only supports text, as noted in the description. Hopefully llama.cpp will be able to support all modalities soon.

OK, now I see that it's a llama.cpp restriction, not specific to this quant. Thanks!

llama.cpp now has support for multimodal vision models, and your Gemma 3 (not 3n) variants already support it (link), so I wanted to check whether there are any plans to support it here as well.

Lastly, what about supporting ALL modalities? The original Gemma 3n supports text, audio, and vision (image and video) inputs.

llama.cpp has had support for multimodal vision models for a while, not just recently. As of pull 18256, it supports Gemma 3n's MobileNet v5 vision architecture specifically.

Are there any plans to add an mmproj for this model any time soon? I'm hoping to try it out whenever it becomes available.
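For context, once an mmproj (multimodal projector) file is published alongside the text GGUF, llama.cpp's multimodal CLI can load the two together. A minimal sketch of the invocation, assuming the file names below, which are placeholders rather than actual release artifacts:

```shell
# Hypothetical example: pair a text GGUF with its companion mmproj
# (vision projector) file using llama.cpp's multimodal CLI.
# The .gguf and image file names are placeholders.
./llama-mtmd-cli \
  -m gemma-3n-E4B-it.gguf \
  --mmproj mmproj-gemma-3n-E4B-it.gguf \
  --image photo.jpg \
  -p "Describe this image."
```

The mmproj file holds the vision encoder and projector weights that map image embeddings into the language model's embedding space, which is why it ships as a separate GGUF from the text weights.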
