Multimodal support

#2
by Nasa1423 - opened

I understand Gemini was built to be natively multimodal. Could you elaborate on the current capabilities, especially regarding real-time processing of combined audio and video inputs? Furthermore, what does the development roadmap look like for expanding these core multimodal features?

Unsloth AI org

Currently this GGUF only supports text, as noted in the description. Hopefully llama.cpp will be able to support all modalities soon.

OK, now I see that it's a llama.cpp restriction, not specific to this quant. Thanks!

llama.cpp now has support for multimodal vision models, and your Gemma 3 (not 3n) variants already support it (link), so I wanted to check whether there are any plans to support it here as well.

Lastly, what about supporting ALL modalities? The original Gemma 3n supports text, audio, and vision (image and video) inputs.

llama.cpp has had support for multimodal vision models for a while, not just recently. As of pull 18256, it supports Gemma 3n's MobileNet v5 vision architecture specifically.

Are there any plans to add an mmproj for this model any time soon? I'm hoping to try it out whenever it becomes available.
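For context, once an mmproj (multimodal projector) file is published alongside the text GGUF, llama.cpp's multimodal CLI can load the two together. A minimal sketch of the invocation, assuming the file names below, which are placeholders rather than actual release artifacts:

```shell
# Hypothetical example: pair a text GGUF with its companion mmproj
# (vision projector) file using llama.cpp's multimodal CLI.
# The .gguf and image file names are placeholders.
./llama-mtmd-cli \
  -m gemma-3n-E4B-it.gguf \
  --mmproj mmproj-gemma-3n-E4B-it.gguf \
  --image photo.jpg \
  -p "Describe this image."
```

The mmproj file holds the vision encoder and projector weights that map image embeddings into the language model's embedding space, which is why it ships as a separate GGUF from the text weights.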
