Multimodal support
I understand Gemini was built to be natively multimodal. Could you elaborate on the current capabilities, especially regarding real-time processing of combined audio and video inputs? Furthermore, what does the development roadmap look like for expanding these core multimodal features?
Currently this GGUF supports text only, as noted in the description. Hopefully llama.cpp will be able to support all modalities soon.
Ok, now I see that it's a llama.cpp limitation, not something specific to this quant. Thanks!
llama.cpp now has support for multimodal vision models, and your Gemma 3 (not 3n) variants already support it (link), so I wanted to check whether there are any plans to support it here as well.
Lastly, what about supporting ALL modalities? The original Gemma 3n supports text, audio, and vision (image and video) inputs.