How to enable the **text+image to text** mode? (i.e. add the vision encoder)
#9
by Didier - opened
As always, it is probably very simple once you know how, but I currently do not.
Question
- How do I combine the GGUF file (text-only model) with the mmproj file (vision encoder) at serving time?
- Text only: Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf
- Vision encoder: mmproj-BF16.gguf
What
- I am interested in serving the multimodal (text+image) model as an OpenAI compatible API using the conda package llama-cpp-python.
Env
- conda create -n model_serve llama-cpp-python ipykernel
Serving
- python -m llama_cpp.server --model Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf --clip_model_path mmproj-BF16.gguf
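For what it's worth, when serving through llama-cpp-python, images are usually only routed to the vision encoder if a multimodal chat handler is also selected. A hedged sketch — the `--chat_format` value is an assumption, and I am not sure llama-cpp-python ships a handler matching this model family at all:

```shell
# Sketch, assuming a multimodal chat handler exists for this model family;
# "llava-1-5" is an assumption, not a confirmed match for Mistral Small 3.2.
python -m llama_cpp.server \
  --model Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf \
  --clip_model_path mmproj-BF16.gguf \
  --chat_format llava-1-5
```

If no matching handler exists, the clip model is loaded but never used, which would explain a text-only reply.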
Using (with LiteLLM)
- Text to text mode is fine!
- With text+image as input, I get a response indicating that the vision encoder is not active :-(
"""
Please share the image you'd like me to describe. You can paste it directly into the chat, and I'll do my best to analyze its contents for you!
"""
Thanks for any help and your time.
Solved: I simply changed the way the model is served, i.e. using "-hf" instead of "--model xxxxx --clip-model xxxxx". Text+image to text now works as expected. I.e.
llama-server -hf unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF --jinja --host 127.0.0.1 --port 8080 --api-key "Keep learning" --ctx-size 4096
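With the server up, a text+image request can be sent straight to the OpenAI-compatible endpoint. A hedged curl sketch — the base64 payload is elided and must be replaced with real image data:

```shell
# Sketch of a multimodal request against the llama-server endpoint above;
# replace the elided base64 string with actual image data.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer Keep learning" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }]
  }'
```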
Didier changed discussion status to closed