How to enable the **text+image to text** mode? (i.e. add the vision encoder)

#9
by Didier - opened

As always, it is probably very simple once you know how, but I currently do not.

Question

  • How do I combine the GGUF model file (text-only) with the mmproj file (vision encoder)?
  • Text-only model: Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf
  • Vision encoder: mmproj-BF16.gguf

Goal

  • I want to serve the multimodal (text+image) model as an OpenAI-compatible API using the conda package llama-cpp-python.

Env

  • conda create -n model_serve llama-cpp-python ipykernel

Serving

  • python -m llama_cpp.server --model Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf
    --clip_model_path mmproj-BF16.gguf

Using (with LiteLLM)

  • Text-to-text mode works fine!
  • When I send text+image input, the reply suggests the model never received the image, i.e. the vision encoder is not active :-(

"""
Please share the image you'd like me to describe. You can paste it directly into the chat, and I'll do my best to analyze its contents for you!
"""

Thanks for any help and your time.

Solved: I simply changed how the model is served, using `-hf` (which fetches the model and its mmproj file from the Hub together) instead of `--model xxxxx --clip_model_path xxxxx`. Text+image to text now works as expected. I.e.

llama-server -hf unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF --jinja --host 127.0.0.1 --port 8080 --api-key "Keep learning" --ctx-size 4096
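To smoke-test the running server, you can send a text+image request with only the standard library. This is a sketch, not a definitive client: it assumes llama-server's OpenAI-compatible `/v1/chat/completions` endpoint and Bearer auth with the `--api-key` value from the command above; `build_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"
API_KEY = "Keep learning"  # must match the --api-key flag above


def build_request(prompt: str, image_bytes: bytes) -> urllib.request.Request:
    """Build a text+image chat-completion request for the local server."""
    # Inline the image as a base64 data URL (JPEG assumed here).
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    body = {
        "max_tokens": 128,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )


# With the server from this thread running, send a real image like so:
# with open("photo.jpg", "rb") as f:
#     req = build_request("Describe this image.", f.read())
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If the vision encoder is loaded, the reply describes the image instead of asking you to paste one.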

Didier changed discussion status to closed
