How to enable the **text+image to text** mode? (i.e. add the vision encoder)

#9
by Didier - opened

As always, it is probably very simple once you know how, but I currently do not.

Question

  • How do I combine the GGUF model file (text-only) with the mmproj file (vision encoder)?
  • Text-only model: Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf
  • Vision encoder: mmproj-BF16.gguf

Goal

  • I want to serve the multimodal (text+image) model as an OpenAI-compatible API using the conda package llama-cpp-python.

Env

  • conda create -n model_serve llama-cpp-python ipykernel

Serving

  • python -m llama_cpp.server --model Mistral-Small-3.2-24B-Instruct-2506-Q4_0.gguf
    --clip_model_path mmproj-BF16.gguf

Using (with LiteLLM)

  • Text-to-text mode works fine!
  • When I send text+image input, the reply suggests the model never received the image, i.e. the vision encoder is not active :-(

"""
Please share the image you'd like me to describe. You can paste it directly into the chat, and I'll do my best to analyze its contents for you!
"""

Thanks for any help and your time.

Solved: I simply changed how the model is served, using `-hf` (which fetches the model and its mmproj file from the Hub together) instead of `--model xxxxx --clip_model_path xxxxx`. Text+image to text now works as expected. I.e.

llama-server -hf unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF --jinja --host 127.0.0.1 --port 8080 --api-key "Keep learning" --ctx-size 4096
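To smoke-test the running server, you can send a text+image request with only the standard library. This is a sketch, not a definitive client: it assumes llama-server's OpenAI-compatible `/v1/chat/completions` endpoint and Bearer auth with the `--api-key` value from the command above; `build_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:8080/v1/chat/completions"
API_KEY = "Keep learning"  # must match the --api-key flag above


def build_request(prompt: str, image_bytes: bytes) -> urllib.request.Request:
    """Build a text+image chat-completion request for the local server."""
    # Inline the image as a base64 data URL (JPEG assumed here).
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    body = {
        "max_tokens": 128,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )


# With the server from this thread running, send a real image like so:
# with open("photo.jpg", "rb") as f:
#     req = build_request("Describe this image.", f.read())
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If the vision encoder is loaded, the reply describes the image instead of asking you to paste one.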

Didier changed discussion status to closed
