Avoid Re-encoding Reference Images in Vision-LLM When Comparison Criteria Are User-Defined

#18
by yaroslav332 - opened

Hi everyone,

I’m working with a Vision-LLM (e.g. Qwen-VL, LLaVA, or a llama.cpp-based multimodal model) that needs to compare new images against reference images. The key part of my use case is that users define the comparison criteria (e.g., fur length, ear shape, color patterns), and I’m using image-to-text models to evaluate how well a new image matches a reference according to those criteria.
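For context, my current request flow looks roughly like this sketch (the model name and prompt wording are placeholders; llama-server's OpenAI-compatible chat endpoint accepts images as base64 data URIs, so I rebuild and resend the reference image on every call):

```python
import base64

def load_image_b64(path):
    # Read an image file and return a base64 data URI; llama-server's
    # OpenAI-compatible /v1/chat/completions endpoint accepts images this way.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def build_compare_request(reference_uri, candidate_uri, criteria):
    # criteria is user-defined at request time,
    # e.g. ["fur length", "ear shape", "color patterns"]
    prompt = (
        "Compare the second image against the first (reference) image "
        "on these criteria: " + ", ".join(criteria) + ". "
        "Rate each criterion from 0 to 10 and explain briefly."
    )
    return {
        "model": "qwen-vl",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": reference_uri}},
                {"type": "image_url", "image_url": {"url": candidate_uri}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

Because the criteria change per user, the text part of the prompt differs on every request, but the reference image bytes are identical each time.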

Currently, every time I send a prompt including the reference images, the model re-encodes them from scratch. From the logs, I can see:

    llama-server
    encoding image slice...
    image slice encoded in 3800–4800 ms
    decoding image batch ...

Even for the same reference images, this happens on every request, which makes inference slow.

Questions:

Has anyone dealt with user-defined comparison criteria in Vision-LLM pipelines?

Are there ways to cache or pre-load the reference image encodings in llama.cpp / Hugging Face pipelines, so the same references aren’t re-encoded on every request?

What are recommended strategies to efficiently compare new images against a set of references using image-to-text models without reprocessing the reference images each time?
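One workaround I’ve been considering is to pre-filter with cheap, cacheable embeddings so only the top-k closest references ever reach the slow VLM step. A minimal sketch, assuming any image-embedding function (e.g. a CLIP image encoder) is plugged in as `embed_fn` — the class and names here are hypothetical, not from any specific library:

```python
import numpy as np

class ReferenceCache:
    """Cache reference-image embeddings so each reference is embedded once.

    embed_fn is any function mapping an image to a 1-D feature vector
    (for example a CLIP image encoder); this sketch is library-agnostic.
    """

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._vectors = {}  # name -> unit-norm embedding, computed once

    def add_reference(self, name, image):
        # Embed once at startup; later requests reuse the cached vector.
        v = np.asarray(self.embed_fn(image), dtype=np.float32)
        self._vectors[name] = v / np.linalg.norm(v)

    def top_k(self, image, k=3):
        # Embed only the new image, then rank cached references by
        # cosine similarity (vectors are unit-norm, so a dot product).
        q = np.asarray(self.embed_fn(image), dtype=np.float32)
        q = q / np.linalg.norm(q)
        scored = [(name, float(q @ v)) for name, v in self._vectors.items()]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```

This doesn’t solve the user-defined-criteria part (CLIP similarity is generic, not criterion-aware), but it would cut the number of reference images the VLM has to re-encode per request.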

Thanks in advance for any advice or examples!
