Vision tasks always start with intro thinking process

#3
by ampersandru - opened

I tested this with llama.cpp, and whenever I ask it to analyze an image, it almost always starts with:
It appears you have provided a text-based description of an image rather than the image itself. Based on the text provided, here is a breakdown of the scene described:

or:
It appears that the text you provided is an OCR (Optical Character Recognition) error.

The text is a "broken" version of an image description where the computer tried to read the visual layout of a photo as text. It has merged the descriptions of the people in the photo with the background and the text on their clothing.

Deciphering the "Hidden" Image:

Based on the text, here is what the photo actually depicts:

This happens even though I have --reasoning off set.

Thanks!
