Vision tasks always start with intro thinking process
#3
by ampersandru - opened
Tested this with llama.cpp and each time I ask it to analyze an image, it almost always starts out with:
It appears you have provided a text-based description of an image rather than the image itself. Based on the text provided, here is a breakdown of the scene described:
or:
It appears that the text you provided is an OCR (Optical Character Recognition) error.
The text is a "broken" version of an image description where the computer tried to read the visual layout of a photo as text. It has merged the descriptions of the people in the photo with the background and the text on their clothing.
Deciphering the "Hidden" Image:
Based on the text, here is what the photo actually depicts:
I have --reasoning off
Thanks!