Vision tasks always start with intro thinking process

by ampersandru - opened 14 days ago

I tested this with llama.cpp. When I ask the model to analyze an image, its response almost always starts with:
It appears you have provided a text-based description of an image rather than the image itself. Based on the text provided, here is a breakdown of the scene described:

or:
It appears that the text you provided is an OCR (Optical Character Recognition) error.

The text is a "broken" version of an image description where the computer tried to read the visual layout of a photo as text. It has merged the descriptions of the people in the photo with the background and the text on their clothing.

Deciphering the "Hidden" Image:

Based on the text, here is what the photo actually depicts:

I am running with --reasoning off.

Thanks!
