Paper: Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck (arXiv:2505.24840)
This model is a version of Qwen2.5-VL-7B-Instruct enhanced for hierarchical (taxonomic) understanding. It was fine-tuned with LoRA using text-only instruction tuning on the iNat21-Plant taxonomy.
For more details, please refer to our paper.
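To make "text-only instruction tuning on a taxonomy" concrete, here is a minimal sketch of how one taxonomy path could be turned into question/answer pairs that involve no images. The question template, the `instruction`/`response` field names, and the helper `build_text_only_examples` are assumptions made for illustration; they are not taken from the paper or from the training data of this model. The example path (white oak) is likewise only illustrative.

```python
from typing import Dict, List

# Taxonomic ranks from coarse to fine.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_text_only_examples(path: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one taxonomy path into text-only question/answer pairs.

    For each pair of adjacent ranks, ask which parent-rank taxon the
    child-rank taxon belongs to. Hypothetical format, for illustration only.
    """
    examples = []
    for parent_rank, child_rank in zip(RANKS, RANKS[1:]):
        question = (
            f"Which {parent_rank} does the {child_rank} "
            f"{path[child_rank]} belong to?"
        )
        examples.append({"instruction": question, "response": path[parent_rank]})
    return examples


# Illustrative path for white oak; not taken from the model card.
white_oak = {
    "kingdom": "Plantae",
    "phylum": "Tracheophyta",
    "class": "Magnoliopsida",
    "order": "Fagales",
    "family": "Fagaceae",
    "genus": "Quercus",
    "species": "Quercus alba",
}

examples = build_text_only_examples(white_oak)
for ex in examples:
    print(ex["instruction"], "->", ex["response"])
```

Each path yields one pair per adjacent rank pair, so a seven-rank path produces six examples. Pairs like these could then be fed to any standard instruction-tuning pipeline; the actual prompts and data construction are described in the paper.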
Base model: Qwen/Qwen2.5-VL-7B-Instruct