Unlocking UML Class Diagram Understanding in Vision Language Models
Abstract
A benchmark for visual question answering using UML class diagrams is introduced, along with a large-scale dataset and a LoRA-based fine-tuning approach that exceeds the performance of Qwen 3.5 27B.
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regard-ing diagrams compared to photos. Although progress has been made in the area of bar charts, line charts and other diagrams like that there is still few research concerned with other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16.000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 2
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper