Model Card for Qwen2.5-VL-7B-ComicsPAP-QLoRA
This model is a fine-tuned version of Qwen2.5-VL-7B-Instruct optimized for sequential narrative understanding in comic art, specifically targeting the ComicsPAP benchmark.
Model Description
This model is a fine-tuned version of Qwen2.5-VL-7B-Instruct using QLoRA. It has been specifically optimized for the ComicsPAP Sequence Filling task, a challenge in the field of visual narrative understanding.
The model is trained to identify the correct missing panel in a comic strip sequence, choosing from 5 possible options.
- Developed by: Francesco Colasurdo
- Model type: Multimodal Large Language Model (Vision-Language)
- Language(s) (NLP): English
- License: Apache-2.0
- Finetuned from model: Qwen2.5-VL-7B-Instruct
Model Sources
- Repository: GitHub
- Dataset: VLR-CVC/ComicsPAP
Uses
Direct Use
The model is designed for the Pick A Panel (Sequence Filling) task. It can be used to analyze narrative consistency between sequential images in comic-style datasets.
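Since the model answers by naming one of the five candidate panels, downstream code typically needs to map its free-text output back to an option index. A minimal sketch (the answer format and the helper name are assumptions, not part of the released code):

```python
import re

def parse_choice(answer_text, n_options=5):
    """Extract a 1-based option index from the model's free-text answer.

    Returns None when no valid option number is found. The assumption
    that the model emits a digit 1-5 is illustrative only.
    """
    match = re.search(r"\b([1-5])\b", answer_text)
    if match:
        idx = int(match.group(1))
        if 1 <= idx <= n_options:
            return idx
    return None

print(parse_choice("The missing panel is option 3."))  # 3
print(parse_choice("Unsure"))                          # None
```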
How to Get Started with the Model
To use this model, the custom image processor from the official GitHub repository is required to format the input panels correctly.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
adapter_id = "kaj04/Qwen2.5-VL-7B-ComicsPAP-QLoRA"

# Load the 7B base model in bfloat16, then attach the QLoRA adapter on top
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)
processor = AutoProcessor.from_pretrained(model_id)
```
Training Details
Training Data
The model was trained on the train split of the VLR-CVC/ComicsPAP dataset (over 4,000 samples). This dataset features composite images of comic panels specifically designed for the "Pick A Panel" (Sequence Filling) task.
Preprocessing
Images were processed using the SingleImagePickAPanel strategy, which arranges the context panels and the candidate options into a structured two-row grid to maximize effective resolution within the model's visual-token limits.
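The two-row layout can be sketched as follows. This is a minimal illustration only, not the repository's actual SingleImagePickAPanel implementation; the panel dimensions and the helper name are assumptions:

```python
from PIL import Image

def compose_panel_grid(context_panels, option_panels, panel_w=200, panel_h=150):
    """Paste context panels into row 1 and candidate panels into row 2
    of a single composite image, resizing every panel uniformly."""
    cols = max(len(context_panels), len(option_panels))
    grid = Image.new("RGB", (cols * panel_w, 2 * panel_h), "white")
    for col, panel in enumerate(context_panels):
        grid.paste(panel.resize((panel_w, panel_h)), (col * panel_w, 0))
    for col, panel in enumerate(option_panels):
        grid.paste(panel.resize((panel_w, panel_h)), (col * panel_w, panel_h))
    return grid

# Example: 4 context panels and 5 candidate options -> one 2-row grid
context = [Image.new("RGB", (320, 240), "gray") for _ in range(4)]
options = [Image.new("RGB", (320, 240), "lightblue") for _ in range(5)]
grid = compose_panel_grid(context, options)
print(grid.size)  # (1000, 300)
```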
Training Hyperparameters
- Method: QLoRA (4-bit quantization)
- LoRA Rank (r): 16
- LoRA Alpha: 32
- Learning Rate: 1e-4 (Warmup 10% + Cosine Decay)
- Batch Size: 1 (Gradient Accumulation Steps: 4)
Speeds, Sizes, Times
- Hardware: 1× NVIDIA A100 (80 GB)
- Training Time: ~10 hours for 600 steps
Testing Data & Metrics
Performance was measured on the validation split of the ComicsPAP dataset. Note on rigor: the validation set was kept strictly as a hold-out set and was never used during training. The official test set was not used for evaluation because its ground-truth labels are withheld for the official challenge.
Results
| Model | Strategy | Val Accuracy |
|---|---|---|
| Qwen2.5-VL-7B (Mine) | QLoRA Fine-tuned | 66.41% |
| Qwen2.5-VL-72B | Zero-shot (from Paper) | 46.88% |
| Qwen2.5-VL-7B | Zero-shot (from Paper) | 30.53% |
This score represents an improvement of +35.88 percentage points over the zero-shot 7B baseline, and it also surpasses the much larger 72B model in the zero-shot setting, showcasing the effectiveness of LoRA adapters in capturing comic-specific visual logic and narrative flow.
Technical Specifications
Compute Infrastructure
The training was conducted on the Snellius national supercomputer (Netherlands). Access was provided by the academic program at TU/e (Eindhoven University of Technology).
Citations
If you use this model or the underlying dataset, please cite the original ComicsPAP paper:
```bibtex
@InProceedings{vivoli2025comicspap,
  author    = "Vivoli, Emanuele and Llabr{\'e}s, Artemis and Souibgui, Mohamed Ali and Bertini, Marco and Llobet, Ernest Valveny and Karatzas, Dimosthenis",
  editor    = "Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel",
  title     = "ComicsPAP: Understanding Comic Strips by Picking the Correct Panel",
  booktitle = "Document Analysis and Recognition -- ICDAR 2025",
  year      = "2026",
  publisher = "Springer Nature Switzerland",
  address   = "Cham",
  pages     = "337--350",
  isbn      = "978-3-032-04614-7"
}
```
Model Card Authors
Francesco Colasurdo
Model Card Contact
- Email: francesco.colasurdo04@gmail.com
- LinkedIn: www.linkedin.com/in/francesco-colasurdo