Model Card for kaj04/Qwen2.5-VL-7B-ComicsPAP-QLoRA

This model is a fine-tuned version of Qwen2.5-VL-7B-Instruct optimized for sequential narrative understanding in comic art, specifically targeting the ComicsPAP benchmark.

Model Description

This model is a fine-tuned version of Qwen2.5-VL-7B-Instruct using QLoRA. It has been specifically optimized for the ComicsPAP Sequence Filling task, a challenge in the field of visual narrative understanding.

The model is trained to identify the correct missing panel in a comic strip sequence, choosing from 5 possible options.

  • Developed by: Francesco Colasurdo
  • Model type: Multimodal Large Language Model (Vision-Language)
  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct

Uses

Direct Use

The model is designed for the Pick A Panel (Sequence Filling) task. It can be used to analyze narrative consistency between sequential images in comic-style datasets.
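
In practice, the model replies in free-form text, so downstream code needs to map its answer back to one of the five candidate panels. A minimal sketch of such a parser (the function name and accepted answer formats are illustrative assumptions, not part of the released code):

```python
import re

def parse_choice(text, num_options=5):
    """Map a free-form model reply to a 0-based option index.

    Accepts a standalone letter (A-E) or digit (1-5); returns None
    if no option token is found. Illustrative helper, not part of
    the official ComicsPAP evaluation code.
    """
    m = re.search(r"\b([A-E1-5])\b", text.strip(), re.IGNORECASE)
    if not m:
        return None
    tok = m.group(1).upper()
    return int(tok) - 1 if tok.isdigit() else ord(tok) - ord("A")
```

A regex with word boundaries avoids matching letters inside ordinary words, so replies like "The correct panel is C." resolve cleanly.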

How to Get Started with the Model

To use this model, the custom image processor from the official GitHub repository is required to format the input panels correctly.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
adapter_id = "kaj04/Qwen2.5-VL-7B-ComicsPAP-QLoRA"

# Load the base model in bfloat16 and shard it across available devices
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the ComicsPAP LoRA adapter on top of the base weights
model = PeftModel.from_pretrained(model, adapter_id)
# The processor (tokenizer + image preprocessing) comes from the base model
processor = AutoProcessor.from_pretrained(model_id)
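
The exact prompt template is defined in the official repository. Purely as an illustration (the wording and option labels below are assumptions, not the template the model was trained on), a multiple-choice prompt for the composite image could be built like this:

```python
def build_pick_a_panel_prompt(num_options=5):
    """Build an illustrative multiple-choice prompt for the composite image.

    The actual template used in training comes from the official
    ComicsPAP GitHub repository; this is only a sketch.
    """
    letters = [chr(ord("A") + i) for i in range(num_options)]
    return (
        "The image shows a comic strip with one panel missing, followed by "
        f"{num_options} candidate panels labelled {', '.join(letters)}. "
        "Which candidate fills the missing panel? Answer with a single letter."
    )
```
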

Training Details

Training Data

The model was trained on the train split of the VLR-CVC/ComicsPAP dataset (roughly 4,000 samples). This dataset features composite images of comic panels specifically designed for the "Pick A Panel" (Sequence Filling) task.

Preprocessing

Images were processed using the SingleImagePickAPanel strategy, which organizes the context panels and candidate options into a structured 2×N grid (context on the top row, candidates on the bottom row) to maximize effective resolution within the model's visual token limits.
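
The grid composition can be sketched with PIL as follows. The function name, cell size, and layout details here are assumptions for illustration; the real implementation lives in the official repository.

```python
from PIL import Image

def compose_panel_grid(context_panels, option_panels, cell=(224, 224)):
    """Arrange panels into a two-row grid: context on top, options below.

    Illustrative sketch of a SingleImagePickAPanel-style layout;
    each panel is resized to a fixed cell before pasting.
    """
    cols = max(len(context_panels), len(option_panels))
    w, h = cell
    canvas = Image.new("RGB", (cols * w, 2 * h), "white")
    for i, panel in enumerate(context_panels):
        canvas.paste(panel.resize(cell), (i * w, 0))       # top row: context
    for i, panel in enumerate(option_panels):
        canvas.paste(panel.resize(cell), (i * w, h))       # bottom row: options
    return canvas
```

Packing everything into one image keeps the number of visual tokens bounded, at the cost of per-panel resolution.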

Training Hyperparameters

  • Method: QLoRA (4-bit quantization)
  • LoRA Rank (r): 16
  • LoRA Alpha: 32
  • Learning Rate: 1e-4 (Warmup 10% + Cosine Decay)
  • Batch Size: 1 (Gradient Accumulation Steps: 4)
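
The stated schedule (linear warmup over the first 10% of steps to 1e-4, then cosine decay) can be sketched as a pure-Python function, assuming decay to zero over the 600 training steps:

```python
import math

def lr_at(step, total_steps=600, peak_lr=1e-4, warmup_frac=0.10):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Sketch of the schedule described above; assumes decay to zero
    by the final step.
    """
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup          # linear ramp to peak
    progress = (step - warmup) / (total_steps - warmup)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

With a per-device batch size of 1 and 4 gradient accumulation steps, the effective batch size is 4.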

Speeds, Sizes, Times

  • Hardware: 1x NVIDIA A100 (80 GB)
  • Training Time: ~10 hours for 600 steps

Testing Data & Metrics

Performance was measured on the validation split of the ComicsPAP dataset. Note on rigor: the validation set was kept strictly as a hold-out set and was not used at any point during training. The official test set was not used for evaluation because its ground-truth labels are withheld for the official challenge.

Results

Model                        Strategy                     Val Accuracy
Qwen2.5-VL-7B (this work)    QLoRA fine-tuned             66.41%
Qwen2.5-VL-72B               Zero-shot (from paper)       46.88%
Qwen2.5-VL-7B                Zero-shot (from paper)       30.53%

This score represents a significant improvement (+35.88 percentage points over the zero-shot 7B baseline), showcasing the effectiveness of LoRA adapters in capturing comic-specific visual logic and narrative flow.

Technical Specifications

Compute Infrastructure

The training was conducted on the Snellius national supercomputer (Netherlands). Access was provided by the academic program at TU/e (Eindhoven University of Technology).

Citations

If you use this model or the underlying dataset, please cite the original ComicsPAP paper:

@InProceedings{vivoli2025comicspap,
  author="Vivoli, Emanuele and Llabr{\'e}s, Artemis and Souibgui, Mohamed Ali and Bertini, Marco and Llobet, Ernest Valveny and Karatzas, Dimosthenis",
  editor="Yin, Xu-Cheng and Karatzas, Dimosthenis and Lopresti, Daniel",
  title="ComicsPAP: Understanding Comic Strips by Picking the Correct Panel",
  booktitle="Document Analysis and Recognition -- ICDAR 2025",
  year="2026",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="337--350",
  isbn="978-3-032-04614-7"
}

Model Card Authors

Francesco Colasurdo

Model Card Contact

francesco.colasurdo04@gmail.com
www.linkedin.com/in/francesco-colasurdo
