Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
Paper: arXiv 2605.01284
CoE-SlideVQA-8B is an 8B vision-language checkpoint fine-tuned for Chain-of-Evidence question answering over presentation slide screenshots. Given a user question and candidate slide images, the model is trained to answer using visual evidence from the slides.
This checkpoint is intended for research and prototyping on slide-based visual QA, evidence selection, and grounded multimodal reasoning.
The model expects:
- a user question
- one or more candidate slide screenshots (images)
The expected output is a JSON-style response with:
- evidence_chain: the selected supporting slide screenshots and localized evidence
- answer: the final answer

For exact prompt formatting and evaluation scripts, see the project code.
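As an illustration, a response might have the following shape. Only the evidence_chain and answer keys are confirmed above; the contents of the evidence entries and all values here are hypothetical, and the exact schema is defined by the project code:

{
  "evidence_chain": ["slide_03.png"],
  "answer": "Revenue grew 12% year-over-year."
}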
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "PeiyangLiu/CoE-SlideVQA-8B"

# The processor bundles the image preprocessor and tokenizer for this checkpoint.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load the weights in bfloat16 and shard them across available devices.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
Use the same image preprocessing and prompt format as the CoE repository for reproducible results.
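As a starting point, a minimal inference sketch might look like the one below. The chat-template call and image placeholder convention are assumptions based on common transformers vision-language usage, and the slide path and question are hypothetical; defer to the CoE repository for the canonical prompt format.

from PIL import Image

# Hypothetical inputs: one slide screenshot and a question about it.
image = Image.open("slide_03.png")
question = "What was the reported revenue growth in 2021?"

# Build a chat-style prompt; the exact template is defined by the checkpoint's processor.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens and decode only the newly generated portion.
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)  # JSON-style output with evidence_chain and answer

If the model emits a well-formed JSON string, it can be parsed with json.loads before downstream use.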