Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
Paper: 2605.01284
CoE-Wiki-CoE-8B is an 8B vision-language checkpoint fine-tuned for Chain-of-Evidence question answering on Wiki-CoE. Given a question and candidate evidence screenshots, the model is trained to produce a structured answer with an evidence chain.
This checkpoint is intended for research on multimodal QA, visual evidence selection, and evidence-grounded reasoning over document-like screenshots.
The model expects:

- a question
- one or more candidate evidence screenshots

The expected output is a JSON-style response with:

- `evidence_chain`: the selected supporting screenshots and localized evidence
- `answer`: the final answer

For exact prompt formatting and evaluation scripts, see the project code.
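As a minimal sketch, a response in this schema can be parsed with the standard `json` module. The raw string below is a hypothetical example output, not an actual model response; the exact response format is defined in the CoE repository.

```python
import json

# Hypothetical raw model output following the schema above
# (hypothetical example values, not a real model response).
raw = '{"evidence_chain": [{"screenshot": 2, "evidence": "Founded in 1998"}], "answer": "1998"}'

parsed = json.loads(raw)
chain = parsed["evidence_chain"]  # selected supporting screenshots + localized evidence
answer = parsed["answer"]         # final answer string
print(answer)  # → 1998
```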
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "PeiyangLiu/CoE-Wiki-CoE-8B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
Use the same image preprocessing and prompt format as the CoE repository for reproducible results.
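For inference, inputs are typically assembled as a chat-style message containing the candidate screenshots and the question. The message structure below is a hedged sketch with hypothetical file names; the exact prompt template lives in the CoE repository and should be followed for reproducible results.

```python
# Hypothetical question and screenshot paths; replace with real candidate evidence.
question = "When was the company founded?"
screenshots = ["page_1.png", "page_2.png"]

# One user turn containing all candidate screenshots followed by the question.
# This structure is a sketch; the exact prompt template is defined in the CoE repository.
messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": path} for path in screenshots],
            {"type": "text", "text": question},
        ],
    }
]

# With `processor` and `model` loaded as in the snippet above, generation would
# look roughly like this (requires downloading the checkpoint, so it is commented out):
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt",
# ).to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=512)
# print(processor.decode(
#     output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
# ))
```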