arxiv:2605.01284

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

Published on May 2 · Submitted by PeiyangLiu on May 6

Abstract

Chain of Evidence (CoE) is a visual attribution framework that uses Vision-Language Models to reason over document screenshots, enabling precise, pixel-level evidence localization for iterative retrieval-augmented generation systems.

AI-generated summary

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.
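
The attribution step is easiest to see in miniature. Below is a minimal Python sketch of one such step: prompt a VLM with a page screenshot, ask for an answer plus a pixel-space bounding box, and parse the reply. This is an illustration only, not the authors' released interface: query_vlm and the JSON reply format are hypothetical stand-ins (the paper fine-tunes Qwen3-VL-8B-Instruct as the backbone).

import json
import re

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a VLM call (e.g. a fine-tuned Qwen3-VL-8B-Instruct).
    Expected to return text containing a JSON object with an answer and a
    pixel-space bounding box; plug in your own backend here."""
    raise NotImplementedError

def localize_evidence(screenshot: str, question: str) -> dict:
    """Ask the VLM to answer `question` from `screenshot` and to localize
    the supporting evidence as a pixel-level bounding box."""
    prompt = (
        f"Question: {question}\n"
        "Answer using only this page. Reply with JSON of the form "
        '{"answer": "...", "bbox": [x1, y1, x2, y2]}, where bbox gives '
        "pixel coordinates of the region supporting your answer."
    )
    raw = query_vlm(screenshot, prompt)
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate text around the JSON
    return json.loads(match.group(0)) if match else {"answer": raw, "bbox": None}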

Community

Paper author · Paper submitter

Excited to share our latest work on visual RAG! 🚀

Existing iRAG systems linearise documents into plain text, which (1) breaks visual semantics in slides/charts/diagrams, (2) gives only coarse "[Source: Doc-1]" citations, and (3) leaves multi-hop reasoning chains opaque.

We propose Chain of Evidence (CoE): a visual-first iRAG framework that runs retrieval & reasoning directly on document screenshots and outputs pixel-level bounding boxes for every reasoning step. No parsing, no OCR, fully auditable.
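
Roughly, the loop looks like the sketch below. This is illustrative only, not the released code: retrieve, vlm_reason, and the step fields (bbox, finding, next_query, final_answer) are hypothetical stand-ins for the retriever and the fine-tuned VLM.

def chain_of_evidence(question, retrieve, vlm_reason, max_hops=4):
    """Run up to `max_hops` retrieve-and-localize rounds, collecting one
    (screenshot, bbox, finding) triple per hop as the evidence chain."""
    chain, query = [], question
    for _ in range(max_hops):
        step = None
        for page in retrieve(query):              # retriever-agnostic: any ranker works
            step = vlm_reason(page, question)     # VLM reasons over the raw screenshot
            if step["bbox"] is not None:          # evidence localized on this page
                chain.append((page, step["bbox"], step["finding"]))
                break
        if step is None:                          # retrieval returned nothing
            break
        if step.get("final_answer") is not None:  # question resolved
            return step["final_answer"], chain
        query = step["next_query"]                # refine the query for the next hop
    return None, chain                            # unresolved within the hop budget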

📊 CoE reaches 71.1% evidence-localisation accuracy on our new Wiki-CoE benchmark (51K multi-hop QA pairs with bbox annotations) and significantly outperforms text-based baselines on SlideVQA, where layout matters.

We hope CoE pushes RAG from information finding toward causal, verifiable visual reasoning. Datasets, models & code are released. Happy to discuss! 👇

Get this paper in your agent:

hf papers read 2605.01284
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 2

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 0