---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-Critic: Passage Relevance Filter for KB-VQA

[![CVPR 2026 Highlight](https://img.shields.io/badge/CVPR-2026%20Highlight-f9f107.svg)](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[![Paper](https://img.shields.io/badge/Paper-arXiv%202511.22715-B31B1B.svg)](https://arxiv.org/abs/2511.22715)
[![Project Page](https://img.shields.io/badge/🌐-Project%20Page-blue.svg)](https://aimagelab.github.io/ReAG/)
[![HF Collection](https://img.shields.io/badge/🤗-HF%20Collection-yellow.svg)](https://huggingface.co/collections/aimagelab/reag)

ReAG-Critic is the passage filtering component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). Given an image, a question, and a retrieved text passage, it predicts whether the passage is relevant and should be forwarded to the generator, or discarded as noise.

It is based on [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and scores each passage with a single forward pass: the probability of the next token being `"Yes"` is compared against a configurable threshold (default: `0.1`), making the critic fast and easy to calibrate.
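
Concretely, the decision rule is a softmax over the logits at the final position followed by a threshold on P(`"Yes"`). The helper below is an illustrative sketch of just that step (the `is_relevant` name is ours; the full, runnable pipeline further down shows how the logits and tokenizer are obtained):

```python
import torch

def is_relevant(logits: torch.Tensor, tokenizer, threshold: float = 0.1) -> bool:
    # `logits` holds the critic's output at the last position, shape (1, vocab_size).
    probs = torch.softmax(logits.float(), dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    return probs[0, yes_id].item() > threshold
```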

The filtered passages are then passed to the generator ([aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) or [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B)) for answer generation with explicit chain-of-thought reasoning.

---

## Model Description

Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:

1. **ReAG-Critic (this model)** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage). It outputs a `yes_probability` score; passages above the threshold are kept.
2. **ReAG Generator** receives only the filtered, relevant passages and generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>` (the format is illustrated right below).
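
For a quick look at that output format, here is a made-up response string (illustrative only, not actual model output) parsed with the same regexes used in the full pipeline below:

```python
import re

# Hypothetical generator response, shown only to illustrate the format.
raw = ("<think>The passage on uses mentions anti-bacterial properties, which "
       "answers the question about the plant.</think>"
       "<answer>anti-bacterial properties</answer>")

reasoning = re.search(r"<think>(.*?)</think>", raw, re.DOTALL).group(1)
answer = re.search(r"<answer>(.*?)</answer>", raw, re.DOTALL).group(1)
print(answer)  # anti-bacterial properties
```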

ReAG significantly outperforms prior methods on both **Encyclopedic-VQA** and **InfoSeek**.

---

## Full Pipeline Usage

The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.

```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.

Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}

The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    # Prefer bfloat16 and FlashAttention on GPU; fall back to float32 on CPU.
    if torch.cuda.is_available():
        return {
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        }
    return {
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

    with torch.inference_mode():
        outputs = critic(**inputs)
        logits = outputs.logits[:, -1, :].float()
        probs = torch.softmax(logits, dim=-1)

    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
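
Because the critic returns a continuous `yes_probability`, the default threshold of `0.1` can also be tuned rather than taken as-is. The sketch below is a hypothetical calibration loop, not part of the official repository: it reuses `run_reag_critic` from the pipeline above and assumes a small hand-labeled set of `(image, question, passage, is_relevant)` tuples.

```python
def calibrate_threshold(critic, processor, labeled_examples,
                        candidates=(0.05, 0.1, 0.2, 0.3, 0.5)):
    """Pick the yes-probability threshold maximizing F1 on labeled examples."""
    # Score every example once; thresholds are then swept over cached scores.
    scored = [
        (run_reag_critic(critic, processor, img, q, psg)["yes_probability"], label)
        for img, q, psg, label in labeled_examples
    ]
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        tp = sum(1 for s, l in scored if s > t and l)
        fp = sum(1 for s, l in scored if s > t and not l)
        fn = sum(1 for s, l in scored if s <= t and l)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```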

---

## Model Collection

| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (3B) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Generator (7B) |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic (this model) |

---

## Repository & Evaluation

Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.

---

## Citation

```bibtex
@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```