---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-3B: Reasoning-Augmented Generation for KB-VQA

[![CVPR 2026 Highlight](https://img.shields.io/badge/CVPR-2026%20Highlight-f9f107.svg)](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[![Paper](https://img.shields.io/badge/Paper-arXiv%202511.22715-B31B1B.svg)](https://arxiv.org/abs/2511.22715)
[![Project Page](https://img.shields.io/badge/🌐-Project%20Page-blue.svg)](https://aimagelab.github.io/ReAG/)
[![HF Collection](https://img.shields.io/badge/🤗-HF%20Collection-yellow.svg)](https://huggingface.co/collections/aimagelab/reag)

ReAG-3B is the generator component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). It is based on [Qwen2.5-VL-3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and fine-tuned with a multi-stage strategy: supervised fine-tuning as a cold start, followed by reinforcement learning to promote explicit reasoning grounded in retrieved evidence.

For the full pipeline, a companion **critic model** ([aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic)) first filters noisy retrieved passages before they are fed to this generator.

A larger 7B variant is also available: [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B).

---

## Model Description

Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:

1. **ReAG-Critic** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage).
2. **ReAG-3B (this model)** generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>`; a well-formed output is shown below the list.

ReAG significantly outperforms prior methods on both **Encyclopedic-VQA** and **InfoSeek**.

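For reference, a well-formed generation looks like the following (an illustrative output for the wild-basil example used later in this card, not an actual model transcript):

```
<think>The image shows wild basil (Clinopodium vulgare). The retrieved "Uses" passage
states the plant has been shown to have anti-bacterial properties.</think><answer>anti-bacterial properties</answer>
```
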
---

## Full Pipeline Usage

The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.

```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.

Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}

The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    """Download an image and convert it to RGB."""
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    """Pick sensible `from_pretrained` settings for the available hardware."""
    if torch.cuda.is_available():
        return {
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            # Requires the flash-attn package; drop this line to fall back to SDPA.
            "attn_implementation": "flash_attention_2",
        }
    return {
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    """Split the raw generation into reasoning and final answer."""
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        # Fall back to the full text if the answer tags are missing.
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Append "<think>" so generation starts directly inside the reasoning tags.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    # Decode only the newly generated tokens and restore the forced "<think>" prefix.
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

    # Score relevance from the next-token distribution instead of sampling:
    # compare the probabilities of "Yes" and "No" at the first answer position.
    with torch.inference_mode():
        outputs = critic(**inputs)
        logits = outputs.logits[:, -1, :].float()
        probs = torch.softmax(logits, dim=-1)

    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ─────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with the filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
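
The `yes_prob_threshold` argument of `run_reag_critic` (0.1 by default) trades recall for precision: raising it keeps fewer, higher-confidence passages as context.

If GPU memory is tight, the checkpoints can also be loaded quantized. A minimal sketch using the standard `transformers` + `bitsandbytes` 4-bit path; the quantization settings below are illustrative, not an official ReAG configuration:

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Illustrative 4-bit setup; requires the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
generator = AutoModelForImageTextToText.from_pretrained(
    "aimagelab/ReAG-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
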
---

## Model Collection

| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (this model) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Larger generator variant |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic |

---

## Repository & Evaluation

Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.

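Retrieval itself is not shown in the pipeline snippet above, which hard-codes `passages`. With a FAISS index, the retrieval step would look roughly like the sketch below; `embed_image_question`, the index path, and `passage_store` are hypothetical placeholders standing in for the repository's actual retrieval setup:

```python
import faiss  # pip install faiss-cpu (or faiss-gpu)
import numpy as np

# Hypothetical artifacts; see the GitHub repository for the real indices.
index = faiss.read_index("path/to/passages.faiss")
query = np.asarray(embed_image_question(image, question), dtype="float32")  # hypothetical encoder

scores, ids = index.search(query.reshape(1, -1), k=5)  # top-5 nearest passages
passages = [passage_store[i] for i in ids[0]]          # hypothetical id -> text lookup
```
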
---

## Citation

```bibtex
@article{compagnoni2025reag,
  title={ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2511.22715},
  year={2025}
}
```