---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-Critic: Passage Relevance Filter for KB-VQA

[CVPR 2026](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[arXiv](https://arxiv.org/abs/2511.22715)
[Project Page](https://aimagelab.github.io/ReAG/)
[Model Collection](https://huggingface.co/collections/aimagelab/reag)

ReAG-Critic is the passage filtering component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). Given an image, a question, and a retrieved text passage, it predicts whether the passage is relevant and should be forwarded to the generator, or discarded as noise.

It is based on [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and operates by checking whether the probability of the next token being `"Yes"` (rather than `"No"`) exceeds a configurable threshold (default: `0.1`), which makes it fast and easy to calibrate.
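
Because each passage reduces to a single `yes_probability` scalar, the threshold can be tuned on a small labeled validation set. Below is a minimal calibration sketch; it is our illustration rather than part of the released code: `val_examples` and `calibrate_threshold` are hypothetical names, and `run_reag_critic` is defined in the full pipeline snippet further down.

```python
def calibrate_threshold(critic, processor, val_examples, grid=None):
    """Pick the yes-probability threshold that maximizes F1 on labeled examples.

    `val_examples` is a hypothetical list of (image, question, passage, is_relevant) tuples.
    """
    grid = grid or [i / 20 for i in range(1, 20)]  # candidate thresholds 0.05 .. 0.95
    # Score every example once; the thresholds are swept over the stored scores,
    # so calibration costs a single critic forward pass per example.
    scored = [
        (run_reag_critic(critic, processor, img, q, p)["yes_probability"], label)
        for img, q, p, label in val_examples
    ]
    best_threshold, best_f1 = 0.1, -1.0
    for t in grid:
        tp = sum(1 for s, y in scored if s > t and y)
        fp = sum(1 for s, y in scored if s > t and not y)
        fn = sum(1 for s, y in scored if s <= t and y)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold
```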

The filtered passages are then passed to the generator ([aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) or [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B)) for answer generation with explicit chain-of-thought reasoning.

---

## Model Description

Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:

1. **ReAG-Critic (this model)** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage). It outputs a `yes_probability` score; passages above the threshold are kept.
2. **ReAG Generator** receives only the filtered, relevant passages and generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>` (see the illustration below).
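
For illustration, with the wild-basil example used in the snippet below, a generator response would follow this format (hypothetical, abbreviated output):

```
<think>The image shows wild basil (Clinopodium vulgare). The retrieved passage states that its leaves are used as an aromatic herb and that it has been shown to have anti-bacterial properties.</think><answer>anti-bacterial properties</answer>
```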

ReAG significantly outperforms prior methods on both **Encyclopedic-VQA** and **InfoSeek**.

---

## Full Pipeline Usage

The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.

```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.

Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}

The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    """Download an image and convert it to RGB."""
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    """Pick loading options: bf16 + FlashAttention-2 on GPU, fp32 on CPU."""
    if torch.cuda.is_available():
        return {
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",  # requires flash-attn; drop if unavailable
        }
    return {
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    """Split the generator output into its <think> reasoning and <answer> parts."""
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Appending "<think>" forces the model to start its reply inside a reasoning block.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

    # A single forward pass: read the next-token distribution at the last position.
    with torch.inference_mode():
        outputs = critic(**inputs)
    logits = outputs.logits[:, -1, :].float()
    probs = torch.softmax(logits, dim=-1)

    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
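
The example scores passages one at a time for clarity. Since the critic needs only a single forward pass per passage and the processor is loaded with `padding_side="left"`, the candidates can also be scored as one padded batch. Below is a minimal sketch reusing the prompts and models above; it is our illustration, not part of the released code, and `run_reag_critic_batched` is a hypothetical helper:

```python
def run_reag_critic_batched(critic, processor, image, question, passages, yes_prob_threshold=0.1):
    """Score all candidate passages for one image-question pair in a single batch."""
    # Build one chat prompt per passage; the image is repeated for each prompt.
    prompts = []
    for passage in passages:
        messages = [
            {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
                ],
            },
        ]
        prompts.append(processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

    # Left padding keeps the last position of every row aligned with its next-token logits.
    inputs = processor(images=[image] * len(passages), text=prompts, return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

    with torch.inference_mode():
        logits = critic(**inputs).logits[:, -1, :].float()
    probs = torch.softmax(logits, dim=-1)

    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    return [p for p, prob in zip(passages, probs[:, yes_token_id].tolist()) if prob > yes_prob_threshold]
```

With this helper, the filtering loop collapses to `relevant_passages = run_reag_critic_batched(critic, critic_processor, image, question, passages)`.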

---

## Model Collection

| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (3B) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Generator (7B) |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic (this model) |

---

## Repository & Evaluation

Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.

---

## Citation

```bibtex
@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference},
  year={2026}
}
```