---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-3B: Reasoning-Augmented Generation for KB-VQA


[CVPR 2026 Poster](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[arXiv Paper](https://arxiv.org/abs/2511.22715)
[Project Page](https://aimagelab.github.io/ReAG/)
[Hugging Face Collection](https://huggingface.co/collections/aimagelab/reag)


ReAG-3B is the generator component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). It is based on [Qwen2.5-VL-3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and fine-tuned with a multi-stage strategy: supervised fine-tuning as a cold start, followed by reinforcement learning to promote explicit reasoning grounded in retrieved evidence.


For the full pipeline, a companion **critic model** ([aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic)) first filters noisy retrieved passages before they are fed to this generator.


A larger 7B variant is also available: [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B).


---


## Model Description


Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:


1. **ReAG-Critic** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage).
2. **ReAG-3B (this model)** generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>` (an illustrative example follows below).
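
For illustration only, a well-formed response has the shape below (hypothetical text, not an actual model output):

```text
<think>The image shows wild basil in bloom; the retrieved passage on its uses mentions anti-bacterial properties ...</think><answer>anti-bacterial properties</answer>
```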


ReAG significantly outperforms prior retrieval-augmented methods on both the **Encyclopedic-VQA** and **InfoSeek** benchmarks.


---


## Full Pipeline Usage


The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.


```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

# System prompt that elicits the <think>...</think><answer>...</answer> format
# the generator was trained to produce.
SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.

Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}

The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    # Use bfloat16 + FlashAttention 2 on GPU; fall back to float32 on CPU.
    if torch.cuda.is_available():
        return {
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        }
    return {
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    # Extract the reasoning trace and the final answer from the tagged output.
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Prefill "<think>" so generation starts inside the reasoning block.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    # Drop the prompt tokens and re-attach the prefilled "<think>" tag before parsing.
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

    # Score relevance from the next-token distribution: compare the probability
    # of "Yes" against a low threshold (default 0.1), which favors keeping
    # borderline passages rather than sampling a hard "Yes"/"No".
    with torch.inference_mode():
        outputs = critic(**inputs)
        logits = outputs.logits[:, -1, :].float()
        probs = torch.softmax(logits, dim=-1)

    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with the filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
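
If the critic rejects every passage (or no retrieval is available), the generator can still be queried on its own. A minimal sketch, reusing `image`, `question`, `generator`, `processor`, and `run_reag_generator` from the snippet above:

```python
# Generator-only usage (no retrieved context): the bare question is passed
# without the CONTEXT_VQA_PROMPT template, so the model must rely on its
# parametric knowledge alone.
output = run_reag_generator(generator, processor, image, question)
print("Answer:", output["answer"])
```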


---


## Model Collection


| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (this model) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Larger generator variant |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic |


---


## Repository & Evaluation


Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.


---


## Citation


```bibtex
@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```