---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-3B — Reasoning-Augmented Generation for KB-VQA

[![CVPR 2026 Highlight](https://img.shields.io/badge/CVPR-2026%20Highlight-f9f107.svg)](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[![Paper](https://img.shields.io/badge/Paper-arXiv%202511.22715-B31B1B.svg)](https://arxiv.org/abs/2511.22715)
[![Project Page](https://img.shields.io/badge/🌐-Project%20Page-blue.svg)](https://aimagelab.github.io/ReAG/)
[![HF Collection](https://img.shields.io/badge/🤗-HF%20Collection-yellow.svg)](https://huggingface.co/collections/aimagelab/reag)

ReAG-3B is the generator component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). It is based on [Qwen2.5-VL-3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and fine-tuned with a multi-stage strategy: supervised fine-tuning as a cold start, followed by reinforcement learning to promote explicit reasoning grounded in retrieved evidence.

For the full pipeline, a companion **critic model** ([aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic)) first filters noisy retrieved passages before they are fed to this generator. A larger 7B variant is also available: [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B).

---

## Model Description

Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:

1. **ReAG-Critic** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage).
2. **ReAG-3B (this model)** generates an answer with explicit chain-of-thought reasoning enclosed in `<think> ... </think>` tags, followed by a concise `<answer> ... </answer>`.

ReAG significantly outperforms prior methods on both **Encyclopedic-VQA** and **InfoSeek**.
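For illustration, the raw generator output follows this tag pattern (the reasoning text below is invented for the example, not actual model output):

```text
<think> The image shows wild basil (Clinopodium vulgare). The retrieved passage on its uses mentions anti-bacterial properties. </think><answer> anti-bacterial properties </answer>
```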
---

## Full Pipeline Usage

The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.

```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> short answer here </answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.
4. Compare for relevance: Assess whether the information in the text:
   - Contains at least one sentence that supports answering the question about the image, OR
   - Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question? "Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}
The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    if torch.cuda.is_available():
        return {
            "device": "cuda",
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        }
    return {
        "device": "cpu",
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Pre-fill the opening <think> tag so generation starts inside the reasoning block.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    input_length = inputs["input_ids"].shape[1]
    # Re-attach the pre-filled <think> tag so the output can be parsed.
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str,
                    yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}
    with torch.inference_mode():
        outputs = critic(**inputs)
    # Read the next-token distribution and compare the "Yes"/"No" probabilities.
    logits = outputs.logits[:, -1, :].float()
    probs = torch.softmax(logits, dim=-1)
    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()
device = model_kwargs.pop("device")  # from_pretrained takes device_map, not device

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)

print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
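The hard threshold on the critic's "Yes" probability can be swapped for a top-k selection when the retriever returns many passages. The variant below is a sketch, not part of the official pipeline; `k = 2` is an arbitrary choice, and it assumes `critic`, `critic_processor`, `image`, `question`, and `passages` are defined as in the snippet above.

```python
# Sketch: keep the k passages the critic scores highest, instead of
# thresholding. Reuses run_reag_critic from the full pipeline snippet.
k = 2
scored = [
    (run_reag_critic(critic, critic_processor, image, question, p)["yes_probability"], p)
    for p in passages
]
relevant_passages = [p for _, p in sorted(scored, key=lambda t: t[0], reverse=True)[:k]]
```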
---

## Model Collection

| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (this model) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Larger generator variant |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic |

---

## Repository & Evaluation

Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.

---

## Citation

```bibtex
@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference},
  year={2026}
}
```