---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-3B: Reasoning-Augmented Generation for KB-VQA


[CVPR 2026 Poster](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[arXiv Paper](https://arxiv.org/abs/2511.22715)
[Project Page](https://aimagelab.github.io/ReAG/)
[Hugging Face Collection](https://huggingface.co/collections/aimagelab/reag)


ReAG-3B is the generator component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). It is based on [Qwen2.5-VL-3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and fine-tuned with a multi-stage strategy: supervised fine-tuning as a cold start, followed by reinforcement learning to promote explicit reasoning grounded in retrieved evidence.


For the full pipeline, a companion **critic model** ([aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic)) first filters noisy retrieved passages before they are fed to this generator.


A larger 7B variant is also available: [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B).


---


## Model Description


Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:


1. **ReAG-Critic** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage).
2. **ReAG-3B (this model)** generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>` (an illustrative example follows below).
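
For illustration only, a well-formed response has the shape below (hypothetical text, not an actual model output):

```text
<think>The image shows wild basil in bloom; the retrieved passage on its uses mentions anti-bacterial properties ...</think><answer>anti-bacterial properties</answer>
```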


ReAG significantly outperforms prior retrieval-augmented methods on both the **Encyclopedic-VQA** and **InfoSeek** benchmarks.


---


## Full Pipeline Usage


The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.


```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

# System prompt that elicits the <think>...</think><answer>...</answer> format
# the generator was trained to produce.
SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.

Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}

The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    # Use bfloat16 + FlashAttention 2 on GPU; fall back to float32 on CPU.
    if torch.cuda.is_available():
        return {
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        }
    return {
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    # Extract the reasoning trace and the final answer from the tagged output.
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Prefill "<think>" so generation starts inside the reasoning block.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    # Drop the prompt tokens and re-attach the prefilled "<think>" tag before parsing.
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

    # Score relevance from the next-token distribution: compare the probability
    # of "Yes" against a low threshold (default 0.1), which favors keeping
    # borderline passages rather than sampling a hard "Yes"/"No".
    with torch.inference_mode():
        outputs = critic(**inputs)
        logits = outputs.logits[:, -1, :].float()
        probs = torch.softmax(logits, dim=-1)

    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with the filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
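
If the critic rejects every passage (or no retrieval is available), the generator can still be queried on its own. A minimal sketch, reusing `image`, `question`, `generator`, `processor`, and `run_reag_generator` from the snippet above:

```python
# Generator-only usage (no retrieved context): the bare question is passed
# without the CONTEXT_VQA_PROMPT template, so the model must rely on its
# parametric knowledge alone.
output = run_reag_generator(generator, processor, image, question)
print("Answer:", output["answer"])
```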


---


## Model Collection


| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (this model) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Larger generator variant |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic |


---


## Repository & Evaluation


Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.


---


## Citation


```bibtex
@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```