---
license: apache-2.0
language:
- en
tags:
- multimodal
- visual-question-answering
- retrieval-augmented-generation
- reasoning
- knowledge-based-vqa
- qwen2_5_vl
pipeline_tag: image-text-to-text
---

# ReAG-3B — Reasoning-Augmented Generation for KB-VQA

[![CVPR 2026 Highlight](https://img.shields.io/badge/CVPR-2026%20Highlight-f9f107.svg)](https://cvpr.thecvf.com/virtual/2026/poster/37311)
[![Paper](https://img.shields.io/badge/Paper-arXiv%202511.22715-B31B1B.svg)](https://arxiv.org/abs/2511.22715)
[![Project Page](https://img.shields.io/badge/🌐-Project%20Page-blue.svg)](https://aimagelab.github.io/ReAG/)
[![HF Collection](https://img.shields.io/badge/🤗-HF%20Collection-yellow.svg)](https://huggingface.co/collections/aimagelab/reag)

ReAG-3B is the generator component of **ReAG**, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). It is based on [Qwen2.5-VL-3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) and fine-tuned with a multi-stage strategy: supervised fine-tuning as a cold start, followed by reinforcement learning to promote explicit reasoning grounded in retrieved evidence.

In the full pipeline, a companion **critic model** ([aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic)) filters noisy retrieved passages before they reach this generator.

A larger 7B variant is also available: [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B).

---

## Model Description

Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:

1. **ReAG-Critic** evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage).
2. **ReAG-3B (this model)** generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>`.
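
To make the output format in step 2 concrete, here is a minimal sketch of splitting a response into its reasoning and answer parts. The sample string is written by hand for illustration and is not actual model output, and `split_response` is a hypothetical helper name rather than part of any released API.

```python
import re

# Hand-written example of the tag layout (NOT real model output).
SAMPLE = (
    "<think>The passage on uses mentions anti-bacterial properties.</think>"
    "<answer>anti-bacterial</answer>"
)


def split_response(text: str) -> tuple[str, str]:
    """Return (reasoning, answer).

    If a tag pair is missing, the reasoning defaults to an empty string
    and the answer defaults to the stripped raw text.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final


reasoning, final = split_response(SAMPLE)
print(final)  # prints "anti-bacterial"
```

Falling back to the raw text when the tags are absent keeps downstream evaluation from crashing on truncated generations, at the cost of occasionally scoring unparsed text.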

According to the paper, ReAG significantly outperforms prior methods on both **Encyclopedic-VQA** and **InfoSeek**.

---

## Full Pipeline Usage

The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.

```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.

You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.

You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.

Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.

Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".

Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}

Here is the text passage to analyze:
{passage}

Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}

The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


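def get_model_kwargs():
    if torch.cuda.is_available():
        # FlashAttention-2 needs the optional flash-attn package; fall back to PyTorch SDPA without it.
        from importlib.util import find_spec

        attn = "flash_attention_2" if find_spec("flash_attn") else "sdpa"
        return {
            "device": "cuda",
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": attn,
        }
    return {
        "device": "cpu",
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }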


def parse_reag_output(text: str):
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Prefill the assistant turn with "<think>" so generation starts inside the reasoning block.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    # Decode only the newly generated tokens, then restore the prefilled "<think>" tag.
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}

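    with torch.inference_mode():
        outputs = critic(**inputs)
        # Full-vocabulary next-token distribution at the final prompt position.
        logits = outputs.logits[:, -1, :].float()
        probs = torch.softmax(logits, dim=-1)

    # Assumes "Yes" and "No" are each a single token in the tokenizer vocabulary.
    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")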
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────

image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()
# "device" is informational only; device_map handles placement, so drop it before from_pretrained.
model_kwargs.pop("device")

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with filtered context
# (joining an empty list yields "", so no special case is needed)
context = "\n\n\n".join(relevant_passages)
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
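
A note on the critic scores: `yes_probability` and `no_probability` in `run_reag_critic` come from a softmax over the full vocabulary, so they need not sum to 1, and the default `yes_prob_threshold=0.1` applies to that raw value. If a score that weighs only the two options against each other is easier to work with, one possibility is to renormalize over the two tokens. The helper below is a hypothetical sketch and not part of the released code; the input numbers are made up, and any threshold applied to the renormalized score would need its own tuning.

```python
def renormalized_yes_probability(result: dict) -> float:
    """Share of the combined Yes/No probability mass assigned to "Yes".

    `result` is expected to have the same keys as the dict returned by
    run_reag_critic ("yes_probability", "no_probability").
    """
    yes = result["yes_probability"]
    no = result["no_probability"]
    total = yes + no
    # Guard against the degenerate case where both probabilities are zero.
    return yes / total if total > 0 else 0.0


# Made-up critic output, for illustration only.
example = {"relevant": True, "yes_probability": 0.125, "no_probability": 0.375}
score = renormalized_yes_probability(example)
print(score)  # prints 0.25
```

This is equivalent to a two-way softmax over the "Yes" and "No" logits, which makes scores comparable across passages even when other tokens absorb part of the probability mass.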

---

## Model Collection

| Model | Description |
|---|---|
| [aimagelab/ReAG-3B](https://huggingface.co/aimagelab/ReAG-3B) | Generator (this model) |
| [aimagelab/ReAG-7B](https://huggingface.co/aimagelab/ReAG-7B) | Larger generator variant |
| [aimagelab/ReAG-Critic](https://huggingface.co/aimagelab/ReAG-Critic) | Passage relevance critic |

---

## Repository & Evaluation

Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the **[official GitHub repository](https://github.com/aimagelab/ReAG)**.

---

## Citation

```bibtex
@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference},
  year={2026}
}
```