---
datasets:
- OpenSearch-VL/Search-VL-SFT-36K
- OpenSearch-VL/Search-VL-RL-8K
language:
- en
- zh
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---
# An Open Recipe for Frontier Multimodal Search Agents

Cold-Start Agentic SFT · Multi-Turn Fatal-Aware GRPO · Visual Tool Use
[Paper (arXiv)](https://arxiv.org/pdf/2605.05185) · [alphaXiv](https://www.alphaxiv.org/abs/2605.05185) · [Code (GitHub)](https://github.com/shawn0728/OpenSearch-VL) · [Models & Datasets (Hugging Face)](https://huggingface.co/OpenSearch-VL) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0) · [Python](https://www.python.org/)

## Table of Contents
- [Introduction](#introduction)
- [Overview](#overview)
- [Method Overview](#method-overview)
- [Main Results](#main-results)
- [Case Study](#case-study)
- [Repository Layout](#repository-layout)
- [Prerequisites](#prerequisites)
- [Agentic SFT · `code/SFT`](#agentic-sft--codesft)
- [Agentic RL · `code/RL`](#agentic-rl--coderl)
- [Inference & Evaluation · `code/infer`](#inference--evaluation--codeinfer)
- [TODO](#todo)
- [Acknowledgements](#acknowledgements)
---
## Introduction
**OpenSearch-VL** is a fully open recipe for training frontier multimodal deep-research agents with agentic reinforcement learning. In contrast to standard VLMs that answer in a single forward pass, the agent operates as a closed loop: it inspects the image, crops or enhances the regions of interest, issues web and image searches, visits the retrieved pages, and only then writes an answer grounded in the gathered evidence.
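A minimal sketch of this closed loop, assuming a hypothetical `model.generate` that returns either a tool call or a final answer (the released rollout code in `code/RL` and `code/infer` is the reference):

```python
# Sketch only: `model`, `step`, and the tool signatures are illustrative
# stand-ins, not the released interface.
def run_agent(model, tools, image, question, max_turns=16):
    messages = [{"role": "user", "content": [image, question]}]
    for _ in range(max_turns):
        step = model.generate(messages)        # reasoning + tool call, or answer
        messages.append({"role": "assistant", "content": step.text})
        if step.tool_call is None:             # enough evidence: final answer
            return step.text
        # e.g. crop / image_search / web_search / visit / super_resolution ...
        observation = tools[step.tool_call.name](**step.tool_call.args)
        messages.append({"role": "tool", "content": observation})
    return None                                # turn budget exhausted
```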
Reproducing top-tier multimodal search agents has so far been difficult because the underlying training data, trajectory-synthesis pipelines, and training recipes remain proprietary. This release aims to close that gap: we open-source the **data, code, and model checkpoints** required to reproduce the paper end-to-end.
The recipe addresses three challenges that we found to be largely independent in practice:
- **Data.** A curation pipeline built on top of the Wikipedia hyperlink graph synthesizes image-grounded multi-hop VQA. *Fuzzy entity rewriting* and *source-anchored visual grounding* are used to suppress shortcut solutions in which a single retrieval step is sufficient. The pipeline yields two open datasets: **SearchVL-SFT-36k** for supervised fine-tuning and **SearchVL-RL-8k** for reinforcement learning.
- **Tools.** A unified visual and retrieval tool environment (`crop`, `layout_parsing`, `text_search`, `image_search`, `web_search`, `visit`, `perspective_correct`, `super_resolution`, `sharpen`, `python_interpreter`) is shared across SFT data generation, RL rollout, and inference. This allows the agent both to recover from imperfect visual inputs and to acquire external knowledge through a consistent interface.
- **Algorithm.** A multi-turn **fatal-aware GRPO** algorithm explicitly handles cascading tool failures during long rollouts. Tokens that follow a fatal step are masked out of the policy gradient, while *one-sided advantage clamping* preserves the credit assigned to valid pre-failure reasoning rather than penalizing the entire trajectory.
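A minimal PyTorch sketch of these two mechanics, under one plausible reading of the description above (tensor names and the exact clamping rule are illustrative, not the released trainer's API):

```python
import torch

def fatal_aware_surrogate(logp_new, logp_old, adv, loss_mask, fatal_start, eps=0.2):
    """Clipped GRPO surrogate for one rollout of T tokens.

    fatal_start: index of the first token emitted after a fatal tool failure,
                 or T when the trajectory completed without one.
    """
    mask = loss_mask.clone()
    mask[fatal_start:] = 0.0              # post-failure tokens get no gradient

    if fatal_start < adv.numel():         # fatal trajectory: one-sided clamping
        # Clamping the (typically negative) group-relative advantage from below
        # preserves the credit for valid pre-failure reasoning instead of
        # penalizing every token in the trajectory.
        adv = adv.clamp(min=0.0)

    ratio = (logp_new - logp_old).exp()
    surrogate = torch.minimum(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv)
    return -(surrogate * mask).sum() / mask.sum().clamp(min=1.0)
```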
Across seven knowledge-intensive multimodal benchmarks (SimpleVQA, VDR, MMSearch, LiveVQA, BrowseComp-VL, FVQA, and InfoSeek), OpenSearch-VL improves the average score by more than **10 points** over the corresponding agentic baselines, and at the 30B / 32B scale matches the accuracy of strong proprietary systems.
---
## Overview
This repository provides everything needed to **reproduce, fine-tune, and evaluate** OpenSearch-VL:
| Component | Path | Description |
|-----------|------|-------------|
| **SFT Training** | [`code/SFT/`](code/SFT/) | Agentic cold-start with LLaMA-Factory + Ray + ZeRO-3 (full-parameter fine-tune of LLM + ViT + projector) |
| **RL Training** | [`code/RL/`](code/RL/) | Asynchronous agentic RLOO/GRPO on top of SFT, built on rLLM + verl + Megatron-LM + sglang |
| **Inference & Evaluation** | [`code/infer/`](code/infer/) | Unified `run_infer.py --model {8b,30b-a3b,32b,claude}` rollout + GPT-4o judge for BrowseComp-VL, HLE, VDR-Bench |
| **Models** | [OpenSearch-VL](https://huggingface.co/OpenSearch-VL) | OpenSearch-VL-{8B, 30B-A3B, 32B} checkpoints |
| **Datasets** | [OpenSearch-VL](https://huggingface.co/OpenSearch-VL) | SearchVL-SFT-36k (cold-start) and SearchVL-RL-8k (RL) |
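For a quick smoke test of a released checkpoint, a hedged Hugging Face Transformers sketch (the repo id below is an assumption; check the model cards for the canonical snippet, and note that the full agentic behavior requires the tool loop in `code/infer`):

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Illustrative repo id; see https://huggingface.co/OpenSearch-VL for the
# released 8B / 30B-A3B / 32B checkpoints and their exact names.
model_id = "OpenSearch-VL/OpenSearch-VL-8B"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
```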
### Workflow at a Glance
```
┌────────────────┐      ┌────────────────┐      ┌────────────────────┐
│ Qwen3-VL base  │ ───▶ │  Agentic SFT   │ ───▶ │  Async Agentic RL  │ ───▶ OpenSearch-VL
│  (HF weights)  │      │   (code/SFT)   │      │      (code/RL)     │
└────────────────┘      └────────────────┘      └────────────────────┘
                                │                         │
                                ▼                         ▼
                        SearchVL-SFT-36k          SearchVL-RL-8k
                        7-domain tool-use         Vision-DeepResearch-QA
                        cold-start trajectories   (RLOO / GRPO + fatal-aware)
```
### Tool Environment
OpenSearch-VL is equipped with a heterogeneous tool set $\mathcal{T} = \mathcal{T}_v \cup \mathcal{T}_r$ shared by SFT, RL, and inference:
| Category | Tools | Purpose |
|---|---|---|
| **Retrieval** ($\mathcal{T}_r$) | `text_search`, `image_search`, `web_search`, `visit` | Acquire external textual / visual evidence and visit pages |
| **Image Enhancement** ($\mathcal{T}_v$) | `sharpen`, `super_resolution`, `perspective_correct` | Repair blurry, low-resolution, or skewed inputs before retrieval |
| **Attention & Parsing** ($\mathcal{T}_v$) | `crop`, `layout_parsing` (OCR) | Localize regions of interest and decode fine-grained content |
| **Computation** | `python_interpreter` | Numerical / programmatic computation on retrieved evidence |
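Because one tool interface is shared across SFT data generation, RL rollout, and inference, a small registry/dispatch layer is all the glue required; a sketch with illustrative signatures (the released environment is the reference):

```python
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}

def tool(name: str):
    """Register a function under the name the model emits in its tool calls."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register

@tool("crop")
def crop(image, box):
    # box = (left, top, right, bottom), PIL.Image crop semantics
    return image.crop(box)

@tool("web_search")
def web_search(query: str, top_k: int = 5):
    raise NotImplementedError("back this with your search backend")

def dispatch(call: Dict[str, Any]) -> Any:
    # call = {"name": "crop", "args": {...}}, parsed from the model output
    return TOOLS[call["name"]](**call["args"])
```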
### Quick Links
- **Get started** → [Prerequisites](#prerequisites)
- **Train your own SFT model** → [Agentic SFT](#agentic-sft--codesft)
- **Run agentic RL** → [Agentic RL](#agentic-rl--coderl)
- **Inference & benchmark** → [Inference & Evaluation](#inference--evaluation--codeinfer)
---
## Method Overview
**Data Curation Pipeline.**
Starting from the English Wikipedia hyperlink graph, we sample multi-hop entity paths and convert them into multi-hop VQA instances by (a) extracting canonical question-answer pairs along the path, (b) rewriting each intermediate entity into a fuzzy descriptor while certifying answer invariance and uniqueness, and (c) anchoring the question on a representative image of the **source** node, *not* the answer node, so that single-hop image-lookup shortcuts are eliminated. The pipeline finishes with staged tool-demanding filtering and an enhancement subset (random degradations paired with the corresponding restoration tools) before synthesizing multi-turn expert trajectories with answer- and process-level rejection sampling.
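A toy illustration of steps (a) and (b) on a three-node hyperlink graph; the real pipeline additionally certifies answer invariance and uniqueness, anchors a representative source-node image, and applies the staged filters described above:

```python
import random

# Toy hyperlink graph (the released pipeline walks the Wikipedia link graph).
GRAPH = {
    "Mona Lisa": ["Leonardo da Vinci"],
    "Leonardo da Vinci": ["Republic of Florence"],
}

def sample_path(graph, source, hops):
    """Sample a multi-hop entity path starting from `source`."""
    path = [source]
    for _ in range(hops):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(random.choice(neighbors))
    return path

def fuzzify(entity):
    """Stand-in for fuzzy entity rewriting: swap the literal bridge entity for
    a descriptor so a single retrieval step cannot shortcut the hop."""
    return {"Leonardo da Vinci": "the painter of this artwork"}.get(entity, entity)

path = sample_path(GRAPH, "Mona Lisa", hops=2)             # source -> bridge -> answer
question = f"In which state was {fuzzify(path[1])} born?"  # paired with the source image
answer = path[-1]                                          # "Republic of Florence"
```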