---
datasets:
- OpenSearch-VL/Search-VL-SFT-36K
- OpenSearch-VL/Search-VL-RL-8K
language:
- en
- zh
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---
<div align="center">

<img src="https://github.com/shawn0728/OpenSearch-VL/blob/main/images/logo.png?raw=true" alt="OpenSearch-VL" width="35%">
<h1 style="margin: -18px 0 0; font-size: 1.8em;">
An Open Recipe for Frontier Multimodal Search Agents
</h1>

<p><b>Cold-Start Agentic SFT · Multi-Turn Fatal-Aware GRPO · Visual Tool Use</b></p>

[Code](https://github.com/shawn0728/OpenSearch-VL) · [Models & Datasets](https://huggingface.co/OpenSearch-VL) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0) · [Python 3.10+](https://www.python.org/)

</div>

## 📑 Table of Contents

- [📖 Introduction](#-introduction)
- [🗺️ Overview](#%EF%B8%8F-overview)
- [🍭 Method Overview](#-method-overview)
- [📊 Main Results](#-main-results)
- [🔎 Case Study](#-case-study)
- [📁 Repository Layout](#-repository-layout)
- [🛠️ Prerequisites](#%EF%B8%8F-prerequisites)
- [🏋️ Agentic SFT · `code/SFT`](#%EF%B8%8F-agentic-sft--codesft)
- [🚀 Agentic RL · `code/RL`](#-agentic-rl--coderl)
- [📊 Inference & Evaluation · `code/infer`](#-inference--evaluation--codeinfer)
- [🙌 Acknowledgements](#-acknowledgements)

---

## 📖 Introduction

**OpenSearch-VL** is a fully open recipe for training frontier multimodal deep-research agents with agentic reinforcement learning. In contrast to standard VLMs that answer in a single forward pass, the agent operates as a closed loop: it inspects the image, crops or enhances the regions of interest, issues web and image searches, visits the retrieved pages, and only then writes an answer grounded in the gathered evidence.

Reproducing top-tier multimodal search agents has so far been difficult because the underlying training data, trajectory-synthesis pipelines, and training recipes remain proprietary. This release aims to close that gap: we open-source the **data, code, and model checkpoints** required to reproduce the paper end-to-end.

The recipe addresses three challenges that we found to be largely independent in practice:

- **Data.** A curation pipeline built on top of the Wikipedia hyperlink graph synthesizes image-grounded multi-hop VQA. *Fuzzy entity rewriting* and *source-anchored visual grounding* are used to suppress shortcut solutions in which a single retrieval step is sufficient. The pipeline yields two open datasets: **SearchVL-SFT-36k** for supervised fine-tuning and **SearchVL-RL-8k** for reinforcement learning.
- **Tools.** A unified visual and retrieval tool environment (`crop`, `layout_parsing`, `text_search`, `image_search`, `web_search`, `visit`, `perspective_correct`, `super_resolution`, `sharpen`, `python_interpreter`) is shared across SFT data generation, RL rollout, and inference. This allows the agent both to recover from imperfect visual inputs and to acquire external knowledge through a consistent interface.
- **Algorithm.** A multi-turn **fatal-aware GRPO** algorithm explicitly handles cascading tool failures during long rollouts. Tokens that follow a fatal step are masked out of the policy gradient, while *one-sided advantage clamping* preserves the credit assigned to valid pre-failure reasoning rather than penalizing the entire trajectory.

Across seven knowledge-intensive multimodal benchmarks—SimpleVQA, VDR, MMSearch, LiveVQA, BrowseComp-VL, FVQA, and InfoSeek—OpenSearch-VL improves the average score by more than **10 points** over the corresponding agentic baselines, and at the 30B / 32B scale matches the accuracy of strong proprietary systems.

---

## 🗺️ Overview

This repository provides everything needed to **reproduce, fine-tune, and evaluate** OpenSearch-VL:

| Component | Path | Description |
|-----------|------|-------------|
| **SFT Training** | [`SFT/`](SFT/) | Agentic cold-start with LLaMA-Factory + Ray + ZeRO-3 (full-parameter fine-tune of LLM + ViT + projector) |
| **RL Training** | [`RL/`](RL/) | Asynchronous agentic RLOO/GRPO on top of SFT, built on rLLM + verl + Megatron-LM + sglang |
| **Inference & Evaluation** | [`infer/`](infer/) | Unified `run_infer.py --model {8b,30b-a3b,32b,claude}` rollout + GPT-4o judge for BrowseComp-VL, HLE, VDR-Bench |
| **Models** | [OpenSearch-VL](https://huggingface.co/OpenSearch-VL) | OpenSearch-VL-{8B, 30B-A3B, 32B} checkpoints |
| **Datasets** | [OpenSearch-VL](https://huggingface.co/OpenSearch-VL) | SearchVL-SFT-36k (cold-start) and SearchVL-RL-8k (RL) |

### Workflow at a Glance

```
┌────────────────┐     ┌────────────────┐     ┌────────────────────┐
│ Qwen3-VL base  │ ─── │  Agentic SFT   │ ─── │  Async Agentic RL  │ ───▶ OpenSearch-VL
│  (HF weights)  │     │   (code/SFT)   │     │     (code/RL)      │
└────────────────┘     └────────────────┘     └────────────────────┘
                               │                         │
                               ▼                         ▼
                      SearchVL-SFT-36k            SearchVL-RL-8k
                      7-domain tool-use           Vision-DeepResearch-QA
                      cold-start trajectories     (RLOO / GRPO + fatal-aware)
```

### Tool Environment

OpenSearch-VL is equipped with a heterogeneous tool set $\mathcal{T} = \mathcal{T}_v \cup \mathcal{T}_r$ shared by SFT, RL, and inference:

| Category | Tools | Purpose |
|---|---|---|
| **Retrieval** ($\mathcal{T}_r$) | `text_search`, `image_search`, `web_search`, `visit` | Acquire external textual / visual evidence and visit pages |
| **Image Enhancement** ($\mathcal{T}_v$) | `sharpen`, `super_resolution`, `perspective_correct` | Repair blurry, low-resolution, or skewed inputs before retrieval |
| **Attention & Parsing** ($\mathcal{T}_v$) | `crop`, `layout_parsing` (OCR) | Localize regions of interest and decode fine-grained content |
| **Computation** | `python_interpreter` | Numerical / programmatic computation on retrieved evidence |

### Quick Links

- **Get started** → [Prerequisites](#%EF%B8%8F-prerequisites)
- **Train your own SFT model** → [Agentic SFT](#%EF%B8%8F-agentic-sft--codesft)
- **Run agentic RL** → [Agentic RL](#-agentic-rl--coderl)
- **Inference & benchmark** → [Inference & Evaluation](#-inference--evaluation--codeinfer)

---

## 🍭 Method Overview

**Data Curation Pipeline.**
Starting from the English Wikipedia hyperlink graph, we sample multi-hop entity paths and convert them into multi-hop VQA instances by (a) extracting canonical question–answer pairs along the path, (b) rewriting each intermediate entity into a fuzzy descriptor while certifying answer invariance and uniqueness, and (c) anchoring the question on a representative image of the **source** node — *not* the answer node — so that single-hop image lookup shortcuts are eliminated. The pipeline finishes with staged tool-demanding filtering and an enhancement subset (random degradations paired with the corresponding restoration tools) before synthesizing multi-turn expert trajectories with answer- and process-level rejection sampling.

<div align="center">
<img src="https://github.com/shawn0728/OpenSearch-VL/blob/main/images/data_pipeline.png?raw=true" alt="Data Curation Pipeline" width="85%">
</div>

**RL Training Pipeline.**
Starting from the SFT-initialized checkpoint, we sample a group of multi-turn trajectories against the real environment $\mathcal{E}$. Each trajectory is scored by a composite reward combining final-task accuracy ($r_{\text{acc}}$), process-level search-query quality ($r_{\text{query}}$), and a multiplicative format gate ($r_{\text{fmt}}$). To preserve valid reasoning when a trajectory eventually triggers cascading tool failures, we apply **fatal-aware token masking** to truncate the sequence at the fatal step $f_i$ and **one-sided advantage clamping** ($\hat{A}_i = \max(\widetilde{r}_i, 0)$ for fatal trajectories) during policy optimization, preventing the suppression of viable early reasoning.

<div align="center">
<img src="https://github.com/shawn0728/OpenSearch-VL/blob/main/images/rl_pipeline.png?raw=true" alt="RL Pipeline" width="85%">
</div>
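
Below is a minimal, illustrative sketch of how the fatal-aware masking and one-sided advantage clamping described above can be combined with group-normalized (GRPO-style) advantages. It is not the training code: the helper does not exist in the repository, its argument names are placeholders, and the mean/std group normalization is a simplifying assumption.

```python
import torch

def fatal_aware_advantages(rewards: torch.Tensor,
                           fatal_token_idx: torch.Tensor,
                           seq_len: int):
    """
    rewards:          (G,) composite reward of each trajectory in the group
    fatal_token_idx:  (G,) first response-token index after the fatal tool step,
                      or -1 if the rollout never became fatal
    seq_len:          max response length in tokens
    """
    # GRPO-style group normalization of the scalar rewards
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    loss_mask = torch.ones(len(rewards), seq_len)
    for i, start in enumerate(fatal_token_idx.tolist()):
        if start >= 0:
            # One-sided clamping: a fatal trajectory never carries a negative advantage,
            # so its valid pre-failure reasoning is not pushed down.
            adv[i] = torch.clamp(adv[i], min=0.0)
            # Fatal-aware masking: tokens that follow the fatal step are dropped
            # from the policy-gradient loss.
            loss_mask[i, start:] = 0.0
    return adv, loss_mask
```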

---

## 📊 Main Results

OpenSearch-VL is built on three Qwen3-VL variants and evaluated on **7 multimodal knowledge-intensive QA / web-search benchmarks** under the same Pass@1 + GPT-4o judge protocol as VDR-Bench.

**Highlights.**
*OpenSearch-VL-8B* is the strongest open 8B agent (**+3.9** Avg over SenseNova-MARS-8B). *OpenSearch-VL-30B-A3B* improves the Qwen3-VL agentic baseline by **+13.8** Avg, with large gains on **VDR (+13.3)**, **MMSearch (+24.5)**, **FVQA (+10.2)**, and **InfoSeek (+16.2)**. *OpenSearch-VL-32B* surpasses Gemini-2.5-Pro and Claude-4-Sonnet direct-reasoning baselines on most benchmarks.

**Fatal-aware GRPO ablation.**
Vanilla search-augmented GRPO improves SFT 64.6 → 67.6 Avg; the hard-masking baseline of Vision-DeepResearch saturates at 67.7; **fatal masking** alone reaches 69.1; our **full method with one-sided advantage clamping reaches 71.8** — a +4.2 gain over vanilla GRPO and the best score on every benchmark.

<div align="center">
<img src="https://github.com/shawn0728/OpenSearch-VL/blob/main/images/turn_acc_combined.png?raw=true" width="65%" alt="turn_acc_combined">
</div>

Fatal-aware GRPO sustains a **higher number of tool-use turns *and* a higher batch accuracy** than vanilla GRPO and the Hard-Mask baseline — encouraging productive exploration rather than prematurely suppressing difficult rollouts.

---

## 🔎 Case Study

The example below illustrates a representative OpenSearch-VL trajectory on a knowledge-intensive visual question: **“In what year did this bridge open?”** The answer cannot be read directly from the image or reliably produced by parametric knowledge alone. Instead, the agent progressively grounds the query through tool use.

<div align="center">
<img src="https://github.com/shawn0728/OpenSearch-VL/blob/main/images/case_study.png?raw=true" width="88%" alt="OpenSearch-VL case study">
</div>

**Tool-use flow.**
1. **Visual inspection** — The agent identifies the roadside sign as the most useful visual clue.
2. **Crop** — It zooms into the sign to obtain a cleaner local view.
3. **Image search** — The cropped region helps identify the structure as the **Kessock Bridge**.
4. **Text search / verification** — A targeted search verifies the official opening year as **1982**.

This case highlights the core behavior encouraged by OpenSearch-VL: the agent does not guess from a single model pass, but chains visual perception, image retrieval, and textual evidence acquisition until the answer is grounded.
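
Schematically, the multi-turn loop behind such a trajectory looks roughly like the sketch below. This is an illustration only, not the repository's implementation (the real loop lives in `opensearch_infer/pipeline.py`); the `chat` callable, the JSON tool-call convention, and all helper names are assumptions made for the example.

```python
import json
from typing import Callable, Dict, List, Optional

def run_case(chat: Callable[[List[dict]], str],
             tools: Dict[str, Callable[..., str]],
             question: str,
             max_turns: int = 16) -> Optional[str]:
    """Toy agent loop: reason -> optional tool call -> observation -> ... -> final answer."""
    messages: List[dict] = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = chat(messages)                        # one reasoning / tool-call step
        messages.append({"role": "assistant", "content": reply})
        if not reply.lstrip().startswith("{"):        # plain text -> treat as the final answer
            return reply
        call = json.loads(reply)                      # e.g. {"name": "crop", "arguments": {...}}
        observation = tools[call["name"]](**call.get("arguments", {}))
        messages.append({"role": "tool", "content": observation})
    return None                                       # turn budget exhausted
```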

---

## 📁 Repository Layout

```
code/
├── SFT/                        # agentic supervised fine-tuning (LLaMA-Factory fork)
│   ├── examples/agentic_full/  # training YAMLs for Qwen2.5-VL / Qwen3-VL / Qwen3.5-VL
│   ├── examples/deepspeed/     # ZeRO-2 / ZeRO-3 configs
│   ├── data/dataset_info.json  # 7 agentic-SFT datasets (relative paths)
│   ├── src/llamafactory/       # LLaMA-Factory source (trainer, data loader, CLI)
│   └── README.md               # SFT-specific instructions
│
├── RL/                         # reinforcement learning on the SFT checkpoint
│   ├── rllm/                   # rLLM + verl + vision_deepresearch_async_workflow/
│   ├── Megatron-LM/            # Megatron backend
│   ├── mbridge/                # HF ↔ Megatron parallelism bridge
│   └── README.md               # RL-specific instructions
│
├── infer/                      # inference + benchmark evaluation
│   ├── run_infer.py            # unified entrypoint (--model 8b|32b|30b-a3b|claude)
│   ├── run_infer.sh            # env-driven launcher around run_infer.py
│   ├── eval_with_gpt4o.py      # GPT-4o judge for BrowseComp-VL / HLE / VDR-Bench
│   ├── run_eval.sh             # env-driven judge driver across all benchmarks
│   ├── .env.example            # template for inference / judge / search keys
│   └── opensearch_infer/       # modular package: config, prompts, auth, tools,
│                               #   search, runners (Claude + Qwen3-VL dense / MoE),
│                               #   message converters, multi-turn pipeline
│
└── README.md                   # (this file)
```

---

## 🛠️ Prerequisites

| Component | Minimum |
| ----------- | ------------------------------------------------------------- |
| Python | 3.10+ |
| CUDA | 12.1+ (12.4 recommended) |
| PyTorch | ≥ 2.4 with CUDA support |
| GPU | ≥ 1× H100 / H800 / A100-80GB for 8B (multi-node for 30B / 32B) |
| NCCL / RDMA | InfiniBand / RoCE recommended for multi-node; see `RL/rllm/.env.example` |

The three components share most of their Python dependencies (PyTorch, `transformers`, `transformer_engine`, `flash-attn`, `deepspeed`, `ray`, `qwen-vl-utils`, `sglang`) — we recommend installing each sub-project into its **own virtual environment**.

### External API keys

All keys are optional; components gracefully no-op if unset.

| Variable | Used by | Purpose |
| --- | --- | --- |
| `API_GATEWAY_HOST` / `API_GATEWAY_USER` / `API_GATEWAY_KEY` | RL | Optional HMAC-secured gateway that proxies Serper + Jina behind one credential (set on RL workers). |
| `API_HOST` / `API_USER` / `API_KEY` | infer | Same gateway, named to match the inference package's env vars. |
| `SERPER_API_KEY` | RL, infer | [Serper.dev](https://serper.dev) text & image search (used when no gateway is configured). |
| `JINA_API_KEY` | RL, infer | [Jina AI](https://jina.ai) reader (page visit / content extraction). |
| `QWEN_API_BASE` / `QWEN_MODEL_NAME` | infer | OpenAI-compatible chat-completions server used for search summarization (defaults to a local Qwen3-32B). |
| `LAYOUT_PARSING_API_URL` / `LAYOUT_PARSING_TOKEN` | RL, infer | PP-StructureV3-compatible OCR / layout endpoint. |
| `CLAUDE_API_HOST` / `CLAUDE_API_USER` / `CLAUDE_API_KEY` | infer | Optional HMAC-secured gateway for the Claude Opus 4.5 backend. |
| `JUDGE_API_BASE_URL` / `JUDGE_APP_ID` / `JUDGE_APP_KEY` / `JUDGE_MODEL_MARKER` | infer | OpenAI-compatible GPT-4o judge used by `eval_with_gpt4o.py`. |
| `QWEN3VL_8B_PATH` / `QWEN3VL_32B_PATH` / `QWEN3VL_30B_A3B_PATH` | infer | Local checkpoints for the three Qwen3-VL variants (overrideable via `--checkpoint`). |
| `FVQA_IMAGE_DIR` | infer | Optional fallback directory of `<case_id>.<ext>` images used when a benchmark URL is unreachable. |
| `WANDB_API_KEY` | SFT, RL | W&B logging. |

Two templates are provided: [`RL/rllm/.env.example`](RL/rllm/.env.example) for the RL workers, and [`infer/.env.example`](infer/.env.example) for inference + judge. Copy whichever applies and source it before launching.

---

## 🏋️ Agentic SFT · `code/SFT`

Cold-starts the base VLM on **7 tool-use datasets** (FVQA, Palace, WebQA, LiveVQA, WikiArt, Wiki-zh, Wiki-en — together forming **SearchVL-SFT-36k**, with an average of 6.3 tool-invocation turns per trajectory). We perform a **full-parameter fine-tune of the LLM + vision tower + projector** with DeepSpeed ZeRO-3, distributed via Ray.

### Install

```bash
cd code/SFT
pip install -e ".[torch,metrics,deepspeed,ray]"
pip install qwen-vl-utils pillow av decord torchvision flash-attn
```

### Data layout

Download the **SearchVL-SFT-36k** bundle from the [HuggingFace collection](https://huggingface.co/OpenSearch-VL) and place the 7 sub-sets under `code/SFT/data/` so that the **relative** `file_name` values in [`data/dataset_info.json`](SFT/data/dataset_info.json) resolve:

```
SFT/data/
├── dataset_info.json
├── new_fvqa/fvqa_llama_factory_clean.json
├── palace/palace_llama_factory_filtered.json
├── WebQA/webqa_llama_factory_filtered.json
├── new_livevqa/livevqa_llama_factory_filtered.json
├── wikiart/wikiart_llama_factory_filtered.json
├── wiki_en/wiki_en_llama_factory_filtered.json
└── wiki_zh/wiki_zh_llama_factory_filtered.json
```

Each JSON is in **ShareGPT format** with `conversations`, `images`, `system`, and `tools` columns.
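
For reference, one record in this layout looks roughly like the sketch below. The field names follow LLaMA-Factory's ShareGPT convention; the image path, tool-call arguments, and conversation content are purely illustrative, not actual dataset entries.

```python
# Schematic SearchVL-SFT record (illustrative values; real trajectories are much longer).
example_record = {
    "system": "You are a visual investigation agent with access to crop, search, and visit tools ...",
    "tools": "[{\"name\": \"crop\", ...}, {\"name\": \"text_search\", ...}]",  # JSON-encoded tool schema
    "images": ["new_fvqa/images/000123.jpg"],
    "conversations": [
        {"from": "human",         "value": "<image>In what year did this bridge open?"},
        {"from": "function_call", "value": "{\"name\": \"crop\", \"arguments\": {\"bbox\": [412, 230, 655, 398]}}"},
        {"from": "observation",   "value": "<image>"},
        {"from": "gpt",           "value": "The sign identifies the Kessock Bridge, which opened in 1982."},
    ],
}
```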

### Launch

```bash
cd code/SFT

# Multi-node Ray (16 nodes × 8 GPU by default):
USE_RAY=1 llamafactory-cli train \
    examples/agentic_full/qwen3_vl_full_sft_8b_ray.yaml

# Single-node smoke test:
FORCE_TORCHRUN=1 NNODES=1 NPROC_PER_NODE=8 llamafactory-cli train \
    examples/agentic_full/qwen3_vl_full_sft_8b_ray.yaml
```

### Available training configs

Edit `ray_num_workers`, `placement_strategy`, and NCCL / IB vars to match your cluster.

| YAML | Model | # workers |
| ---- | ----- | --------- |
| `qwen3_vl_full_sft_8b_ray.yaml` | Qwen3-VL-8B-Instruct | 256 |
| `qwen3_vl_full_sft_30_3b_ray.yaml` | Qwen3-VL-30B-A3B-Instruct | 256 |
| `qwen3_vl_full_sft_32b_ray.yaml` | Qwen3-VL-32B-Instruct | 256 |
| `qwen3_5vl_full_sft_27b_ray.yaml` | Qwen3.5-VL-27B-Instruct | 256 |
| `qwen3_5vl_full_sft_35b_3b_ray.yaml` | Qwen3.5-VL-35B-A3B | 256 |
| `qwen2_5_vl_full_sft_7b_ray.yaml` | Qwen2.5-VL-7B-Instruct | 256 |
| `qwen2_5_vl_full_sft_32b_ray.yaml` | Qwen2.5-VL-32B-Instruct | 256 |
| `qwen2_5_vl_full_sft_72b_ray.yaml` | Qwen2.5-VL-72B-Instruct | 256 |

### Shared hyper-parameters

| Hyperparameter | Value |
| --- | --- |
| Cutoff length | `32000` |
| Precision | `bf16` |
| Learning rate | `2e-5` (cosine, `warmup_ratio=0.1`) |
| Epochs | `8` |
| Per-device batch | `1` (with `gradient_checkpointing: true`) |
| DeepSpeed | ZeRO-3 (`examples/deepspeed/ds_z3_config.json`) |
| Frozen modules | none (`freeze_vision_tower: false`, `freeze_multi_modal_projector: false`) |

Checkpoints land in `saves/<model>/full/sft_data_v1/` (override via `output_dir` / `ray_storage_path` in the YAML).

> Full details, dataset format, and cluster notes: [`code/SFT/README.md`](SFT/README.md).

---

## 🚀 Agentic RL · `code/RL`

**Asynchronous agentic RLOO / GRPO / PPO** on top of the SFT checkpoint, using [rLLM](https://github.com/rllm-org/rllm)'s `AgentWorkflowEngine`, [verl](https://github.com/volcengine/verl) as the policy-optimization backend, and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) + [mbridge](https://github.com/ISEEKYAN/mbridge) for large-scale model parallelism. Trajectories are rolled out by sglang; Megatron handles actor / ref updates.

### Install

```bash
cd code/RL/rllm && pip install -e .
cd ../Megatron-LM && pip install -e .
cd ../mbridge && pip install -e .
pip install "sglang[all]" transformer_engine flash-attn \
    ray==2.34.* hydra-core omegaconf wandb \
    pillow requests python-dotenv
```

Copy the env template:

```bash
cp RL/rllm/.env.example RL/rllm/.env   # edit keys as needed
```

### Data preparation

The workflow expects `rllm.data.DatasetRegistry` to hold a dataset named `Vision-DeepResearch-QA` (i.e. **SearchVL-RL-8k**). Two helpers in `RL/rllm/vision_deepresearch_async_workflow/data_prepare/` handle the conversion:

```bash
cd RL/rllm/vision_deepresearch_async_workflow/data_prepare

# 1) Extract embedded image bytes → PNG + JSONL
DATA_ROOT=./data/Vision-DeepResearch-RL-Data \
    bash convert_parquet2jsonl.sh

# 2) Register it with rLLM as "Vision-DeepResearch-QA" (90 / 10 split)
JSONL_PATH=./data/Vision-DeepResearch-RL-Data/vision-deepresearch_RL_Demo_1k.jsonl \
    bash register_rl_dataset.sh
```

### Launch

All run scripts `cd` into `rllm/`, auto-source `.env`, and call `python -m vision_deepresearch_async_workflow.train_deepresearch_workflow_megatron` with the right Hydra overrides.

```bash
# Primary configuration in the paper: 8B dense, 8 nodes × 8 GPU
bash RL/rllm/vision_deepresearch_async_workflow/run/qwen3-vl-8b-multi-node.sh

# Other presets:
bash RL/rllm/vision_deepresearch_async_workflow/run/qwen3-vl-8b-single-node.sh    # 1-node smoke test
bash RL/rllm/vision_deepresearch_async_workflow/run/qwen3-vl-30b-3b-multi-node.sh # 30B-A3B MoE
bash RL/rllm/vision_deepresearch_async_workflow/run/qwen3-vl-32b-multi-node.sh    # 32B dense
```

### Key hyper-parameters (8B multi-node)

| Field | Value |
| --- | --- |
| Advantage estimator | `rloo` (set `grpo` / `reinforce_plus_plus` to swap) |
| KL coefficient | `0.001` |
| Clip ratio (high) | `0.28` |
| Train prompt batch | `256` (group size `n_resp_per_prompt=8`, mini-batch `64`) |
| Max prompt / response length | `4096` / `70000` |
| Megatron parallelism | `TP=4 / PP=2 / CP=8` (dense) |
| sglang rollout | `TP=4`, `gpu_memory_utilization=0.85` |
| Reward composition | $r = r_{\text{fmt}} \cdot [\,0.8\, r_{\text{acc}} + 0.2\, r_{\text{query}}\,]$ (see the sketch below) |
| Fatal threshold $K$ | `3` consecutive tool-execution errors |
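
As a concrete reading of the reward and fatal-threshold rows, here is a minimal sketch (illustrative only; the weights come from the table above, and the helper names are not the repository's actual function names):

```python
def composite_reward(r_fmt: float, r_acc: float, r_query: float) -> float:
    """r = r_fmt * (0.8 * r_acc + 0.2 * r_query); r_fmt acts as a multiplicative format gate."""
    return r_fmt * (0.8 * r_acc + 0.2 * r_query)

def is_fatal(tool_statuses: list[bool], k: int = 3) -> bool:
    """A trajectory is flagged fatal after K consecutive tool-execution errors (K = 3 above)."""
    consecutive = 0
    for ok in tool_statuses:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= k:
            return True
    return False
```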

Checkpoints go to `checkpoints/${project_name}/${exp_name}/`; trajectories can be dumped to `$TRAJ_DUMP_DIR` (default `./trajectory_dumps/<exp>/`).

### Reproducing the paper

| Variant | Script | Cluster |
| --- | --- | --- |
| OpenSearch-VL-8B | `qwen3-vl-8b-multi-node.sh` | 8 × 8 H100 / H800 |
| OpenSearch-VL-30B-A3B | `qwen3-vl-30b-3b-multi-node.sh` | 8 × 8 H100 / H800 |
| OpenSearch-VL-32B | `qwen3-vl-32b-multi-node.sh` | 16 × 8 H100 / H800 |

> Full details, environment variables, and cluster notes: [`code/RL/README.md`](RL/README.md).

---

## 📊 Inference & Evaluation · `code/infer`

Modular Python package that drives the trained model as a **tool-using Visual Investigation Agent**, plus a **GPT-4o judge** for standardized benchmark scoring. The same agent loop, tool environment and search/visual utilities are shared across all three Qwen3-VL backends and the optional Claude Opus 4.5 backend; the variant is selected with a single `--model` flag.

### Package layout

```
infer/
├── run_infer.py           # unified entrypoint (--model 8b|32b|30b-a3b|claude)
├── run_infer.sh           # env-driven wrapper around run_infer.py
├── run_eval.sh            # env-driven judge driver across all benchmarks
├── eval_with_gpt4o.py     # GPT-4o judge for BrowseComp-VL / HLE / VDR-Bench
├── .env.example           # full env-variable template
└── opensearch_infer/
    ├── config.py          # env-driven settings + ModelSpec registry
    ├── prompts.py         # Visual Investigation Agent system prompt
    ├── auth.py            # HMAC helper + Claude gateway client
    ├── cos_upload.py      # optional COS uploader bootstrap
    ├── image_io.py        # image download / decode / cache utilities
    ├── image_engines.py   # PIL + OpenCV crop / OCR / enhance pipelines
    ├── search.py          # text_search / image_search / layout_parsing
    ├── tools.py           # JSON tool schema + parsing + dispatcher
    ├── messages.py        # Gemini ↔ Claude / Qwen3-VL converters
    ├── runners.py         # ClaudeRunner + Qwen3VLRunner (dense + MoE)
    └── pipeline.py        # per-case multi-turn agent loop
```

### Rollout

One entrypoint, four backends. Each call accepts a parquet of questions + images and writes one trajectory JSON per sample:

| `--model` | Backend |
| ----------- | ---------------------------------------------------- |
| `8b` | OpenSearch-VL-8B (Qwen3-VL-8B base, dense) |
| `32b` | OpenSearch-VL-32B (Qwen3-VL-32B base, dense) |
| `30b-a3b` | OpenSearch-VL-30B-A3B (Qwen3-VL-30B-A3B base, MoE) |
| `claude` | Claude Opus 4.5 via HMAC gateway (no GPU required) |

Multi-GPU model parallelism is enabled automatically when `--gpus 0,1,...` lists more than one device (`device_map="auto"`); single-GPU placement uses `device_map={"": "cuda:N"}`. The MoE scatter dtype patch for 30B-A3B is applied automatically.
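
The snippet below sketches what those two placement modes correspond to at the `transformers` level. It is only illustrative: the actual loading logic lives in `opensearch_infer/runners.py`, and the use of `AutoModelForImageTextToText` is an assumption about how the Qwen3-VL checkpoints are loaded in a recent `transformers` release.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

ckpt = "/path/to/OpenSearch-VL-8B"  # or the 32B / 30B-A3B checkpoints
processor = AutoProcessor.from_pretrained(ckpt)

# --gpus 0,1,2,3  ->  shard the model across all visible GPUs
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)

# --gpus 0  ->  place the whole model on a single device
model = AutoModelForImageTextToText.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map={"": "cuda:0"}
)
```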

```bash
# Source the env template first; only the entries you need have to be filled in.
cp infer/.env.example ~/.opensearch-vl.env
source ~/.opensearch-vl.env

# Dense Qwen3-VL-8B on a single GPU
python infer/run_infer.py --model 8b --gpus 0 \
    --data-path /path/to/benchmark.parquet \
    --output-dir ./outputs/opensearch_vl_8b \
    --start 0 --end 1000

# MoE Qwen3-VL-30B-A3B with 4-way model parallel (auto-applies the scatter dtype patch)
python infer/run_infer.py --model 30b-a3b --gpus 0,1,2,3 \
    --checkpoint /path/to/OpenSearch-VL-30B-A3B \
    --data-path /path/to/benchmark.parquet \
    --output-dir ./outputs/opensearch_vl_30b_a3b

# Claude Opus 4.5 (CLAUDE_API_HOST / _USER / _KEY required)
python infer/run_infer.py --model claude \
    --data-path /path/to/benchmark.parquet \
    --output-dir ./outputs/claude_opus
```

The shell wrapper `infer/run_infer.sh` reads the same parameters from environment variables (`MODEL`, `GPUS`, `DATA_PATH`, `OUTPUT_DIR`, `LIMIT`, `CATEGORY`, ...) for one-line invocations.

### Benchmark evaluation

[`eval_with_gpt4o.py`](infer/eval_with_gpt4o.py) consumes the trajectory directory produced above and calls a GPT-4o-class judge to compute per-sample correctness using the **VDR-Bench evaluation prompt**:

```bash
python infer/eval_with_gpt4o.py \
    --traj_dir ./outputs/opensearch_vl_8b/bc_vl_level1 \
    --benchmark bc_vl \
    --max_workers 20       # --benchmark is one of: hle | bc_vl | vdr
```

`--answer_file` is required for VDR-Bench (pass the `.parquet` with `id` / `answer` columns).

[`run_eval.sh`](infer/run_eval.sh) is a thin driver that chains the five reported evaluations (BrowseComp-VL L1, BrowseComp-VL L2, HLE, VDR-Bench testmini × 2 models). Configure trajectory directories via env variables (`TRAJ_BC_VL_LEVEL1`, `TRAJ_BC_VL_LEVEL2`, `TRAJ_HLE`, `TRAJ_VDR_PRIMARY`, `TRAJ_VDR_SECONDARY`, `VDR_ANSWER_PARQUET`) and run:

```bash
bash infer/run_eval.sh --workers 20
```

> Full inference details: [`code/infer/README.md`](infer/README.md).

---

## 🙌 Acknowledgements

This repository bundles and builds on several outstanding open-source projects; each sub-directory retains its upstream `LICENSE`:

- [**LLaMA-Factory**](https://github.com/hiyouga/LLaMA-Factory) — SFT trainer and CLI (`code/SFT/`).
- [**rLLM**](https://github.com/rllm-org/rllm) and [**verl**](https://github.com/volcengine/verl) — agentic RL framework (`code/RL/rllm/`).
- [**Megatron-LM**](https://github.com/NVIDIA/Megatron-LM) and [**mbridge**](https://github.com/ISEEKYAN/mbridge) — model-parallel backend (`code/RL/Megatron-LM/`, `code/RL/mbridge/`).
- [**sglang**](https://github.com/sgl-project/sglang) — async rollout engine.
- [**Qwen-VL**](https://github.com/QwenLM/Qwen3-VL) — base VLM checkpoints.
- We also thank the authors of [Search-R1](https://github.com/PeterGriffinJin/Search-R1) and [Vision-DeepResearch](https://github.com/Alibaba-NLP/Vision-DeepResearch), whose ideas inspired our multi-turn search-augmented RL formulation.

Project-specific additions are released under the root [`LICENSE`](LICENSE) (Apache 2.0).

---