OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Abstract
A simple supervised fine-tuning approach achieves state-of-the-art performance in deep search capabilities using minimal data, outperforming complex industrial pipelines and demonstrating the effectiveness of academic-led development in large language model agents.
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach can be surprisingly powerful for training frontier search agents. By introducing three simple data-synthesis modifications (scaling knowledge graph size for richer exploration, expanding the tool set for broader functionality, and enforcing strict low-step filtering), we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (among 30B-sized agents using the ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch, trained with a heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
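The "strict low-step filtering" modification can be illustrated as a selection pass over synthesized trajectories: drop any question a reference solver finishes in too few tool calls, so that only multi-hop, high-difficulty examples reach SFT. The sketch below is a minimal illustration under assumed names and thresholds (`Trajectory`, `min_steps=5`); it is not the paper's actual implementation.

```python
# Hypothetical sketch of strict low-step filtering: keep only trajectories
# that a reference solver completed successfully AND needed many tool calls
# to finish. All field names and the threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Trajectory:
    question: str
    steps: int    # number of tool calls the reference solver used
    solved: bool  # whether the reference solver reached the right answer

def strict_low_step_filter(trajs, min_steps=5):
    """Drop trajectories solvable in fewer than `min_steps` tool calls."""
    return [t for t in trajs if t.solved and t.steps >= min_steps]

trajs = [
    Trajectory("easy single lookup", steps=2, solved=True),
    Trajectory("multi-hop reasoning chain", steps=9, solved=True),
    Trajectory("unsolved question", steps=20, solved=False),
]
kept = strict_low_step_filter(trajs)
print([t.question for t in kept])  # ['multi-hop reasoning chain']
```

Unsolved trajectories are dropped here as well, on the assumption that only verified-correct traces are usable as SFT targets; the real pipeline may treat failures differently.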
Community
the real clever twist here is the data curriculum itself: with informative, high-difficulty trajectories, a plain sft-only 30b model can rival heavy cpt+sft+rl stacks. they do it by enlarging the evidence subgraph (bumping k), expanding the tool set, and enforcing strict low-step filtering, all while keeping training purely sft. this combination seems to act as a curriculum that forces multi-hop reasoning under long contexts, which the 256k window and up to 200 tool calls per trajectory enable. the arxivlens breakdown helped me parse the method details—worth a read for folks trying to audit the data flow: https://arxivlens.com/PaperView/Details/openseeker-v2-pushing-the-limits-of-search-agents-with-informative-and-high-difficulty-trajectories-5035-ac8e2c31. one q: how sensitive are the gains to the exact k and tool-set size when you move to different domains?
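The "bumping k" idea mentioned above (enlarging the evidence subgraph around a seed entity) can be pictured as a k-hop neighborhood expansion on a knowledge graph: a larger k pulls in more entities and relations, which forces longer multi-hop chains in the synthesized questions. The toy adjacency map and BFS below are illustrative assumptions, not the paper's actual KG construction.

```python
# Illustrative sketch of evidence-subgraph expansion: collect all nodes
# within k hops of a seed entity via breadth-first search. Raising k yields
# a richer subgraph and, in turn, harder multi-hop questions.
from collections import deque

def k_hop_subgraph(adj, seed, k):
    """Return the set of nodes within k hops of `seed` in adjacency map `adj`."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop radius
        for nbr in adj.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Toy chain graph: A -> B -> C -> D
adj = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
print(sorted(k_hop_subgraph(adj, "A", 1)))  # ['A', 'B']
print(sorted(k_hop_subgraph(adj, "A", 3)))  # ['A', 'B', 'C', 'D']
```

How sensitive the downstream gains are to the exact k (the commenter's question) is not something this sketch can answer; it only makes the knob itself concrete.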
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data (2026)
- Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design (2026)
- DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data (2026)
- SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans (2026)
- GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum (2026)
- LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent (2026)
- OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis (2026)