arxiv:2605.05724

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Published on May 7 · Submitted by Ethan Ning on May 8

Abstract

AI-generated summary: Auto research operates as an empirical loop where agents iteratively refine code based on evaluation feedback, achieving improved performance across multiple tasks without human intervention.

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by 0.81%, raises NanoChat-D12 CORE by 38.7%, and reduces CIFAR-10 Airbench96 wallclock by 4.59%, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.
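
The loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: every name here (`Trial`, `Lineage`, `run_loop`, `toy_propose`, `toy_evaluate`) is hypothetical, the "code edit" is reduced to a single learning-rate value, and the evaluator is a made-up function where lower scores are better and a value above 1.0 counts as a crash. It only shows how a failure label recorded in shared lineage can shape the next proposal.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trial:
    hypothesis: str                       # why the agent thinks the edit helps
    edit: dict                            # stand-in for an executable code diff
    score: Optional[float] = None         # set by the evaluator, never by the agent
    failure_label: Optional[str] = None   # e.g. "crash"; None if the trial ran


@dataclass
class Lineage:
    trials: List[Trial] = field(default_factory=list)

    def best(self) -> Optional[Trial]:
        # Assumption: lower score is better (as with a loss or bpb metric).
        scored = [t for t in self.trials if t.score is not None]
        return min(scored, key=lambda t: t.score) if scored else None


def run_loop(
    propose: Callable[[Lineage], Trial],
    evaluate: Callable[[Trial], Tuple[Optional[float], Optional[str]]],
    n_trials: int,
) -> Lineage:
    lineage = Lineage()
    for _ in range(n_trials):
        trial = propose(lineage)                           # agent reads lineage
        trial.score, trial.failure_label = evaluate(trial)  # evaluator owns outcome
        lineage.trials.append(trial)                       # failures are kept too
    return lineage


def toy_evaluate(trial: Trial) -> Tuple[Optional[float], Optional[str]]:
    # Hypothetical evaluator: "crashes" for lr > 1.0, otherwise a quadratic loss.
    lr = trial.edit["lr"]
    if lr > 1.0:
        return None, "crash"
    return (lr - 0.25) ** 2, None


def toy_propose(lineage: Lineage) -> Trial:
    # Hypothetical agent: reacts to the last failure label or the best trial.
    if not lineage.trials:
        return Trial("start with a large learning rate", {"lr": 2.0})
    last = lineage.trials[-1]
    if last.failure_label == "crash":
        return Trial("halve lr after crash", {"lr": last.edit["lr"] / 2})
    best = lineage.best()
    return Trial("halve lr from best trial", {"lr": best.edit["lr"] / 2})


if __name__ == "__main__":
    for t in run_loop(toy_propose, toy_evaluate, n_trials=5).trials:
        print(t.hypothesis, t.edit, t.score, t.failure_label)
```

The design point of the sketch is that the crashed first trial stays in the lineage with its label, so the next proposal is a response to that outcome rather than an independent guess.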

Community

Paper submitter

Closed-loop auto research turns agent-written code, real experiments, and evaluator feedback into an autonomous feedback loop that develops non-trivial training recipes.

the most interesting nugget here is how lineage feedback lets evaluator outcomes morph into concrete, program-level edits rather than just one-shot suggestions. crashes, timeouts, and artifact limits are absorbed and routed into next-trial edits that touch architecture, optimization schedules, or data handling, all while keeping an auditable trace. that detail, with specialist agents sharing a lineage and translating external signals into code diffs, feels like a practical blueprint for scalable auto research. arxivlens had a solid breakdown that helped me parse the method, e.g. this page: https://arxivlens.com/PaperView/Details/auto-research-with-specialist-agents-develops-effective-and-non-trivial-training-recipes-6929-1546ee80. do you think this could be destabilized by feedback loops chasing short-term fixes in uneven compute budgets?


