arxiv:2604.09408

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Published on Apr 29 · Submitted by Tu Trinh on May 5

Abstract

Frontier AI agents struggle with judgment calls about when to seek help, leading to poor performance on incomplete or ambiguous tasks despite having sufficient capabilities.

AI-generated summary

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode: they supply unambiguous, detailed instructions and reward only execution correctness, so an agent that makes a lucky guess about a missing requirement scores identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective-escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large, universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when it must decide whether to ask. Failure analysis identifies three characteristic help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; and broad, imprecise escalation without self-correction. The consistency of these patterns across domains indicates that poor help-seeking is a model-level flaw, not a task-specific one. RL training on a shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
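
Concretely, Ask-F1 is the standard harmonic mean applied to question precision and blocker recall. A minimal sketch of how such a metric could be computed follows; the function name, argument names, and the zero-division conventions are assumptions for illustration, not the paper's implementation:

def ask_f1(useful_questions, total_questions, blockers_asked_about, total_blockers):
    # Question precision: fraction of the agent's questions that target a real blocker.
    precision = useful_questions / total_questions if total_questions else 0.0
    # Blocker recall: fraction of ground-truth blockers the agent surfaced by asking.
    recall = blockers_asked_about / total_blockers if total_blockers else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean: spamming questions drives precision toward zero, and never
    # asking yields zero recall, so both strategies push the score toward 0.
    return 2 * precision * recall / (precision + recall)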

Community

Paper author · Paper submitter

Frontier agents can solve up to 89% of SWE and SQL tasks when given full information. But with messy specifications (a few details removed or obscured, as is common in real-world scenarios), the best model's performance drops to 24%, even when it has access to a tool it can use to ask for help.

HiL-Bench is a benchmark that measures how well agents can identify when they're missing important information and how well they can ask questions that will actually unblock them. Spoiler alert: not well!

Instead of asking for help when a blocker arose, agents usually guessed overconfidently and wrote incorrect solutions. When they did ask, their questions were rarely targeted or even necessary. But there is still hope: we show that with RL training, agents can get better at asking questions when needed, and that this skill transfers across domains.
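
The thread doesn't spell out the reward shaping used in RL training, so the following is only a guessed form, assuming a weighted mix of the binary task outcome and the Ask-F1 score (the weight alpha and both names are hypothetical):

def shaped_reward(task_passed: bool, ask_f1_score: float, alpha: float = 0.5) -> float:
    # Hypothetical blend: reward task success and good help-seeking jointly,
    # so the policy cannot maximize one by entirely sacrificing the other.
    return (1 - alpha) * float(task_passed) + alpha * ask_f1_score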

Current benchmarks focus only on execution correctness; an agent that would ask can be rewarded the same as an agent that would guess. HiL-Bench is the first to dive deep into problem-solving behavior with a human in the loop and to reward correct ambiguity handling.

Get this paper in your agent:

hf papers read 2604.09408
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
