HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
Abstract
Frontier AI agents struggle with judgment calls about when to seek help, leading to poor performance on incomplete or ambiguous tasks despite having sufficient capabilities.
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode: they supply detailed, unambiguous instructions and reward only execution correctness, so an agent that makes a lucky guess about a missing requirement scores identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large, universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when it must decide whether to ask. Failure analysis identifies three patterns of poor help-seeking: overconfident wrong beliefs with no gap detection; detected uncertainty followed by persistent errors; and broad, imprecise escalation without self-correction. The consistency of these patterns indicates that poor help-seeking is a model-level flaw, not a task-specific one. RL training on a shaped Ask-F1 reward shows that judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
Community
Frontier agents can solve up to 89% of SWE and SQL tasks when given full information. But with messy specifications, where a few details are removed or obscured (as is common in real-world scenarios), the best model's performance drops to 24%, even with access to a tool it can use to ask for help.
HiL-Bench is a benchmark that measures how well agents can identify when they're missing important information and how well they can ask questions that will actually unblock them. Spoiler alert: not well!
Instead of asking for help when a blocker arose, agents usually guessed overconfidently and wrote incorrect solutions. When they did try to ask, their questions were rarely targeted or even necessary. But there is still hope: we show that with RL training, agents can get better at asking questions when needed, and that this skill transfers across domains.
Current benchmarks focus only on execution correctness, so an agent that would ask can be rewarded the same as an agent that would guess. HiL-Bench is the first benchmark to examine problem-solving behavior with a human in the loop and to reward correct handling of ambiguity.
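As a rough illustration of how Ask-F1 resists question spam, here is a minimal sketch of the metric as described above: the harmonic mean of question precision (what fraction of asked questions target real blockers) and blocker recall (what fraction of blockers the questions surface). The function and argument names are illustrative, not from the paper; how questions are matched to human-validated blockers is an evaluation detail the paper defines and this sketch only assumes.

```python
def ask_f1(questions_asked: int, questions_on_blockers: int,
           blockers_surfaced: int, total_blockers: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    questions_asked: total clarifying questions the agent asked
    questions_on_blockers: how many of those targeted a real blocker
    blockers_surfaced: how many distinct blockers the questions covered
    total_blockers: human-validated blockers planted in the task
    """
    if questions_asked == 0 or total_blockers == 0:
        return 0.0
    precision = questions_on_blockers / questions_asked
    recall = blockers_surfaced / total_blockers
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Spamming questions cannot game the score: 20 questions that hit
# only 2 real blockers tank precision even at perfect recall.
print(ask_f1(20, 2, 2, 2))  # low score despite full recall
print(ask_f1(2, 2, 2, 2))   # targeted asking scores 1.0
```

The harmonic mean is what does the work here: a high recall bought by indiscriminate asking is dragged down by the resulting low precision, while silent guessing (zero questions) scores zero recall.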