Papers
arxiv:2604.01212

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Published on Apr 1
Abstract

Large language model agents face challenges in maintaining strategic coherence over extended tasks, as demonstrated by a startup simulation benchmark that reveals key failure modes and performance gaps.

AI-generated summary

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce YC-Bench, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of $200K, with Claude Opus 4.6 achieving the highest average final funds at $1.27M, followed by GLM-5 at $1.21M at 11× lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for 47% of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. YC-Bench is open-source, reproducible, and configurable.
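The benchmark dynamics described above (starting capital, growing payroll, bankruptcy, and a scratchpad as the only state that survives context truncation) can be sketched as a minimal turn loop. All names below (Startup, run_episode, the decision tuples) are illustrative assumptions, not the benchmark's actual API:

```python
# Hypothetical sketch of a YC-Bench-style turn loop. Only the figures
# from the abstract (the $200K starting capital, payroll that compounds,
# bankruptcy on negative funds) come from the paper; everything else is
# an assumed simplification.
from dataclasses import dataclass

@dataclass
class Startup:
    funds: float = 200_000.0   # starting capital, per the paper
    payroll: float = 0.0       # recurring cost that grows with each hire
    scratchpad: str = ""       # sole state persisted across context truncation

    def step(self, contract_value: float, hire_cost: float) -> float:
        """One turn: hire (raising recurring payroll), collect contract
        revenue, then pay the full payroll. Returns remaining funds."""
        self.payroll += hire_cost
        self.funds += contract_value - self.payroll
        return self.funds

def run_episode(startup: Startup, decisions) -> bool:
    """Run turns until the horizon ends or funds go negative (bankruptcy)."""
    for contract_value, hire_cost in decisions:
        if startup.step(contract_value, hire_cost) < 0:
            return False  # bankrupt: early mistakes compounded
    return True
```

Note how payroll, once raised, is charged every subsequent turn: this is the compounding-consequence structure the abstract attributes to poor early decisions.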


Get this paper in your agent:

hf papers read 2604.01212
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
