Rolling Benchmarks - Evaluating AI Agents on Unseen GitHub Repos
Static benchmarks are prone to leaderboard hacking and training data contamination, so how about a dynamic/rolling benchmark?
By restricting evaluation tasks to freshly published code, we could judge agents on their consistency over time using rolling averages, rather than on scores inflated by overfitting to a static benchmark. A rough sketch of that scoring idea is below.
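A minimal sketch of what the scoring could look like, with hypothetical repo names, dates, and a made-up `rolling_score` helper (none of this is from an existing benchmark): tasks are drawn from repos published after a model's training cutoff, and the reported number is a trailing-window average rather than a single static score.

```python
from datetime import date, timedelta

# Hypothetical records: each task comes from a repo with a publish date,
# and "score" is the agent's result on that task.
results = [
    {"repo": "org/new-repo-a", "published": date(2024, 6, 15), "score": 0.62},
    {"repo": "org/new-repo-b", "published": date(2024, 7, 20), "score": 0.58},
    {"repo": "org/new-repo-c", "published": date(2024, 8, 10), "score": 0.66},
]

def rolling_score(results, as_of, window_days=90, cutoff=None):
    """Average score over tasks published in the trailing window,
    optionally excluding anything before the model's training cutoff."""
    start = as_of - timedelta(days=window_days)
    window = [
        r["score"] for r in results
        if start <= r["published"] <= as_of
        and (cutoff is None or r["published"] > cutoff)
    ]
    return sum(window) / len(window) if window else None

# Report the trailing 90-day average, ignoring repos the model could have seen in training.
print(rolling_score(results, as_of=date(2024, 9, 1), cutoff=date(2024, 5, 1)))
```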
Could rolling benchmarks bring agent evaluation closer to how agents are actually used in the real world? Perhaps a new direction for agent evaluation?
Would love to hear what you think about this!
More on reddit: https://www.reddit.com/r/LocalLLaMA/comments/1nmvw7a/rolling_benchmarks_evaluating_ai_agents_on_unseen/