- ClawBench: Can AI Agents Complete Everyday Online Tasks?
  Paper • 2604.08523 • Published • 246
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
  Paper • 2604.06132 • Published • 111
- FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
  Paper • 2604.07413 • Published • 75
- GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers
  Paper • 2604.02648 • Published • 44
bogeumkim
AI & ML interests
NLP
Recent Activity
updated a collection about 6 hours ago: eval-papers-collection
updated a collection about 6 hours ago: eval-papers-collection
updated a collection about 6 hours ago: eval-papers-collection