EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings Paper • 2603.13594 • Published Mar 13 • 148
Article Nemotron 3 Nano - A new Standard for Efficient, Open, and Intelligent Agentic Models Dec 15, 2025 • 111
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 179
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain Paper • 2509.26507 • Published Sep 30, 2025 • 550
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6, 2025 • 513
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? Paper • 2510.02209 • Published Oct 2, 2025 • 57
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Paper • 2506.21506 • Published Jun 26, 2025 • 52
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Paper • 2505.21115 • Published May 27, 2025 • 143
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Paper • 2504.17192 • Published Apr 24, 2025 • 124
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy Paper • 2503.24388 • Published Mar 31, 2025 • 29
Article 🦸🏻#14: What Is MCP, and Why Is Everyone – Suddenly! – Talking About It? Mar 17, 2025 • 355
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Paper • 2502.15007 • Published Feb 20, 2025 • 175
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems Paper • 2502.11098 • Published Feb 16, 2025 • 13
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? Paper • 2502.12115 • Published Feb 17, 2025 • 46