Synthetic Sandbox for Training Machine Learning Engineering Agents Paper • 2604.04872 • Published 7 days ago • 14
Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents Paper • 2601.18217 • Published Jan 26 • 13
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Paper • 2511.21662 • Published Nov 26, 2025 • 11
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Paper • 2511.21662 • Published Nov 26, 2025 • 11
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought Paper • 2511.02779 • Published Nov 4, 2025 • 60
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Paper • 2511.01163 • Published Nov 3, 2025 • 32
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Paper • 2511.01163 • Published Nov 3, 2025 • 32