CocoaBench: Evaluating Unified Digital Agents in the Wild Paper • 2604.11201 • Published 2 days ago • 28
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 39
BigCodeArena: Judging code generations end to end with code executions Article • Oct 7, 2025 • 22
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models Paper • 2411.08733 • Published Nov 13, 2024 • 1
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective Paper • 2506.14965 • Published Jun 17, 2025 • 50
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs Paper • 2502.19411 • Published Feb 26, 2025 • 2
Decentralized Arena Leaderboard 🥇 26 • View and compare LLM evaluations across various domains
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models Paper • 2404.05221 • Published Apr 8, 2024 • 1