- RubricBench: Aligning Model-Generated Rubrics with Human Standards
  Paper • 2603.01562 • Published • 63
- T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
  Paper • 2603.03790 • Published • 121
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
  Paper • 2505.20411 • Published • 96
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
  Paper • 2602.23866 • Published • 88
Collections
Collections including paper arxiv:2505.20411
- DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
  Paper • 2505.19253 • Published • 34
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
  Paper • 2505.20411 • Published • 96
- Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
  Paper • 2505.21497 • Published • 109
- Agentic Reinforced Policy Optimization
  Paper • 2507.19849 • Published • 161

- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
  Paper • 2501.18585 • Published • 61
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
  Paper • 2503.14456 • Published • 154
- DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
  Paper • 2503.15265 • Published • 46
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  Paper • 2503.15558 • Published • 50

- Safety in Large Reasoning Models: A Survey
  Paper • 2504.17704 • Published
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute
  Paper • 2503.23803 • Published • 8
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
  Paper • 2508.18106 • Published • 350
- Where LLM Agents Fail and How They can Learn From Failures
  Paper • 2509.25370 • Published • 12

- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
  Paper • 2502.08910 • Published • 150
- From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
  Paper • 2502.18890 • Published • 30
- MPO: Boosting LLM Agents with Meta Plan Optimization
  Paper • 2503.02682 • Published • 29
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
  Paper • 2505.20411 • Published • 96