Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2505.20411

Evaluation Methods & Metrics

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published Mar 2 • 63
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Paper • 2603.03790 • Published Mar 4 • 121
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Paper • 2602.23866 • Published Feb 27 • 88

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

Paper • 2505.19253 • Published May 25, 2025 • 34
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Paper • 2505.21497 • Published May 27, 2025 • 109
Agentic Reinforced Policy Optimization

Paper • 2507.19849 • Published Jul 26, 2025 • 161

Research Papers/Reviews/Literature

Daily Research papers and review including older relevant content.

Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Paper • 2501.18585 • Published Jan 30, 2025 • 61
RWKV-7 "Goose" with Expressive Dynamic State Evolution

Paper • 2503.14456 • Published Mar 18, 2025 • 154
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Paper • 2503.15265 • Published Mar 19, 2025 • 46
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Paper • 2503.15558 • Published Mar 18, 2025 • 50

Safety in Large Reasoning Models: A Survey

Paper • 2504.17704 • Published Apr 24, 2025
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Paper • 2503.23803 • Published Mar 31, 2025 • 8
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Paper • 2508.18106 • Published Aug 25, 2025 • 350
Where LLM Agents Fail and How They can Learn From Failures

Paper • 2509.25370 • Published Sep 29, 2025 • 12

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Paper • 2502.08910 • Published Feb 13, 2025 • 150
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Paper • 2502.18890 • Published Feb 26, 2025 • 30
MPO: Boosting LLM Agents with Meta Plan Optimization

Paper • 2503.02682 • Published Mar 4, 2025 • 29
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96

Evaluation Methods & Metrics

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published Mar 2 • 63
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Paper • 2603.03790 • Published Mar 4 • 121
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Paper • 2602.23866 • Published Feb 27 • 88

Safety in Large Reasoning Models: A Survey

Paper • 2504.17704 • Published Apr 24, 2025
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Paper • 2503.23803 • Published Mar 31, 2025 • 8
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Paper • 2508.18106 • Published Aug 25, 2025 • 350
Where LLM Agents Fail and How They can Learn From Failures

Paper • 2509.25370 • Published Sep 29, 2025 • 12

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

Paper • 2505.19253 • Published May 25, 2025 • 34
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Paper • 2505.21497 • Published May 27, 2025 • 109
Agentic Reinforced Policy Optimization

Paper • 2507.19849 • Published Jul 26, 2025 • 161

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

Paper • 2502.08910 • Published Feb 13, 2025 • 150
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Paper • 2502.18890 • Published Feb 26, 2025 • 30
MPO: Boosting LLM Agents with Meta Plan Optimization

Paper • 2503.02682 • Published Mar 4, 2025 • 29
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96

Research Papers/Reviews/Literature

Daily Research papers and review including older relevant content.

Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Paper • 2501.18585 • Published Jan 30, 2025 • 61
RWKV-7 "Goose" with Expressive Dynamic State Evolution

Paper • 2503.14456 • Published Mar 18, 2025 • 154
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning

Paper • 2503.15265 • Published Mar 19, 2025 • 46
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Paper • 2503.15558 • Published Mar 18, 2025 • 50

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs