-
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
Paper • 2601.11077 • Published • 67 -
OpenMOSS-Team/ABC-Bench
Viewer • Updated • 224 • 207 • 4 -
OpenMOSS-Team/Qwen3-32B-ABC
Text Generation • 33B • Updated • 43 • 3 -
OpenMOSS-Team/Qwen3-8B-ABC
Text Generation • 8B • Updated • 17 • 3
Collections including paper arxiv:2601.11077
-
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
Paper • 2602.12783 • Published • 216 -
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
Paper • 2602.22638 • Published • 107 -
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
Paper • 2601.22027 • Published • 85 -
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
Paper • 2601.11077 • Published • 67
-
Xtra-Computing/XtraGPT-14B
Text Generation • Updated • 1.26k • 3 -
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
Paper • 2601.11077 • Published • 67 -
Molecular Contrastive Learning with Chemical Element Knowledge Graph
Paper • 2112.00544 • Published • 1 -
Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language Models
Paper • 2404.00884 • Published • 1
-
CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Paper • 2502.05664 • Published • 24 -
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Paper • 2312.13010 • Published • 6 -
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale
Paper • 2409.16299 • Published • 11 -
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
Paper • 2505.19443 • Published • 15
-
Can Large Language Models Understand Context?
Paper • 2402.00858 • Published • 24 -
OLMo: Accelerating the Science of Language Models
Paper • 2402.00838 • Published • 85 -
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 153 -
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
Paper • 2401.17072 • Published • 25
-
RubricBench: Aligning Model-Generated Rubrics with Human Standards
Paper • 2603.01562 • Published • 63 -
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
Paper • 2603.03790 • Published • 121 -
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Paper • 2505.20411 • Published • 96 -
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Paper • 2602.23866 • Published • 88
-
Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Paper • 2602.12036 • Published • 93 -
Reinforcement Learning for Self-Improving Agent with Skill Library
Paper • 2512.17102 • Published • 42 -
Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
Paper • 2512.23705 • Published • 45 -
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Paper • 2512.19995 • Published • 16
-
The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
Paper • 2506.18403 • Published • 3 -
ReCode: Updating Code API Knowledge with Reinforcement Learning
Paper • 2506.20495 • Published • 10 -
SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution
Paper • 2507.23348 • Published • 12 -
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Paper • 2509.09614 • Published • 7
-
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Paper • 2411.03562 • Published • 69 -
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
Paper • 2502.06060 • Published • 38 -
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Paper • 2502.14499 • Published • 195 -
SurveyX: Academic Survey Automation via Large Language Models
Paper • 2502.14776 • Published • 100
-
bigcode/starcoder
Text Generation • 16B • Updated • 10.5k • 2.94k -
mHC: Manifold-Constrained Hyper-Connections
Paper • 2512.24880 • Published • 322 -
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper • 2512.16676 • Published • 222 -
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Paper • 2511.18538 • Published • 304