Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2512.18470

Evaluation Methods & Metrics

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published Mar 2 • 63
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Paper • 2603.03790 • Published Mar 4 • 121
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Paper • 2602.23866 • Published Feb 27 • 88

CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

Paper • 2502.05664 • Published Feb 8, 2025 • 24
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Paper • 2312.13010 • Published Dec 20, 2023 • 6
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Paper • 2409.16299 • Published Sep 9, 2024 • 11
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Paper • 2505.19443 • Published May 26, 2025 • 15

bigcode/starcoder

Text Generation • 16B • Updated Oct 8, 2024 • 10.5k • 2.94k
mHC: Manifold-Constrained Hyper-Connections

Paper • 2512.24880 • Published Dec 31, 2025 • 322
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Paper • 2512.16676 • Published Dec 18, 2025 • 222
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 304

Coding Benchmarks

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 304
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Paper • 2511.05459 • Published Nov 7, 2025 • 4
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Paper • 2512.18470 • Published Dec 20, 2025 • 12
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Paper • 2601.09688 • Published Jan 14 • 127

MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

Paper • 2310.11954 • Published Oct 18, 2023 • 24
Training Chain-of-Thought via Latent-Variable Inference

Paper • 2312.02179 • Published Nov 28, 2023 • 9
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Paper • 2401.16158 • Published Jan 29, 2024 • 20
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts

Paper • 2402.09727 • Published Feb 15, 2024 • 38

Evaluation Methods & Metrics

RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published Mar 2 • 63
T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Paper • 2603.03790 • Published Mar 4 • 121
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Paper • 2505.20411 • Published May 26, 2025 • 96
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Paper • 2602.23866 • Published Feb 27 • 88

Coding Benchmarks

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 304
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Paper • 2511.05459 • Published Nov 7, 2025 • 4
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Paper • 2512.18470 • Published Dec 20, 2025 • 12
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Paper • 2601.09688 • Published Jan 14 • 127

CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

Paper • 2502.05664 • Published Feb 8, 2025 • 24
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Paper • 2312.13010 • Published Dec 20, 2023 • 6
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Paper • 2409.16299 • Published Sep 9, 2024 • 11
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Paper • 2505.19443 • Published May 26, 2025 • 15

MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

Paper • 2310.11954 • Published Oct 18, 2023 • 24
Training Chain-of-Thought via Latent-Variable Inference

Paper • 2312.02179 • Published Nov 28, 2023 • 9
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Paper • 2401.16158 • Published Jan 29, 2024 • 20
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts

Paper • 2402.09727 • Published Feb 15, 2024 • 38

bigcode/starcoder

Text Generation • 16B • Updated Oct 8, 2024 • 10.5k • 2.94k
mHC: Manifold-Constrained Hyper-Connections

Paper • 2512.24880 • Published Dec 31, 2025 • 322
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Paper • 2512.16676 • Published Dec 18, 2025 • 222
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Paper • 2511.18538 • Published Nov 23, 2025 • 304

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs