Collections
Collections including paper arxiv:2403.07974
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
  Paper • 2402.07456 • Published • 46
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
  Paper • 2402.14658 • Published • 84
- AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
  Paper • 2312.13010 • Published • 6
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
  Paper • 2403.07974 • Published • 5

- Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 26
- Evaluating Large Language Models Trained on Code
  Paper • 2107.03374 • Published • 10
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 9
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  Paper • 2102.04664 • Published • 2

- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 11
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
  Paper • 2305.01210 • Published • 3
- AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
  Paper • 2309.06495 • Published • 1
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 40