Papers
arxiv:2511.07833

MURPHY: Multi-Turn GRPO for Self Correcting Code Generation

Published on Feb 15
Authors:
,
,
,
,
,
,

Abstract

MURPHY enhances multi-turn reasoning in language models by integrating execution feedback into reinforcement learning, improving code generation performance through trajectory-level credit assignment and pruning techniques.

AI-generated summary

Reinforcement Learning with Verifiable Rewards(RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce MURPHY, a multi-turn RLVR framework that incorporates execution feedback directly into training, extending GRPO to optimize over multi-turn trajectories where models iteratively refine solutions. MURPHY combines a feedback conditioned rollout tree with trajectory-level credit assignment, and uses pruning to reduce the cost of multi-turn optimization. Evaluations on code generation benchmarks with two model families show that MURPHY consistently improves multi-iteration performance, achieving up to an 8% absolute gain in pass@1 over compute-matched GRPO baselines, and outperforming the prior leading method that incorporates multi-turn execution feedback.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2511.07833
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2511.07833 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2511.07833 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.