Papers
arxiv:2604.08706

Efficient RL Training for LLMs with Experience Replay

Published on Apr 9 · Submitted by Vivien Cabannes on Apr 14

Abstract

Experience replay techniques for large language model post-training can balance staleness-induced variance, sample diversity, and computational cost while maintaining performance and policy entropy.

AI-generated summary

While Experience Replay, the practice of storing rollouts and reusing them multiple times during training, is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance (and in some cases even improving it), while preserving policy entropy.

Community

Paper submitter

Experience replay can cut LLM RL training compute by up to ~40%, without hurting final accuracy (and sometimes improving it).

Experience replay (reusing past rollouts) is a staple of classical RL, but it is still underexplored in LLM post-training, where the default is to stay as on-policy as possible.
In modern LLM RL pipelines, rollout generation can be >80% of total GPU time. Reusing rollouts even a little can save a lot of compute.

We studied a minimal, easy-to-drop-in replay buffer for async RL:

  • inference workers continuously push trajectories into a FIFO buffer
  • trainers sample uniformly from the buffer (sampling doesn’t remove items)
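The two operations above fit in a few lines. This is a minimal sketch, not the paper's implementation; the class name, `max_size` knob, and dict-shaped trajectories are illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer: once full, pushes evict the oldest
    rollout; sampling is uniform and does not remove items, so a
    trajectory can be reused across many gradient steps."""

    def __init__(self, max_size: int):
        self.buffer = deque(maxlen=max_size)  # deque handles FIFO eviction

    def push(self, trajectory: dict) -> None:
        # Inference workers call this as rollouts complete.
        self.buffer.append(trajectory)

    def sample(self, batch_size: int) -> list:
        # Trainers sample uniformly without replacement within a batch;
        # items stay in the buffer for future batches.
        return random.sample(list(self.buffer), k=min(batch_size, len(self.buffer)))


buf = ReplayBuffer(max_size=4)
for i in range(6):
    buf.push({"rollout_id": i})  # rollouts 0 and 1 get evicted
batch = buf.sample(2)
```

In a real async setup the pushes and samples would come from separate processes, so the buffer would need a lock or a queue in front of it.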

Main result: replay can slightly hurt performance per gradient step, but improves performance per unit of compute.
On MATH with Qwen2.5-7B, a well-chosen buffer reaches the same accuracy with up to ~40% less compute.

We also see a “slow-but-stable” effect: larger buffers learn more slowly, but training becomes more stable and can sometimes reach higher peak accuracy.
Replay can also help preserve output diversity → better pass@k for k>1.

Intuition: replay changes the effective training distribution. Mixing in older samples makes that distribution more diverse over time than under purely on-policy training, which helps stabilize learning.

We also explored extensions beyond uniform replay:

  • alternative losses beyond GRPO
  • alternative sampling (e.g., biasing toward positive/correct trajectories)
    Early results look promising.
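One simple form the biased-sampling extension could take is reward-weighted sampling. This sketch is an assumption, not the paper's method: the `reward` field, the correctness threshold, and the `pos_weight` knob are all illustrative.

```python
import random


def biased_sample(buffer: list, batch_size: int, pos_weight: float = 3.0) -> list:
    """Sample with replacement, upweighting trajectories marked correct.
    Here reward == 1.0 means "correct" (an assumption); such trajectories
    are pos_weight times more likely to be drawn than the rest.
    """
    weights = [pos_weight if t["reward"] == 1.0 else 1.0 for t in buffer]
    return random.choices(buffer, weights=weights, k=batch_size)


# Toy buffer: every fourth trajectory is correct.
buffer = [{"id": i, "reward": 1.0 if i % 4 == 0 else 0.0} for i in range(100)]
batch = biased_sample(buffer, batch_size=32)
```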

Theory: SGD with replay can converge faster as a function of compute by optimizing the trade-off between:

  • expensive rollout generation
  • staleness-induced variance
  • sample correlations / diversity
    It connects practical knobs (buffer size and replay ratio) to those costs.
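A back-of-the-envelope version of this trade-off: if generation is ~80% of GPU time (the figure from the post above) and each rollout is reused r times on average, generation cost amortizes by 1/r while trainer cost is unchanged. The function below is an illustrative cost model, not the paper's analysis, and it ignores the staleness and correlation penalties that the theory trades off against.

```python
def relative_compute(replay_ratio: float, gen_share: float = 0.8) -> float:
    """Relative GPU cost per gradient step versus strict on-policy training,
    when each rollout is reused `replay_ratio` times on average.
    Generation cost amortizes over reuses; training cost does not."""
    train_share = 1.0 - gen_share
    return gen_share / replay_ratio + train_share


# Reusing each rollout twice drops total compute to roughly 0.6x
# of on-policy, i.e. the ~40% savings quoted above.
print(relative_compute(2.0))
```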


Get this paper in your agent:

hf papers read 2604.08706
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
