Abstract
Experience replay for large language model post-training trades off staleness-induced variance, sample diversity, and computational cost while maintaining performance and policy entropy.
While experience replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading (and in some cases even improving) final model performance, while preserving policy entropy.
Community
Experience replay can cut LLM RL training compute by up to ~40% (without hurting final accuracy—and sometimes improving it).
Experience replay (reusing past rollouts) is a staple of classical RL, but is still underexplored in LLM post-training—where the default is “stay as on-policy as possible”.
In modern LLM RL pipelines, rollout generation can be >80% of total GPU time. Reusing rollouts even a little can save a lot of compute.
We studied a minimal, easy-to-drop-in replay buffer for async RL:
- inference workers continuously push trajectories into a FIFO buffer
- trainers sample uniformly from the buffer (sampling does not remove items)
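The two operations above fit in a few lines. A minimal sketch in Python (class and method names are illustrative, not from the paper):

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal FIFO replay buffer for async RL post-training (illustrative)."""

    def __init__(self, capacity: int):
        # deque with maxlen gives FIFO eviction: pushing past capacity
        # silently drops the oldest trajectory.
        self.buffer = deque(maxlen=capacity)

    def push(self, trajectory):
        # Called by inference workers as rollouts finish.
        self.buffer.append(trajectory)

    def sample(self, batch_size: int):
        # Uniform sampling with replacement; items are NOT removed,
        # so a trajectory can be reused across many gradient steps.
        return random.choices(list(self.buffer), k=batch_size)
```

The two knobs the paper studies map directly onto this sketch: the buffer capacity, and how often trainers call `sample` relative to how fast workers `push` (the replay ratio).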
Main result: replay can slightly hurt performance per gradient step, but improves performance per unit of compute.
On MATH with Qwen2.5-7B, a well-chosen buffer reaches the same accuracy with up to ~40% less compute.
We also see a “slow-but-stable” effect: larger buffers learn more slowly, but training becomes more stable and can sometimes reach higher peak accuracy.
Replay can also help preserve output diversity → better pass@k for k>1.
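For reference, pass@k here can be computed with the standard unbiased estimator over n sampled completions of which c are correct (this helper is not from the paper, just the usual formula):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn (without replacement) from n samples is correct,
    given c of the n are correct: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```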
Intuition: replay changes the effective training distribution. Mixing in older samples makes it more diverse than a purely on-policy stream, which helps stabilize training.
We also explored extensions beyond uniform replay:
- alternative losses beyond GRPO
- alternative sampling (e.g., biasing toward positive/correct trajectories)
Early results look promising.
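As a sketch of the biased-sampling direction (the field name `is_correct` and the weighting scheme below are assumptions for illustration, not the paper's exact method), one can simply weight draws toward correct trajectories:

```python
import random


def biased_sample(buffer, batch_size, pos_weight=2.0):
    # Illustrative variant of uniform replay: trajectories marked
    # correct are `pos_weight` times more likely to be drawn.
    weights = [pos_weight if traj["is_correct"] else 1.0 for traj in buffer]
    return random.choices(buffer, weights=weights, k=batch_size)
```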
Theory: SGD with replay can converge faster as a function of compute by optimizing the trade-off between:
- expensive rollout generation
- staleness-induced variance
- sample correlations / diversity
The theory connects practical knobs (buffer size and replay ratio) to those costs.
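One way to see the trade-off is with a toy cost model (illustrative only, not the paper's formal result): if generating a rollout costs c_gen and a gradient step costs c_train, reusing each rollout r times amortizes the generation cost.

```python
def compute_per_step(c_gen: float, c_train: float, replay_ratio: float) -> float:
    # Toy cost model (not the paper's formulation): each rollout is
    # reused `replay_ratio` times on average, so the amortized
    # generation cost per gradient step is c_gen / replay_ratio.
    return c_gen / replay_ratio + c_train
```

With generation at 80% of per-step cost (c_gen = 4, c_train = 1), a replay ratio of 2 cuts per-step compute from 5.0 to 3.0, i.e. ~40% less, which is consistent with the headline savings; the price paid is the staleness-induced variance and sample correlation the theory trades against.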