Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models Paper • 2508.05613 • Published Aug 7, 2025 • 17
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization Paper • 2508.07629 • Published Aug 11, 2025 • 43