Collections including paper arxiv:2501.03262

- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
  Paper • 2601.05242 • Published • 230
- On Predictability of Reinforcement Learning Dynamics for Large Language Models
  Paper • 2510.00553 • Published • 9
- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
  Paper • 2501.03262 • Published • 104
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  Paper • 2402.03300 • Published • 145

- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
  Paper • 2501.03262 • Published • 104
- Agentic Entropy-Balanced Policy Optimization
  Paper • 2510.14545 • Published • 108
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
  Paper • 2510.18927 • Published • 85

- Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
  Paper • 2504.12626 • Published • 51
- Qwen3 Technical Report
  Paper • 2505.09388 • Published • 339
- Qwen-Image Technical Report
  Paper • 2508.02324 • Published • 274
- DINOv3
  Paper • 2508.10104 • Published • 305

- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
  Paper • 2501.03262 • Published • 104
- Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
  Paper • 2412.06531 • Published • 72
- The Differences Between Direct Alignment Algorithms are a Blur
  Paper • 2502.01237 • Published • 113
- Process Reinforcement through Implicit Rewards
  Paper • 2502.01456 • Published • 62

- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
  Paper • 2501.03262 • Published • 104
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
  Paper • 2505.24864 • Published • 146
- Reinforcement Learning in Vision: A Survey
  Paper • 2508.08189 • Published • 30
- AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
  Paper • 2508.03100 • Published

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 628
- MiniMax-01: Scaling Foundation Models with Lightning Attention
  Paper • 2501.08313 • Published • 302
- Group Sequence Policy Optimization
  Paper • 2507.18071 • Published • 320
- Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
  Paper • 2509.03867 • Published • 213

- Visual-RFT: Visual Reinforcement Fine-Tuning
  Paper • 2503.01785 • Published • 86
- When an LLM is apprehensive about its answers -- and when its uncertainty is justified
  Paper • 2503.01688 • Published • 22
- Predictive Data Selection: The Data That Predicts Is the Data That Teaches
  Paper • 2503.00808 • Published • 57
- Chain of Draft: Thinking Faster by Writing Less
  Paper • 2502.18600 • Published • 50

- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
  Paper • 2501.03262 • Published • 104
- MiniMax-01: Scaling Foundation Models with Lightning Attention
  Paper • 2501.08313 • Published • 302
- Towards Best Practices for Open Datasets for LLM Training
  Paper • 2501.08365 • Published • 62
- Qwen2.5-1M Technical Report
  Paper • 2501.15383 • Published • 72