-
Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Paper • 2603.04597 • Published • 210 -
SII-Enigma/Llama3.2-8B-Ins-AMPO
Text Generation • 8B • Updated • 48 -
Understanding R1-Zero-Like Training: A Critical Perspective
Paper • 2503.20783 • Published • 59 -
Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs
Paper • 2509.25779 • Published • 19
Collections
Discover the best community collections!
Collections including paper arxiv:2503.20783
-
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 125 -
Understanding R1-Zero-Like Training: A Critical Perspective
Paper • 2503.20783 • Published • 59 -
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Paper • 2508.08221 • Published • 50 -
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 11
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 32 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 110 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 27 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 107
-
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Paper • 2602.10693 • Published • 220 -
Reinforced Attention Learning
Paper • 2602.04884 • Published • 29 -
Learning to Reason in 13 Parameters
Paper • 2602.04118 • Published • 6 -
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
Paper • 2405.17604 • Published • 3
-
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper • 2503.14476 • Published • 146 -
Training language models to follow instructions with human feedback
Paper • 2203.02155 • Published • 24 -
Llama 2: Open Foundation and Fine-Tuned Chat Models
Paper • 2307.09288 • Published • 251 -
The Llama 3 Herd of Models
Paper • 2407.21783 • Published • 118
-
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Paper • 2504.05118 • Published • 26 -
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
Paper • 2504.04718 • Published • 43 -
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Paper • 2504.03561 • Published • 18 -
Concept Lancet: Image Editing with Compositional Representation Transplant
Paper • 2504.02828 • Published • 16
-
Understanding R1-Zero-Like Training: A Critical Perspective
Paper • 2503.20783 • Published • 59 -
sail/Qwen2.5-Math-7B-Oat-Zero
Text Generation • 8B • Updated • 197 • • 6 -
sail/Qwen2.5-Math-1.5B-Oat-Zero
Text Generation • 2B • Updated • 405 • • 4 -
sail/Llama-3.2-3B-Oat-Zero
Text Generation • 3B • Updated • 12 • 1
-
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 39 -
Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 46 -
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper • 2412.20993 • Published • 36 -
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Paper • 2412.17256 • Published • 47
-
Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Paper • 2603.04597 • Published • 210 -
SII-Enigma/Llama3.2-8B-Ins-AMPO
Text Generation • 8B • Updated • 48 -
Understanding R1-Zero-Like Training: A Critical Perspective
Paper • 2503.20783 • Published • 59 -
Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs
Paper • 2509.25779 • Published • 19
-
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Paper • 2602.10693 • Published • 220 -
Reinforced Attention Learning
Paper • 2602.04884 • Published • 29 -
Learning to Reason in 13 Parameters
Paper • 2602.04118 • Published • 6 -
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
Paper • 2405.17604 • Published • 3
-
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 125 -
Understanding R1-Zero-Like Training: A Critical Perspective
Paper • 2503.20783 • Published • 59 -
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Paper • 2508.08221 • Published • 50 -
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 11
-
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper • 2503.14476 • Published • 146 -
Training language models to follow instructions with human feedback
Paper • 2203.02155 • Published • 24 -
Llama 2: Open Foundation and Fine-Tuned Chat Models
Paper • 2307.09288 • Published • 251 -
The Llama 3 Herd of Models
Paper • 2407.21783 • Published • 118
-
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Paper • 2504.05118 • Published • 26 -
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
Paper • 2504.04718 • Published • 43 -
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Paper • 2504.03561 • Published • 18 -
Concept Lancet: Image Editing with Compositional Representation Transplant
Paper • 2504.02828 • Published • 16
-
Understanding R1-Zero-Like Training: A Critical Perspective
Paper • 2503.20783 • Published • 59 -
sail/Qwen2.5-Math-7B-Oat-Zero
Text Generation • 8B • Updated • 197 • • 6 -
sail/Qwen2.5-Math-1.5B-Oat-Zero
Text Generation • 2B • Updated • 405 • • 4 -
sail/Llama-3.2-3B-Oat-Zero
Text Generation • 3B • Updated • 12 • 1
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 32 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 110 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 27 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 107
-
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 39 -
Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 46 -
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper • 2412.20993 • Published • 36 -
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Paper • 2412.17256 • Published • 47