oguzhanercan's Collections: Finetuning Strategies
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge • arXiv:2507.21183
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE • arXiv:2507.21802
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity • arXiv:2507.21848
Agentic Reinforced Policy Optimization • arXiv:2507.19849
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs • arXiv:2508.16153
DCPO: Dynamic Clipping Policy Optimization • arXiv:2509.02333
Towards a Unified View of Large Language Model Post-Training • arXiv:2509.04419
Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting • arXiv:2509.11452
Reinforcement Learning on Pre-Training Data • arXiv:2509.19249
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense • arXiv:2510.07242
Reinforcing Diffusion Models by Direct Group Preference Optimization • arXiv:2510.08425
Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs • arXiv:2509.25771
Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs • arXiv:2511.07419
Video Generation Models Are Good Latent Reward Models • arXiv:2511.21541
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs • arXiv:2601.08763
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting • arXiv:2601.02151
How Far Can Unsupervised RLVR Scale LLM Training? • arXiv:2603.08660
Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models • arXiv:2603.13985