license: apache-2.0
language:
- en
Model Card for LamPO
LamPO (Lambda Policy Optimization, called LambdaPO in the accompanying paper) is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing scalar group-mean advantage estimation with a pairwise decomposed advantage inspired by learning-to-rank methods such as LambdaRank.
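Special thanks: We thank a paper-tutoring agency for its comprehensive guidance; this paper would not exist without it. (It cost money, but it was well worth it. Highly recommended!)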
Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
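To make the contrast concrete, here is a minimal sketch of the two kinds of advantage for one group of sampled responses. `grpo_advantages` follows the usual GRPO recipe of centering on the group mean and scaling by the group standard deviation. `pairwise_advantages`, its `tanh` squashing, and the `temperature` parameter are illustrative assumptions, not the formula from the paper; they only show what building an advantage from pairwise reward differences might look like.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: compare each reward to the group mean, scaled by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_advantages(rewards, temperature=1.0):
    """Hypothetical pairwise advantage (an assumption, not the paper's definition).

    Each response is compared against every other response in the group, and the
    bounded pairwise terms tanh((r_i - r_j) / temperature) are averaged.
    """
    n = len(rewards)
    if n < 2:
        # A single response has nothing to be compared against.
        return [0.0] * n
    advantages = []
    for i, r_i in enumerate(rewards):
        total = 0.0
        for j, r_j in enumerate(rewards):
            if i != j:
                total += math.tanh((r_i - r_j) / temperature)
        advantages.append(total / (n - 1))
    return advantages


rewards = [1.0, 0.0, 0.0, 0.5]  # made-up rewards for four sampled responses
print(grpo_advantages(rewards))
print(pairwise_advantages(rewards))
```

Because `tanh` is antisymmetric, the pairwise terms cancel across the group, so the sketch stays zero-centered like the GRPO baseline. Note that without the squashing, the plain average of `r_i - r_j` would simply be a rescaled `r_i - mean(rewards)`; any benefit over the group-mean baseline therefore has to come from how the pairwise terms are weighted.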
Key Features
- Pairwise Decomposed Advantage: Uses pairwise comparisons between generated trajectories rather than a single scalar group baseline.
- Critic-Free RL Optimization: Preserves the efficiency of GRPO without requiring a separate value model.
- Semantic Density Reward: Adds dense reasoning supervision using semantic overlap between generated reasoning traces and ground-truth solutions.
- Improved Reasoning Performance: Demonstrates consistent gains on math reasoning and QA benchmarks such as AIME, MATH-500, and GPQA-Diamond.
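The semantic density reward listed above could be sketched as follows. This card does not specify how semantic overlap is measured, so the token-level F1 used here is a simple stand-in, and `tokenize`, `overlap_f1`, `shaped_reward`, and `density_weight` are hypothetical names chosen for illustration. An embedding-based similarity would be an equally plausible reading.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(trace, reference):
    """Token-level F1 between a reasoning trace and a reference solution (stand-in metric)."""
    trace_counts = Counter(tokenize(trace))
    ref_counts = Counter(tokenize(reference))
    # Multiset intersection counts each shared token at most as often as it appears in both.
    shared = sum((trace_counts & ref_counts).values())
    if shared == 0:
        return 0.0
    precision = shared / sum(trace_counts.values())
    recall = shared / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def shaped_reward(is_correct, trace, reference, density_weight=0.2):
    """Sparse correctness reward plus a dense overlap bonus (the weight is an assumed value)."""
    return float(is_correct) + density_weight * overlap_f1(trace, reference)


print(overlap_f1("add 2 and 3 to get 5", "2 plus 3 equals 5"))
print(shaped_reward(True, "add 2 and 3 to get 5", "2 plus 3 equals 5"))
```

The point of the dense term is that two responses with the same final-answer reward can still receive different training signal when one reasoning trace tracks the reference solution more closely than the other.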
Authors
This work is based on the paper:
“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models” (links: Paper 1; Paper 2)
Authors:
- Zhe Yuan — Pinterest
- Yipeng Zhou — Facebook
- Jinghan Li — University of Michigan - Ann Arbor
- Xinyuan Chen — Mississippi State University
- Bowen Deng — Carnegie Mellon University
- Zhiqian Chen — Mississippi State University
- Liang Zhao — Emory University
Corresponding author: Zhiqian Chen — zchen@cse.msstate.edu
Citation
@article{yuan2026lambdapo,
title={LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models},
author={Yuan, Zhe and Zhou, Yipeng and Li, Jinghan and Chen, Xinyuan and Deng, Bowen and Chen, Zhiqian and Zhao, Liang},
year={2026}
}