	---
	license: apache-2.0
	language:
	- en
	---

	# Model Card for LamPO

	LamPO (Lambda Policy Optimization; called LambdaPO in the accompanying paper) is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing the scalar group-mean advantage estimate with a pairwise decomposed advantage inspired by learning-to-rank methods such as LambdaRank.

	Links: [Paper (arXiv abstract)](https://arxiv.org/abs/2605.21235); [Paper (HTML version)](https://arxiv.org/html/2605.21235v1)

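	Special thanks: We thank a paper-coaching agency for its comprehensive guidance; without it, this paper would not exist. (It cost money, but it was well worth it, and we recommend it without reservation.)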

	Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
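The contrast between a group-mean baseline and pairwise comparisons can be illustrated with a minimal sketch. This is not the paper's formulation: the function names, the `weight` parameter, and the example rewards are assumptions made for illustration, and the standard-deviation normalization commonly used in GRPO is omitted for brevity.

```python
from statistics import mean

def group_mean_advantages(rewards):
    """Group-mean baseline: each reward minus the group average (no std normalization)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def pairwise_advantages(rewards, weight=lambda ri, rj: 1.0):
    """Hypothetical pairwise form: average weighted reward gap to every other sample.

    `weight` is an assumed hook for a per-pair importance term; it is not
    taken from the paper.
    """
    n = len(rewards)
    advantages = []
    for i, ri in enumerate(rewards):
        total = sum(
            weight(ri, rj) * (ri - rj)
            for j, rj in enumerate(rewards)
            if j != i
        )
        advantages.append(total / (n - 1))
    return advantages

# Illustrative rewards for a group of four sampled trajectories.
rewards = [1.0, 0.0, 0.0, 0.5]
baseline_adv = group_mean_advantages(rewards)
uniform_adv = pairwise_advantages(rewards)
# A hypothetical non-uniform weight that emphasizes pairs with large reward gaps.
weighted_adv = pairwise_advantages(rewards, weight=lambda ri, rj: abs(ri - rj))
```

With uniform weights, the pairwise form is just the group-mean advantage scaled by `n / (n - 1)`, so any difference from the scalar baseline comes entirely from how individual pairs are weighted.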

	## Key Features

	- Pairwise Decomposed Advantage: Uses pairwise comparisons between generated trajectories rather than a single scalar group baseline.
	- Critic-Free RL Optimization: Preserves the efficiency of GRPO without requiring a separate value model.
	- Semantic Density Reward: Adds dense reasoning supervision using semantic overlap between generated reasoning traces and ground-truth solutions.
	- Improved Reasoning Performance: Demonstrates consistent gains on math reasoning and QA benchmarks such as AIME, MATH-500, and GPQA-Diamond.
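As a rough illustration of the dense-reward idea in the list above, the sketch below adds a small overlap bonus to a binary outcome reward. It uses Jaccard overlap of token sets as a crude stand-in for semantic similarity; the actual overlap measure, the function names, and the mixing weight `beta` are assumptions and are not taken from the paper.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_reward(trace, reference):
    """Jaccard overlap of token sets, in [0, 1]; a stand-in for semantic overlap."""
    a, b = tokenize(trace), tokenize(reference)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def shaped_reward(correct, trace, reference, beta=0.1):
    """Outcome reward plus a small dense bonus; `beta` is a hypothetical mixing weight."""
    return float(correct) + beta * overlap_reward(trace, reference)
```

Keeping `beta` small means the outcome reward still dominates, while two incorrect trajectories can be told apart by how closely their reasoning tracks the reference solution.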

	## Authors

	This work is based on the paper:

	“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models” ([arXiv abstract](https://arxiv.org/abs/2605.21235); [HTML version](https://arxiv.org/html/2605.21235v1))

	Authors:

	- Zhe Yuan — Pinterest
	- Yipeng Zhou — Facebook
	- Jinghan Li — University of Michigan - Ann Arbor
	- Xinyuan Chen — Mississippi State University
	- Bowen Deng — Carnegie Mellon University
	- Zhiqian Chen — Mississippi State University
	- Liang Zhao — Emory University

	Corresponding author: Zhiqian Chen — zchen@cse.msstate.edu

	## Citation

	```bibtex
	@article{yuan2026lambdapo,
	  title={LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models},
	  author={Yuan, Zhe and Zhou, Yipeng and Li, Jinghan and Chen, Xinyuan and Deng, Bowen and Chen, Zhiqian and Zhao, Liang},
	  journal={arXiv preprint arXiv:2605.21235},
	  year={2026}
	}
	```