xychen123/LamPO

+---
+# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+# Doc / guide: https://huggingface.co/docs/hub/model-cards
+{}
+---
+# Model Card for LambdaPO
+**LambdaPO (Lambda Policy Optimization)** is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing scalar group-mean advantage estimation with a **pairwise decomposed advantage** inspired by learning-to-rank methods such as LambdaRank.
+Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among the reasoning trajectories sampled for the same prompt. This helps the model distinguish high-quality reasoning paths, improves credit assignment, and reduces optimization instability during RL training.
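To make the contrast concrete, the following is a minimal sketch, not the paper's exact formulation. It compares a GRPO-style group-normalized advantage with one possible pairwise advantage that averages a squashed reward gap over all other samples in the group. The function names, the `tanh` squashing, and the temperature `tau` are assumptions introduced here for illustration only.

```python
import math
from typing import List, Sequence


def group_mean_advantage(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: reward minus the group mean, divided by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def pairwise_advantage(rewards: Sequence[float], tau: float = 1.0) -> List[float]:
    """Hypothetical pairwise advantage (an assumption, not the paper's formula).

    Each sample i is compared against every other sample j in the group, and the
    squashed reward gaps tanh((r_i - r_j) / tau) are averaged over the n - 1 others.
    """
    n = len(rewards)
    if n < 2:
        return [0.0] * n
    # The j == i term contributes tanh(0) = 0, so it can stay in the sum.
    return [
        sum(math.tanh((ri - rj) / tau) for rj in rewards) / (n - 1)
        for ri in rewards
    ]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.0]  # one correct trajectory out of four
    print(group_mean_advantage(rewards))
    print(pairwise_advantage(rewards))
```

If the squashing function were the identity, the pairwise average would equal n/(n-1) times the mean-centered reward, which is a rescaled group-mean advantage. Any benefit of a pairwise scheme therefore has to come from how individual pairs are weighted or transformed, which is the part this sketch only approximates.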
+## Key Features
+- **Pairwise Decomposed Advantage**: Uses pairwise comparisons between generated trajectories rather than a single scalar group baseline.
+- **Critic-Free RL Optimization**: Preserves the efficiency of GRPO without requiring a separate value model.
+- **Semantic Density Reward**: Adds dense reasoning supervision using semantic overlap between generated reasoning traces and ground-truth solutions.
+- **Improved Reasoning Performance**: The paper reports consistent gains on math reasoning and QA benchmarks such as AIME, MATH-500, and GPQA-Diamond.
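As a rough illustration of the dense reward idea in the list above, the sketch below scores a generated reasoning trace by its token-level F1 overlap with a reference solution. This is a lexical stand-in chosen for simplicity. The card does not specify how semantic overlap is computed, so the function name `overlap_reward`, the whitespace tokenization, and the F1 formulation are all assumptions.

```python
from collections import Counter


def overlap_reward(trace: str, reference: str) -> float:
    """Token-level F1 between a reasoning trace and a reference solution.

    A hypothetical proxy for a semantic density reward: returns a value in
    [0, 1], where 1 means the two token multisets are identical.
    """
    pred_tokens = trace.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "x = 2 so x squared is 4"
    print(overlap_reward("x = 2 so x squared is 4", reference))
    print(overlap_reward("the answer is unknown", reference))
```

A reward like this gives a nonzero signal to partially correct traces, which is the sense in which it is dense compared with a binary final-answer check. An embedding-based similarity could be substituted for the lexical overlap without changing the interface.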
+## Authors
+This work is based on the paper:
+**“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models”**
+Authors:
+- Zhe Yuan — Pinterest
+- Yipeng Zhou — Facebook
+- Jinghan Li — University of Michigan - Ann Arbor
+- Xinyuan Chen — Mississippi State University
+- Bowen Deng — Carnegie Mellon University
+- Zhiqian Chen — Mississippi State University
+- Liang Zhao — Emory University
+Corresponding author: **Zhiqian Chen** — zchen@cse.msstate.edu
+## Citation
+```bibtex
+@article{yuan2026lambdapo,
+  title={LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models},
+  author={Yuan, Zhe and Zhou, Yipeng and Li, Jinghan and Chen, Xinyuan and Deng, Bowen and Chen, Zhiqian and Zhao, Liang},
+  year={2026}
+}
+```