xychen123 committed on
Commit 1de0a3f · verified · 1 Parent(s): 695f4fc

Update README.md

Files changed (1)
  1. README.md +7 -3
README.md CHANGED
@@ -4,9 +4,13 @@ language:
  - en
  ---
 
- # Model Card for LambdaPO
+ # Model Card for LamPO
 
- **LambdaPO (Lambda Policy Optimization)** is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing scalar group-mean advantage estimation with a **pairwise decomposed advantage** inspired by learning-to-rank methods such as LambdaRank.
+ **LamPO (Lambda Policy Optimization)** is a reinforcement learning framework for improving the reasoning capabilities of language models. It extends Group Relative Policy Optimization (GRPO) by replacing scalar group-mean advantage estimation with a **pairwise decomposed advantage** inspired by learning-to-rank methods such as LambdaRank.
+
+ Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2](https://arxiv.org/html/2605.21235v1)
+
+ Special thanks: We thank 研梦非凡, a paper-tutoring agency, for their comprehensive guidance; without them this paper would not exist. (It cost money, but it was truly worth it; wholeheartedly recommended!)
 
  Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
 
@@ -21,7 +25,7 @@ Instead of comparing each generated response only against a group average, Lambd
 
  This work is based on the paper:
 
- **“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models”**
+ **“LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models”** (Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2](https://arxiv.org/html/2605.21235v1))
 
  Authors:
 