xychen123/LamPO

xychen123 committed 1 day ago

Commit 7ed6696 (verified), 1 parent: 1de0a3f

Update README.md

Files changed (1)

README.md

@@ -10,7 +10,7 @@ language:
 Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2]([URL](https://arxiv.org/html/2605.21235v1))
-Special thanks: we thank 研梦非凡, a paper-tutoring organization, for their comprehensive guidance; without them this paper would not exist. (It cost money, but it was truly worth it. Highly recommended!)
+Special thanks: we thank [研梦非凡](https://www.zhihu.com/org/yan-meng-fei-fan-ren-ren-gong-zhi-neng), a paper-tutoring organization, for their comprehensive guidance; without them this paper would not exist. (It cost money, but it was truly worth it. Highly recommended!)
 Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
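
The contrast drawn in the README text above, a group-average baseline versus pairwise reward differences, can be illustrated with a small sketch. This is not LambdaPO's actual objective, which this diff does not show. The function names and reward values are hypothetical, and the code only shows what information each signal carries for one group of sampled responses.

```python
import numpy as np

def group_mean_advantages(rewards):
    """Group-average baseline: each response is scored against the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def pairwise_reward_differences(rewards):
    """Matrix D with D[i, j] = r_i - r_j for every pair of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return r[:, None] - r[None, :]

# Hypothetical rewards for four sampled reasoning trajectories (assumption).
rewards = [1.0, 0.0, 0.5, 0.0]

adv = group_mean_advantages(rewards)          # one scalar per response
diffs = pairwise_reward_differences(rewards)  # one scalar per ordered pair

# Averaging each row of the pairwise matrix recovers the group-mean advantage,
# so the pairwise view contains the group-average signal plus the per-pair detail.
row_means = diffs.mean(axis=1)
```

Here `adv` reduces each response to a single number, while `diffs` keeps which specific response outscored which, and by how much. How LambdaPO weights or aggregates these pairs is defined in the linked paper, not in this sketch.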