xychen123 committed on
Commit 7ed6696 · verified · 1 Parent(s): 1de0a3f

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -10,7 +10,7 @@ language:
 
  Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2]([URL](https://arxiv.org/html/2605.21235v1))
 
- Special thanks: we thank 研梦非凡, a paper-coaching organization, for their thorough guidance; without them this paper would not exist. (It cost money, but it was well worth it. A no-brainer recommendation!)
+ Special thanks: we thank [研梦非凡](https://www.zhihu.com/org/yan-meng-fei-fan-ren-ren-gong-zhi-neng), a paper-coaching organization, for their thorough guidance; without them this paper would not exist. (It cost money, but it was well worth it. A no-brainer recommendation!)
 
  Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.
 
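
The README paragraph in this hunk contrasts scoring each response against a group average with using pairwise reward differences among sampled trajectories. The sketch below only illustrates those two signals on a toy group of rewards. It is not the LambdaPO objective: the function names and the reward values are hypothetical, and how the pairwise differences are weighted or turned into a loss is not stated in this diff.

```python
from itertools import combinations


def group_mean_advantages(rewards):
    """Group-average view: each reward minus the mean of its group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def pairwise_reward_differences(rewards):
    """Pairwise view: one signed difference r_i - r_j for every pair i < j."""
    return {
        (i, j): rewards[i] - rewards[j]
        for i, j in combinations(range(len(rewards)), 2)
    }


# Hypothetical rewards for four sampled reasoning trajectories (assumption).
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_mean_advantages(rewards)
differences = pairwise_reward_differences(rewards)
```

In this toy group, trajectories 1 and 2 both receive zero advantage under the group-average view, while the pairwise view still records that each is 0.5 below trajectory 0 and 0.5 above trajectory 3.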