Update README.md
README.md
CHANGED
@@ -10,7 +10,7 @@ language:

 Links: [Paper 1](https://arxiv.org/abs/2605.21235); [Paper 2]([URL](https://arxiv.org/html/2605.21235v1))

-Special thanks: Thanks to
+Special thanks: Thanks to a certain paper-tutoring agency for their comprehensive guidance; without them, this paper would not exist. (It cost money, but it was truly worth it. Recommended without a second thought!)

 Instead of comparing each generated response only against a group average, LambdaPO learns from fine-grained pairwise reward differences among sampled reasoning trajectories. This helps the model better distinguish high-quality reasoning paths, improve credit assignment, and reduce unstable optimization behavior during RL training.

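The README paragraph in this hunk contrasts a group-average baseline with pairwise reward differences. The sketch below only illustrates those two signals on a toy group of sampled responses; it is not LambdaPO's actual objective, which this diff does not show. The function names and reward values are hypothetical, chosen for illustration.

```python
# Illustrative only: contrasts a group-mean baseline with pairwise reward
# differences. This is NOT the LambdaPO loss; names and values are assumed.

def group_mean_advantages(rewards):
    """Score each response against the group average (one number per response)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def pairwise_reward_differences(rewards):
    """Score each response against every other one: entry [i][j] is r_i - r_j."""
    return [[ri - rj for rj in rewards] for ri in rewards]

# Hypothetical rewards for four sampled reasoning trajectories.
rewards = [1.0, 0.0, 0.5, 0.0]

advantages = group_mean_advantages(rewards)
differences = pairwise_reward_differences(rewards)

# Averaging each row of the pairwise matrix recovers the group-mean advantage,
# so the pairwise view carries at least as much information as the baseline.
row_means = [sum(row) / len(row) for row in differences]

print(advantages)
print(differences[0])
print(row_means)
```

The group-mean advantage is the row average of the pairwise matrix, so the pairwise form keeps the per-pair detail that the average discards. How LambdaPO weights or aggregates those pairs is not stated here.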