siddeshwar-kagatikar committed
Commit · 9e3e5ff
Parent(s): 72cba6b

Fix LaTeX math rendering in blog (use $$ and $ delimiters)

blog.md (changed):
In the self-play setting, the generator and solver have different reward functions.
For the **generator**, the training reward is a weighted objective over four core terms:

$$
R_{\text{gen}} = w_v\, R_{\text{validity}} + w_h\, R_{\text{hardness}} + w_d\, R_{\text{diversity}} + w_c\, R_{\text{consistency}}
$$

where:

- $R_{\text{validity}}$ checks that the proposed task is well-formed and bounded
- $R_{\text{hardness}}$ is higher when the frozen solver fails the generated task
- $R_{\text{diversity}}$ penalizes near-duplicate generations
- $R_{\text{consistency}}$ rewards graph-grounded, replayable tasks

In `swarm_v2`, this goes one step further: invalid or non-replayable generations are hard-gated by validation, and the reward then emphasizes replayability, hardness, swarm diversity, and shared-context pressure. This is what keeps the generator from gaming training by emitting flashy but unusable tasks.
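The gated, weighted reward described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, the weight values, and the `replayable` flag are all assumptions.

```python
def generator_reward(terms, weights, replayable=True):
    """Sketch of R_gen = w_v*R_validity + w_h*R_hardness + w_d*R_diversity
    + w_c*R_consistency, with a swarm_v2-style hard gate: invalid or
    non-replayable generations earn zero reward outright."""
    if not replayable or terms["validity"] <= 0.0:
        # Hard gate: unusable tasks cannot trade validity off against hardness.
        return 0.0
    return sum(weights[k] * terms[k]
               for k in ("validity", "hardness", "diversity", "consistency"))


# Illustrative weights and term scores (assumed values, not from the project):
weights = {"validity": 0.25, "hardness": 0.35, "diversity": 0.20, "consistency": 0.20}
terms = {"validity": 1.0, "hardness": 0.8, "diversity": 0.6, "consistency": 1.0}

reward = generator_reward(terms, weights)                        # weighted sum, ~0.85
gated = generator_reward(terms, weights, replayable=False)       # hard-gated to 0.0
```

The gate-before-sum ordering is the point: a flashy but non-replayable task scores zero no matter how hard or diverse it is.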

For the **solver**, the training reward reuses the environment-native answer reward.

The PARL-style orchestration term follows the project’s Kimi-inspired formulation:

$$
r_{\text{PARL}} = r_{\text{perf}} + \lambda_1\, r_{\text{parallel}} + \lambda_2\, r_{\text{finish}} + r_{\text{latency}}
$$
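A one-line sketch of this orchestration term, under the same caveat: the $\lambda$ values and the convention that $r_{\text{latency}}$ enters as a negative penalty are assumptions for illustration, not specified by the formula above.

```python
def parl_reward(r_perf, r_parallel, r_finish, r_latency, lam1=0.2, lam2=0.3):
    """Sketch of r_PARL = r_perf + λ1·r_parallel + λ2·r_finish + r_latency.
    λ1 and λ2 trade parallelism and completion bonuses off against raw task
    performance; r_latency is modeled here as a signed (negative) penalty."""
    return r_perf + lam1 * r_parallel + lam2 * r_finish + r_latency


# Example: correct answer, some parallel tool use, finished, small latency penalty.
r = parl_reward(r_perf=1.0, r_parallel=0.5, r_finish=1.0, r_latency=-0.2)
```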

Therefore, the final reward has the following components: