siddeshwar-kagatikar committed
Commit 9e3e5ff · 1 Parent(s): 72cba6b

Fix LaTeX math rendering in blog (use $$ and $ delimiters)

Files changed (1):
  1. blog.md +13 -13
blog.md CHANGED
@@ -135,20 +135,20 @@ In the self-play setting, the generator and solver have different reward functions.
 
 For the **generator**, the training reward is a weighted objective over four core terms:
 
-\[
+$$
 R_{\text{gen}} =
-w_v R_{\text{validity}} +
-w_h R_{\text{hardness}} +
-w_d R_{\text{diversity}} +
-w_c R_{\text{consistency}}
-\]
+w_v\, R_{\text{validity}} +
+w_h\, R_{\text{hardness}} +
+w_d\, R_{\text{diversity}} +
+w_c\, R_{\text{consistency}}
+$$
 
 where:
 
-- \(R_{\text{validity}}\) checks that the proposed task is well-formed and bounded
-- \(R_{\text{hardness}}\) is higher when the frozen solver fails the generated task
-- \(R_{\text{diversity}}\) penalizes near-duplicate generations
-- \(R_{\text{consistency}}\) rewards graph-grounded, replayable tasks
+- $R_{\text{validity}}$ checks that the proposed task is well-formed and bounded
+- $R_{\text{hardness}}$ is higher when the frozen solver fails the generated task
+- $R_{\text{diversity}}$ penalizes near-duplicate generations
+- $R_{\text{consistency}}$ rewards graph-grounded, replayable tasks
 
 In `swarm_v2`, this goes one step further: invalid or non-replayable generations are hard-gated by validation, and the reward then emphasizes replayability, hardness, swarm diversity, and shared-context pressure. This is what keeps the generator from gaming training by emitting flashy but unusable tasks.
 
@@ -156,9 +156,9 @@ For the **solver**, the training reward reuses the environment-native answer reward.
 
 The PARL-style orchestration term follows the project’s Kimi-inspired formulation:
 
-\[
-r_{\text{PARL}} = r_{\text{perf}} + \lambda_1 r_{\text{parallel}} + \lambda_2 r_{\text{finish}} + r_{\text{latency}}
-\]
+$$
+r_{\text{PARL}} = r_{\text{perf}} + \lambda_1\, r_{\text{parallel}} + \lambda_2\, r_{\text{finish}} + r_{\text{latency}}
+$$
 
 Therefore, the final reward has the following components:
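For readers following the rendered blog, a minimal Python sketch of how a weighted generator reward like $R_{\text{gen}}$ can be combined, including the hard-gating of invalid or non-replayable generations that the post describes for `swarm_v2`. The weight values, dataclass, and function names here are illustrative assumptions, not the repository's actual API.

```python
# Illustrative sketch only: weights and helper names are assumptions,
# not the actual swarm_v2 implementation.
from dataclasses import dataclass


@dataclass
class GenRewardWeights:
    validity: float = 1.0     # w_v: task is well-formed and bounded
    hardness: float = 1.0     # w_h: higher when the frozen solver fails
    diversity: float = 0.5    # w_d: penalize near-duplicate generations
    consistency: float = 0.5  # w_c: graph-grounded, replayable tasks


def generator_reward(r_validity: float, r_hardness: float,
                     r_diversity: float, r_consistency: float,
                     is_replayable: bool,
                     w: GenRewardWeights = GenRewardWeights()) -> float:
    """R_gen = w_v*R_validity + w_h*R_hardness + w_d*R_diversity + w_c*R_consistency.

    Invalid or non-replayable generations are hard-gated to zero reward,
    mirroring the validation gate described in the blog.
    """
    if not is_replayable:
        return 0.0
    return (w.validity * r_validity
            + w.hardness * r_hardness
            + w.diversity * r_diversity
            + w.consistency * r_consistency)
```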
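Similarly, the PARL-style orchestration term is a plain linear combination of its four components. A hedged sketch, with placeholder $\lambda_1$, $\lambda_2$ values since the blog does not state them:

```python
def parl_reward(r_perf: float, r_parallel: float, r_finish: float,
                r_latency: float,
                lam1: float = 0.1, lam2: float = 0.1) -> float:
    """r_PARL = r_perf + lam1 * r_parallel + lam2 * r_finish + r_latency.

    lam1 and lam2 are illustrative placeholders, not values from the project.
    """
    return r_perf + lam1 * r_parallel + lam2 * r_finish + r_latency
```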