Update HF_MINI_BLOG.md
HF_MINI_BLOG.md CHANGED +2 -2
@@ -202,7 +202,7 @@ The reward curve below shows the model improving from first steps after the SFT
 warm-up. Because the model already knows JSON format, reward is non-zero from
 step 1 and climbs steadily.
 
-
+
 *GRPO training reward over 200 logging steps. After SFT warm-up,
 the model starts producing valid structured actions immediately.*
 
@@ -223,7 +223,7 @@ solving the cold-start problem.
 
 ### Before vs After GRPO
 
-
+
 *Policy comparison across all three task difficulties.
 Green = trained model. Blue = base model. Amber = heuristic baseline.*
 