Spaces: roshan5emerald/logiflow-rl

roshan5emerald committed 28 days ago

Commit 86f42f7 (verified) · 1 parent: 2cce423

Update HF_MINI_BLOG.md

Files changed (1)

HF_MINI_BLOG.md

@@ -202,7 +202,7 @@ The reward curve below shows the model improving from first steps after the SFT
 warm-up. Because the model already knows JSON format, reward is non-zero from
 step 1 and climbs steadily.
-![Reward Curve](reward_curve.png)
+![Reward Curve](outputs/logiflow-grpo-script/artifacts/reward_curve.png)
 *GRPO training reward over 200 logging steps. After SFT warm-up,
 the model starts producing valid structured actions immediately.*
@@ -223,7 +223,7 @@ solving the cold-start problem.
 ### Before vs After GRPO
-![Before vs After](before_after_comparison.png)
+![Before vs After](artifacts/before_after_comparison.png)
 *Policy comparison across all three task difficulties.
 Green = trained model. Blue = base model. Amber = heuristic baseline.*