Spaces: InosLihka/rhythm_env
InosLihka committed 12 days ago

Commit 4dd50e0

1 parent: 8227b63

README: embed reward curve and belief-accuracy curve plots

Files changed (2)

README.md CHANGED

@@ -45,7 +45,11 @@ Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.62
 ![SFT v3 loss](plots/sft_v3_training_loss.png)
-The bar comparison (random vs heuristic vs distilled student) is in the **Headline result** table above. Numbers source: `eval_results.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
 ## Why a Life Simulator?

 ![SFT v3 loss](plots/sft_v3_training_loss.png)
+**Reward curve** — mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.
+![Reward curve](plots/grpo_iter2_reward_curve.png)
+**Baseline vs trained** comparison is in the Headline result table above. Numbers source: `eval_results.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
 ## Why a Life Simulator?

plots/grpo_iter2_reward_curve.png ADDED