InosLihka committed
Commit 4dd50e0 · 1 Parent(s): 8227b63

README: embed reward curve and belief-accuracy curve plots

Files changed (2)
  1. README.md +5 -1
  2. plots/grpo_iter2_reward_curve.png +0 -0
README.md CHANGED
@@ -45,7 +45,11 @@ Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.62
 
 ![SFT v3 loss](plots/sft_v3_training_loss.png)
 
-The bar comparison (random vs heuristic vs distilled student) is in the **Headline result** table above. Numbers source: `eval_results.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
+**Reward curve** mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.
+
+![Reward curve](plots/grpo_iter2_reward_curve.png)
+
+**Baseline vs trained** comparison is in the Headline result table above. Numbers source: `eval_results.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
 
 ## Why a Life Simulator?
 
plots/grpo_iter2_reward_curve.png ADDED
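
For readers wondering what a "mean per-step reward with ±1 std band" involves, here is a minimal sketch of how such a band could be computed. It is not the script that produced `plots/grpo_iter2_reward_curve.png`. The function name `reward_band`, the input layout (one list of env rewards per training step), and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def reward_band(rewards_per_step):
    """Return (mean, lower, upper) per training step.

    rewards_per_step: list of lists, one inner list of env rewards per
    training step (hypothetical layout, assumed for this sketch).
    The band is mean ± 1 population standard deviation (an assumption;
    the original plot may use the sample standard deviation instead).
    """
    band = []
    for rewards in rewards_per_step:
        mean = fmean(rewards)
        std = pstdev(rewards)
        band.append((mean, mean - std, mean + std))
    return band


# Toy data only, not results from the repository.
example = reward_band([[1.0, 3.0], [2.0, 2.0]])
for step, (mean, lower, upper) in enumerate(example):
    print(f"step {step}: mean={mean:.2f} band=[{lower:.2f}, {upper:.2f}]")
```

The three values per step map directly onto a line plus a shaded region, for example with matplotlib's `plot` and `fill_between`.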