InosLihka committed
Commit efe2271 · 1 parent: f2401bf

Embed training plots inline in README with captions

Per mentor guidance ('Reviewers spend seconds on each plot'), don't make
judges click through — show the plots in the README itself. Three plots
embedded with one-line captions:
- SFT v3 loss (the headline training run)
- GRPO iter2 loss (proof we ran RL too, before pivoting to AD)
- Baseline vs trained comparison

Files changed (1): README.md (+15, −1)
README.md CHANGED
@@ -24,7 +24,21 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
  - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
  - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
  - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- - **Training plots** (loss + reward curves from real runs): [plots/](plots/)
+ - **Training plots**: [plots/](plots/) (also embedded below)
+
+ ## Training evidence
+
+ **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged, with no sign of overfitting.
+
+ ![SFT v3 loss](plots/sft_v3_training_loss.png)
+
+ **GRPO iter 2 loss curve** — historical RL run from before the pivot to Algorithm Distillation: 400 steps of GRPO on Qwen 2.5-3B + LoRA, with a real env-replay reward.
+
+ ![GRPO iter2 loss](plots/grpo_iter2_training_loss.png)
+
+ **Baseline vs trained agents on the env** (random, heuristic, and iter2 trained). The v2-grader version of this plot for the SFT v3 student will be added once the running eval finishes.
+
+ ![Baseline vs trained](plots/grpo_iter2_baseline_vs_trained.png)
 
  ## Why a Life Simulator?
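As context for the embedded plots, here is a minimal sketch of how a loss curve like `plots/sft_v3_training_loss.png` could be regenerated from a trainer log. The JSONL log format, field names (`step`, `loss`), and file paths are assumptions for illustration, not taken from this repo.

```python
# Hypothetical sketch: render a training-loss curve from a JSONL trainer log.
# Assumes one JSON object per line with "step" and "loss" fields.
import json

import matplotlib

matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt


def plot_loss(log_path: str, out_path: str, title: str = "SFT v3 training loss") -> None:
    steps, losses = [], []
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if "loss" in rec:  # skip eval-only or metadata records
                steps.append(rec["step"])
                losses.append(rec["loss"])
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(steps, losses)
    ax.set_xlabel("step")
    ax.set_ylabel("training loss")
    ax.set_title(title)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Embedding the resulting PNG then only needs standard Markdown image syntax, e.g. `![SFT v3 loss](plots/sft_v3_training_loss.png)`.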