InosLihka committed
Commit efe2271 · 1 parent: f2401bf

Embed training plots inline in README with captions

Per mentor guidance ('Reviewers spend seconds on each plot'), don't make
judges click through — show the plots in the README itself. Three plots
embedded with one-line captions:
- SFT v3 loss (the headline training run)
- GRPO iter2 loss (proof we ran RL too, before pivoting to AD)
- Baseline vs trained comparison

Files changed (1): README.md (+15, −1)
README.md CHANGED
@@ -24,7 +24,21 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
  - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
  - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
  - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- - **Training plots** (loss + reward curves from real runs): [plots/](plots/)
+ - **Training plots**: [plots/](plots/) (also embedded below)
+
+ ## Training evidence
+
+ **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged, with no sign of overfitting.
+
+ ![SFT v3 loss](plots/sft_v3_training_loss.png)
+
+ **GRPO iter 2 loss curve** — historical RL run from before the pivot to Algorithm Distillation: 400 steps of GRPO on Qwen 2.5-3B + LoRA, with a real env-replay reward.
+
+ ![GRPO iter2 loss](plots/grpo_iter2_training_loss.png)
+
+ **Baseline vs trained agents on the env** (random, heuristic, and iter2 trained). The v2-grader version of this plot for the SFT v3 student will be added once the running eval finishes.
+
+ ![Baseline vs trained](plots/grpo_iter2_baseline_vs_trained.png)
 
  ## Why a Life Simulator?
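As context for the embedded plots, here is a minimal sketch of how a loss curve like `plots/sft_v3_training_loss.png` could be regenerated from a trainer log. The JSONL log format, field names (`step`, `loss`), and file paths are assumptions for illustration, not taken from this repo.

```python
# Hypothetical sketch: render a training-loss curve from a JSONL trainer log.
# Assumes one JSON object per line with "step" and "loss" fields.
import json

import matplotlib

matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt


def plot_loss(log_path: str, out_path: str, title: str = "SFT v3 training loss") -> None:
    steps, losses = [], []
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if "loss" in rec:  # skip eval-only or metadata records
                steps.append(rec["step"])
                losses.append(rec["loss"])
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(steps, losses)
    ax.set_xlabel("step")
    ax.set_ylabel("training loss")
    ax.set_title(title)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Embedding the resulting PNG then only needs standard Markdown image syntax, e.g. `![SFT v3 loss](plots/sft_v3_training_loss.png)`.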