Embed training plots inline in README with captions
Per mentor guidance ('Reviewers spend seconds on each plot'), don't make
judges click through; show the plots in the README itself. Three plots are
embedded with one-line captions:
- SFT v3 loss (the headline training run)
- GRPO iter2 loss (proof we ran RL too, before pivoting to AD)
- Baseline vs trained comparison
README.md
CHANGED
@@ -24,7 +24,21 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
 - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
 - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
 - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
-- **Training plots**
+- **Training plots**: [plots/](plots/) (also embedded below)
+
+## Training evidence
+
+**SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.
+
+![SFT v3 loss curve](plots/sft_v3_loss.png)
+
+**GRPO iter 2 — historical RL run** before we pivoted to Algorithm Distillation. 400 steps of GRPO on Qwen 2.5-3B + LoRA, real env-replay reward.
+
+![GRPO iter 2 loss curve](plots/grpo_iter2_loss.png)
+
+**Baseline vs trained agents on the env** (random + heuristic + iter2 trained). The v2-grader version of this plot for the SFT v3 student lands after the running eval finishes.
+
+![Baseline vs trained](plots/baseline_vs_trained.png)
 
 ## Why a Life Simulator?
 