README: surface headline result table at top so judges don't need to click through
README.md CHANGED

@@ -20,12 +20,25 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
 - **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
 - **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
 - **Blog post**: [BLOG.md](BLOG.md) — *Teaching an AI to Know You (Without Asking)*
-- **
-- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
+- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
 - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
+- **Detailed results**: [docs/results.md](docs/results.md)
 - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
 - **Training plots**: [plots/](plots/) (also embedded below)
 
+## Headline result
+
+A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on a real held-out eval condition**:
+
+| Strategy | avg_score | adaptation | belief_MAE |
+|---|---|---|---|
+| Random | 0.426 | -0.174 | n/a |
+| Heuristic | 0.455 | -0.192 | n/a |
+| **Distilled Qwen 3B (ours)** | **0.527** | -0.267 | 0.379 |
+| | **+0.072 vs heuristic** | | |
+
+Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.621 OOD** with belief_MAE 0.196 on continuous profiles — confirming the env distinguishes inference quality cleanly. Full numbers in [docs/results.md](docs/results.md).
+
 ## Training evidence
 
 **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.
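For readers skimming the headline table, here is a rough sketch of how metrics like these are typically computed from evaluation episodes. The function name and episode format below are illustrative assumptions, not the repo's actual eval harness: avg_score is the mean per-step score, adaptation compares late-episode to early-episode performance, and belief_MAE measures how far the agent's inferred preference profile sits from the ground truth.

```python
from statistics import mean

def summarize(episodes):
    """Summarize evaluation episodes (assumed format, for illustration).

    Each episode is a dict with:
      - "scores": per-step scores in [0, 1]
      - "belief": the agent's inferred profile vector (None for
        baselines like Random/Heuristic that track no belief)
      - "profile": the ground-truth user profile vector
    """
    # avg_score: mean per-step score, averaged over episodes
    avg_score = mean(mean(ep["scores"]) for ep in episodes)

    # adaptation: late-half score minus early-half score per episode
    # (negative values mean performance degrades within an episode)
    def delta(ep):
        half = len(ep["scores"]) // 2
        return mean(ep["scores"][half:]) - mean(ep["scores"][:half])

    adaptation = mean(delta(ep) for ep in episodes)

    # belief_MAE: mean absolute error between inferred and true profile,
    # only for strategies that actually maintain a belief
    maes = [
        mean(abs(b - p) for b, p in zip(ep["belief"], ep["profile"]))
        for ep in episodes
        if ep.get("belief") is not None
    ]
    belief_mae = mean(maes) if maes else None

    return {"avg_score": avg_score, "adaptation": adaptation, "belief_MAE": belief_mae}
```

Under this reading, the negative adaptation values in the table would mean every strategy scores lower in the second half of an episode than the first, while the distilled student still wins on absolute score.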