README: surface headline result table at top so judges don't need to click through
README.md CHANGED

@@ -20,12 +20,25 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
 - **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
 - **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
 - **Blog post**: [BLOG.md](BLOG.md) — *Teaching an AI to Know You (Without Asking)*
-- **
-- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
+- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
 - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
+- **Detailed results**: [docs/results.md](docs/results.md)
 - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
 - **Training plots**: [plots/](plots/) (also embedded below)
 
+## Headline result
+
+A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on a real held-out eval condition**:
+
+| Strategy | avg_score | adaptation | belief_MAE |
+|---|---|---|---|
+| Random | 0.426 | -0.174 | n/a |
+| Heuristic | 0.455 | -0.192 | n/a |
+| **Distilled Qwen 3B (ours)** | **0.527** | -0.267 | 0.379 |
+| | **+0.072 vs heuristic** | | |
+
+Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.621 OOD** with belief_MAE 0.196 on continuous profiles — confirming the env distinguishes inference quality cleanly. Full numbers in [docs/results.md](docs/results.md).
+
 ## Training evidence
 
 **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.
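For readers skimming the headline table, here is a rough sketch of how metrics like these are typically computed from evaluation episodes. The function name and episode format below are illustrative assumptions, not the repo's actual eval harness: avg_score is the mean per-step score, adaptation compares late-episode to early-episode performance, and belief_MAE measures how far the agent's inferred preference profile sits from the ground truth.

```python
from statistics import mean

def summarize(episodes):
    """Summarize evaluation episodes (assumed format, for illustration).

    Each episode is a dict with:
      - "scores": per-step scores in [0, 1]
      - "belief": the agent's inferred profile vector (None for
        baselines like Random/Heuristic that track no belief)
      - "profile": the ground-truth user profile vector
    """
    # avg_score: mean per-step score, averaged over episodes
    avg_score = mean(mean(ep["scores"]) for ep in episodes)

    # adaptation: late-half score minus early-half score per episode
    # (negative values mean performance degrades within an episode)
    def delta(ep):
        half = len(ep["scores"]) // 2
        return mean(ep["scores"][half:]) - mean(ep["scores"][:half])

    adaptation = mean(delta(ep) for ep in episodes)

    # belief_MAE: mean absolute error between inferred and true profile,
    # only for strategies that actually maintain a belief
    maes = [
        mean(abs(b - p) for b, p in zip(ep["belief"], ep["profile"]))
        for ep in episodes
        if ep.get("belief") is not None
    ]
    belief_mae = mean(maes) if maes else None

    return {"avg_score": avg_score, "adaptation": adaptation, "belief_MAE": belief_mae}
```

Under this reading, the negative adaptation values in the table would mean every strategy scores lower in the second half of an episode than the first, while the distilled student still wins on absolute score.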