InosLihka committed
Commit 6226884 · 1 Parent(s): d51061f

README: surface headline result table at top so judges don't need to click through

Files changed (1)
  1. README.md +15 -2
README.md CHANGED
@@ -20,12 +20,25 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
  - **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
  - **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
  - **Blog post**: [BLOG.md](BLOG.md) — *Teaching an AI to Know You (Without Asking)*
- - **Headline results**: [docs/results.md](docs/results.md)
- - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1
+ - **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
  - **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
+ - **Detailed results**: [docs/results.md](docs/results.md)
  - **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
  - **Training plots**: [plots/](plots/) (also embedded below)

+ ## Headline result
+
+ A small student (Qwen 2.5-3B, 4-bit, LoRA), distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on a genuinely held-out eval condition**:
+
+ | Strategy | avg_score | adaptation | belief_MAE |
+ |---|---|---|---|
+ | Random | 0.426 | -0.174 | n/a |
+ | Heuristic | 0.455 | -0.192 | n/a |
+ | **Distilled Qwen 3B (ours)** | **0.527** | -0.267 | 0.379 |
+ | | **+0.072 vs heuristic** | | |
+
+ The gpt-5.4 teacher (the upper-bound reference) reaches **0.611 in-dist / 0.621 OOD** with belief_MAE 0.196 on continuous profiles — confirming the env cleanly separates strategies by inference quality. Full numbers in [docs/results.md](docs/results.md).
+
  ## Training evidence
43
 
44
  **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.
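
For reviewers who want to poke at the headline row: a minimal sketch of loading a 4-bit + LoRA student like the linked sft-v3 checkpoint. The base model name and the assumption that the repo ships a PEFT-format adapter (rather than merged weights) are guesses; check the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE = "Qwen/Qwen2.5-3B-Instruct"                     # assumed base model
ADAPTER = "InosLihka/rhythm-env-meta-trained-sft-v3"  # from the README diff

# 4-bit NF4 quantization, the usual QLoRA-style eval setup
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER).eval()  # attach the distilled LoRA adapter
```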
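On the belief_MAE column: it presumably measures how far the agent's inferred user-profile values sit from the hidden ground truth. A minimal sketch under that assumption; the trait names, [0, 1] scale, and 0.5 uninformed prior are illustrative, not the env's actual definition (that lives in docs/results.md and the Space code).

```python
import numpy as np

def belief_mae(inferred: dict[str, float], true_profile: dict[str, float]) -> float:
    """MAE between inferred and ground-truth profile values, averaged over
    traits. Trait names and [0, 1] scaling are illustrative assumptions."""
    traits = sorted(true_profile)
    pred = np.array([inferred.get(t, 0.5) for t in traits])  # 0.5 = uninformed prior
    true = np.array([true_profile[t] for t in traits])
    return float(np.abs(pred - true).mean())

# belief_mae({"morning_person": 0.9, "likes_nudges": 0.2},
#            {"morning_person": 0.7, "likes_nudges": 0.4})  -> 0.2
```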
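And on the "distillation training on 5,040 (state, response) pairs" line: Algorithm Distillation here reduces to plain SFT, cross-entropy on the teacher's response given the serialized state. A sketch of that objective; the field names and prompt serialization are assumptions, and the real pipeline is in training/RhythmEnv_GRPO_Training.ipynb.

```python
import torch

def distillation_loss(model, tokenizer, state: str, response: str) -> torch.Tensor:
    """Cross-entropy on the teacher's response tokens only; the serialized
    environment state (the prompt) is masked out of the loss."""
    prompt_ids = tokenizer(state, return_tensors="pt").input_ids
    full_ids = tokenizer(state + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # HF causal LMs ignore -100 labels
    return model(input_ids=full_ids, labels=labels).loss
```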