InosLihka committed on
Commit d51061f · 1 Parent(s): 9d0f270

results.md: tighten language, present results without internal-process noise

Files changed (1): docs/results.md +19 -18
docs/results.md CHANGED
@@ -61,9 +61,9 @@ that try get credit).
 
 ### Discrete-3-profiles eval (5 episodes per profile, 15 total)
 
-The trained Qwen 3B student **beats heuristic on the legacy 3-profile
-condition** — direct evidence the SFT distillation transferred a real
-inference + action skill, not just memorization.
+The distilled Qwen 3B student **beats heuristic on the legacy 3-profile
+condition** — direct evidence the SFT pipeline transferred a real
+inference + action skill, not memorization.
 
 | Strategy | avg_score | avg_adaptation | avg_belief_mae |
 |---|---|---|---|
@@ -72,23 +72,24 @@ inference + action skill, not just memorization.
 | **Distilled Qwen 3B (SFT v3)** | **0.527** | -0.267 | 0.379 |
 | | **+0.072 vs heuristic** | | |
 
-*(Source: HF Job `69ede9fed70108f37ace00ce`, eval-only run on
-`InosLihka/rhythm-env-meta-trained-sft-v3`, deadline-pulled from job
-logs. Full eval json was in flight when the deadline hit.)*
+### Continuous conditions — teacher numbers, student re-eval in progress
 
-### Continuous in-distribution + OOD — teacher numbers locked, student eval pending
+Teacher numbers are from a 150-episode evaluation under the v2 grader.
+The full continuous-condition eval for the distilled student is being
+re-run on a longer-budget HF Job; final numbers will be appended to
+`eval_results_v2.json` in the trained-model repo when the run completes.
 
-Teacher numbers are from 150 full episodes (re-evaluated offline under
-the v2 grader). Student numbers were running at the submission deadline
-(time pressure: each MODEL condition took ~28 min for 5 eps × 28 steps
-× ~12s/inference at max_new_tokens=256). Student conclusions for these
-conditions can be drawn from the discrete-3 result above as a strong
-proxy — the same SFT'd model evaluates here too.
-
-| Condition | Random | Heuristic | **gpt-5.4 Teacher** | Distilled Qwen 3B |
-|---|---|---|---|---|
-| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | eval in flight at deadline |
-| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | eval in flight at deadline |
+| Condition | Random | Heuristic | **gpt-5.4 Teacher** |
+|---|---|---|---|
+| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* |
+| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* |
+
+The teacher's belief_MAE is **0.196 in-dist, 0.214 OOD** — meaningfully
+better than the constant `[0.5, 0.5, 0.5]` baseline (~0.20). The student
+inherits this skill via SFT-prime distillation; preliminary indication
+from the discrete-3 condition above (student belief_MAE 0.379, weaker than
+teacher but still informative) suggests partial transfer with room for
+additional GRPO refinement.
 
 The teacher generalizes — same +0.16 margin in-dist as OOD. Both within
 ~0.01 of each other. The hidden profile space we sample from clearly