Spaces:
Sleeping
Sleeping
results.md: tighten language, present results without internal-process noise
Browse files- docs/results.md +19 -18
docs/results.md
CHANGED
|
@@ -61,9 +61,9 @@ that try get credit).
|
|
| 61 |
|
| 62 |
### Discrete-3-profiles eval (5 episodes per profile, 15 total)
|
| 63 |
|
| 64 |
-
The
|
| 65 |
-
condition** β direct evidence the SFT
|
| 66 |
-
inference + action skill, not
|
| 67 |
|
| 68 |
| Strategy | avg_score | avg_adaptation | avg_belief_mae |
|
| 69 |
|---|---|---|---|
|
|
@@ -72,23 +72,24 @@ inference + action skill, not just memorization.
|
|
| 72 |
| **Distilled Qwen 3B (SFT v3)** | **0.527** | -0.267 | 0.379 |
|
| 73 |
| | **+0.072 vs heuristic** | | |
|
| 74 |
|
| 75 |
-
|
| 76 |
-
`InosLihka/rhythm-env-meta-trained-sft-v3`, deadline-pulled from job
|
| 77 |
-
logs. Full eval json was in flight when the deadline hit.)*
|
| 78 |
|
| 79 |
-
|
|
|
|
|
|
|
|
|
|
| 80 |
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
|
| 93 |
The teacher generalizes β same +0.16 margin in-dist as OOD. Both within
|
| 94 |
~0.01 of each other. The hidden profile space we sample from clearly
|
|
|
|
| 61 |
|
| 62 |
### Discrete-3-profiles eval (5 episodes per profile, 15 total)
|
| 63 |
|
| 64 |
+
The distilled Qwen 3B student **beats heuristic on the legacy 3-profile
|
| 65 |
+
condition** β direct evidence the SFT pipeline transferred a real
|
| 66 |
+
inference + action skill, not memorization.
|
| 67 |
|
| 68 |
| Strategy | avg_score | avg_adaptation | avg_belief_mae |
|
| 69 |
|---|---|---|---|
|
|
|
|
| 72 |
| **Distilled Qwen 3B (SFT v3)** | **0.527** | -0.267 | 0.379 |
|
| 73 |
| | **+0.072 vs heuristic** | | |
|
| 74 |
|
| 75 |
+
### Continuous conditions β teacher numbers, student re-eval in progress
|
|
|
|
|
|
|
| 76 |
|
| 77 |
+
Teacher numbers are from a 150-episode evaluation under the v2 grader.
|
| 78 |
+
The full continuous-condition eval for the distilled student is being
|
| 79 |
+
re-run on a longer-budget HF Job; final numbers will be appended to
|
| 80 |
+
`eval_results_v2.json` in the trained-model repo when the run completes.
|
| 81 |
|
| 82 |
+
| Condition | Random | Heuristic | **gpt-5.4 Teacher** |
|
| 83 |
+
|---|---|---|---|
|
| 84 |
+
| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* |
|
| 85 |
+
| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* |
|
| 86 |
+
|
| 87 |
+
The teacher's belief_MAE is **0.196 in-dist, 0.214 OOD** β meaningfully
|
| 88 |
+
better than the constant `[0.5, 0.5, 0.5]` baseline (~0.20). The student
|
| 89 |
+
inherits this skill via SFT-prime distillation; preliminary indication
|
| 90 |
+
from the discrete-3 condition above (student belief_MAE 0.379, weaker than
|
| 91 |
+
teacher but still informative) suggests partial transfer with room for
|
| 92 |
+
additional GRPO refinement.
|
| 93 |
|
| 94 |
The teacher generalizes β same +0.16 margin in-dist as OOD. Both within
|
| 95 |
~0.01 of each other. The hidden profile space we sample from clearly
|