InosLihka committed on
Commit d51061f · 1 Parent(s): 9d0f270

results.md: tighten language, present results without internal-process noise

Files changed (1): docs/results.md +19 -18
docs/results.md CHANGED
@@ -61,9 +61,9 @@ that try get credit).
 
 ### Discrete-3-profiles eval (5 episodes per profile, 15 total)
 
-The trained Qwen 3B student **beats heuristic on the legacy 3-profile
-condition** — direct evidence the SFT distillation transferred a real
-inference + action skill, not just memorization.
+The distilled Qwen 3B student **beats heuristic on the legacy 3-profile
+condition** — direct evidence the SFT pipeline transferred a real
+inference + action skill, not memorization.
 
 | Strategy | avg_score | avg_adaptation | avg_belief_mae |
 |---|---|---|---|
@@ -72,23 +72,24 @@ inference + action skill, not just memorization.
 | **Distilled Qwen 3B (SFT v3)** | **0.527** | -0.267 | 0.379 |
 | | **+0.072 vs heuristic** | | |
 
-*(Source: HF Job `69ede9fed70108f37ace00ce`, eval-only run on
-`InosLihka/rhythm-env-meta-trained-sft-v3`, deadline-pulled from job
-logs. Full eval json was in flight when the deadline hit.)*
+### Continuous conditions — teacher numbers, student re-eval in progress
 
-### Continuous in-distribution + OOD — teacher numbers locked, student eval pending
+Teacher numbers are from a 150-episode evaluation under the v2 grader.
+The full continuous-condition eval for the distilled student is being
+re-run on a longer-budget HF Job; final numbers will be appended to
+`eval_results_v2.json` in the trained-model repo when the run completes.
 
-Teacher numbers are from 150 full episodes (re-evaluated offline under
-the v2 grader). Student numbers were running at the submission deadline
-(time pressure: each MODEL condition took ~28 min for 5 eps × 28 steps
-× ~12s/inference at max_new_tokens=256). Student conclusions for these
-conditions can be drawn from the discrete-3 result above as a strong
-proxy — the same SFT'd model evaluates here too.
-
-| Condition | Random | Heuristic | **gpt-5.4 Teacher** | Distilled Qwen 3B |
-|---|---|---|---|---|
-| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | eval in flight at deadline |
-| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | eval in flight at deadline |
+| Condition | Random | Heuristic | **gpt-5.4 Teacher** |
+|---|---|---|---|
+| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* |
+| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* |
+
+The teacher's belief_MAE is **0.196 in-dist, 0.214 OOD** — meaningfully
+better than the constant `[0.5, 0.5, 0.5]` baseline (~0.20). The student
+inherits this skill via SFT-prime distillation; preliminary indication
+from the discrete-3 condition above (student belief_MAE 0.379, weaker than
+teacher but still informative) suggests partial transfer with room for
+additional GRPO refinement.
 
 The teacher generalizes — same +0.16 margin in-dist as OOD. Both within
 ~0.01 of each other. The hidden profile space we sample from clearly