InosLihka Claude Opus 4.7 (1M context) committed on
Commit 839a758 · 1 Parent(s): 7bb9278

handoff: add iter 5 results + whack-a-mole pattern doc


iter 5 finished but landed at 0.33 (vs heuristic 0.58). Third distinct
mode-collapse pattern: 86% SLEEP plus a constant (0.5, 0.5, 0.5) belief in
all 14 eval episodes. Belief MAE looks great (0.110), but it's a measurement
artifact: the neutral belief accidentally matches the average true profile
under the tighter profile cap.

Pattern across iterations:
iter 1: EXERCISE single-action collapse
iter 2: MEDITATE-EXERCISE 2-cycle collapse
iter 5: SLEEP-spam + neutral-belief collapse

Each fix closes one loophole and opens another. The unaddressed structural
issue (per-step GRPO vs per-episode grader) is the only path forward:
multi-step plan generation, from the external agent's analysis, remains the
single highest-leverage missing fix. A sketch of the mismatch follows below.
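
A minimal sketch of that mismatch (illustrative Python only; the function
names and shapes here are invented for this commit message, not this repo's
actual trainer or grader code):

```python
# Illustrative only -- invented names, not the repo's trainer. The point:
# today each GRPO completion is a single action, but reward comes from an
# episode-level grader, so every step inherits one episode-wide scalar.

def per_step_rewards(episode_grade: float, num_steps: int) -> list[float]:
    # Current setup: a lone SLEEP at step t is credited or blamed for the
    # whole episode. A "safe" degenerate action (SLEEP-spam) becomes a
    # stable attractor because its per-step advantage is rarely negative.
    return [episode_grade] * num_steps

def per_plan_reward(plan_grade: float) -> list[float]:
    # Proposed fix (multi-step plan generation): one completion = one full
    # multi-step plan, graded as a unit, so GRPO's group-relative advantage
    # compares whole plans -- the object the grader actually scores.
    return [plan_grade]
```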

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. docs/handoff_for_next_session.md +38 -2
docs/handoff_for_next_session.md CHANGED
@@ -32,9 +32,45 @@ the proven input layer for production.
  | 3 | n/a | 800 steps + 7 fixes | Cancelled before run (stale code) |
  | 4 (a100/l40s/h200 attempts) | various | various | Capacity-cancelled or H200/Unsloth incompat |
  | **4 (a10g)** | a10g-large | LoRA 16, num_gen 8, 800 steps + further 7 fixes from external bug review | **CANCELLED at step 235 by mistake** based on stale API logs. UI showed it was healthy. ~$2.10 wasted. |
- | **5 (a10g)** | a10g-large | LoRA 8, num_gen 4, 500 steps + same 7 fixes as iter 4 | **RUNNING** job `69eda027d70108f37acdf9a7` |
+ | **5 (a10g)** | a10g-large | LoRA 8, num_gen 4, 500 steps + same 7 fixes as iter 4 | **COMPLETED** but HF marked ERROR (job timeout AFTER upload, same iter-2 pattern). Trained model + plots + eval at `huggingface.co/InosLihka/rhythm-env-meta-trained-iter5` |
 
- **Spend: ~$5.60 of $30 budget.**
+ **Spend: ~$7.10 of $30 budget.**
+
+ ## ITER 5 VERDICT (the data we have)
+
+ | Condition | iter1 | iter2 | **iter5** | Heuristic | Random |
+ |---|---|---|---|---|---|
+ | Continuous in-dist | 0.224 | 0.224 | **0.349** | **0.587** | 0.520 |
+ | Continuous OOD | 0.219 | 0.219 | **0.331** | **0.576** | 0.489 |
+ | Discrete-3 | 0.232 | 0.217 | **0.339** | **0.579** | 0.547 |
+
+ Iter 5 is BETTER than iters 1 and 2 (+50%) but still ~40% below the heuristic.
+
+ **Action distribution iter 5**: 86% SLEEP, 11.7% LEARN, 7 actions used.
+ Third distinct mode-collapse pattern (after the iter-1 EXERCISE collapse and
+ the iter-2 MEDITATE-EXERCISE 2-cycle collapse).
+
+ **Belief output iter 5**: ALL 14 eval episodes emit a constant `(0.5, 0.5, 0.5)`.
+ The belief baseline-subtraction fix removed the free-reward attractor but
+ didn't create an incentive for real learning. The agent emits the math-optimal
+ "give up" belief.
+
+ **Belief MAE iter 5**: 0.110 in-dist (looks great!), but only because the
+ constant (0.5, 0.5, 0.5) accidentally matches the average true profile under
+ the new, tighter profile cap. MEASUREMENT ARTIFACT, NOT LEARNING.
+
+ ## The whack-a-mole pattern: the structural issue
+
+ | Iter | Fix | New collapse mode |
+ |---|---|---|
+ | 1 | none | EXERCISE single-action |
+ | 2 | repetition penalty | MEDITATE-EXERCISE 2-cycle |
+ | 5 | belief baseline + cycle penalty + grader bias | SLEEP-spam + neutral belief |
+
+ Each fix closes one loophole and opens another. **The unaddressed structural
+ issue is the per-step GRPO vs per-episode grader misalignment**, flagged by
+ the external agent as the "multi-step plan generation" recommendation but
+ never implemented. That is the single highest-leverage missing fix.
 
  ## The two rounds of fixes applied
 
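
To make the MEASUREMENT ARTIFACT claim concrete, here is a minimal,
self-contained sketch. Everything in it is assumed, not read from the env
code: the uniform sampler, and `CAP = 0.22`, chosen only because it puts the
expected MAE near the reported 0.110.

```python
import random

random.seed(0)
CAP = 0.22  # ASSUMED cap half-width, NOT the env's real value

def sample_profile():
    # Assumed sampler: true profiles uniform within +/-CAP of neutral 0.5.
    return [random.uniform(0.5 - CAP, 0.5 + CAP) for _ in range(3)]

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

neutral = (0.5, 0.5, 0.5)  # the constant "give up" belief from iter 5
episodes = [sample_profile() for _ in range(14)]  # 14 eval episodes
print(sum(mae(neutral, p) for p in episodes) / len(episodes))
# ~0.11 in expectation, since E|U(0.5-c, 0.5+c) - 0.5| = c/2 with c = 0.22.
```

Under these assumptions the constant neutral belief scores MAE ≈ CAP/2
without learning anything, so a low MAE says nothing about calibration once
the profile cap is tight.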