handoff: add iter 5 results + whack-a-mole pattern doc
iter 5 finished but landed at 0.33 (vs heuristic 0.58). Third distinct
mode-collapse pattern: 86% SLEEP + a constant (0.5, 0.5, 0.5) belief in all 14
eval episodes. Belief MAE looks great (0.110) but it's a measurement
artifact: the neutral belief accidentally matches the average true profile
under the tighter profile cap.
Pattern across iterations:
iter 1: EXERCISE single-action collapse
iter 2: MEDITATE-EXERCISE 2-cycle collapse
iter 5: SLEEP-spam + neutral-belief collapse
Each fix closes one loophole and opens another. The unaddressed structural
issue (per-step GRPO vs per-episode grader misalignment) is the only path
forward: multi-step plan generation, from the external agent's analysis,
remains the single highest-leverage missing fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

| 3 | n/a | 800 steps + 7 fixes | Cancelled before run (stale code) |
| 4 (a100/l40s/h200 attempts) | various | various | Capacity-cancelled or H200/Unsloth incompat |
| **4 (a10g)** | a10g-large | LoRA 16, num_gen 8, 800 steps + further 7 fixes from external bug review | **CANCELLED at step 235 by mistake** based on stale API logs. UI showed it was healthy. ~$2.10 wasted. |
| **5 (a10g)** | a10g-large | LoRA 8, num_gen 4, 500 steps + same 7 fixes as iter 4 | **COMPLETED** but HF marked ERROR (job timeout AFTER upload, same iter-2 pattern). Trained model + plots + eval at `huggingface.co/InosLihka/rhythm-env-meta-trained-iter5` |

**Spend: ~$7.10 of $30 budget.**

## ITER 5 VERDICT (the data we have)

| Condition | iter1 | iter2 | **iter5** | Heuristic | Random |
|---|---|---|---|---|---|
| Continuous in-dist | 0.224 | 0.224 | **0.349** | **0.587** | 0.520 |
| Continuous OOD | 0.219 | 0.219 | **0.331** | **0.576** | 0.489 |
| Discrete-3 | 0.232 | 0.217 | **0.339** | **0.579** | 0.547 |

Iter 5 is BETTER than iter 1+2 (~+50%) but still ~40% below the heuristic.
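
The "+50% / ~40% below" summary can be sanity-checked directly from the table; the snippet below just recomputes the relative figures from the numbers already reported (no new data):

```python
# Scores copied verbatim from the verdict table above.
iter5 = {"in_dist": 0.349, "ood": 0.331, "discrete3": 0.339}
iter12_best = {"in_dist": 0.224, "ood": 0.219, "discrete3": 0.232}
heuristic = {"in_dist": 0.587, "ood": 0.576, "discrete3": 0.579}

for cond in iter5:
    gain = iter5[cond] / iter12_best[cond] - 1   # improvement over iter 1/2
    gap = 1 - iter5[cond] / heuristic[cond]      # shortfall vs heuristic
    print(f"{cond}: +{gain:.0%} over iter1/2, {gap:.0%} below heuristic")
```

Per condition the gain lands between roughly +46% and +56%, and the shortfall between roughly 41% and 43%, consistent with the rounded summary line.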

**Action distribution iter 5**: 86% SLEEP, 11.7% LEARN, 7 actions used.
Third distinct mode-collapse pattern (after the iter-1 EXERCISE collapse and
the iter-2 MEDITATE-EXERCISE 2-cycle collapse).

**Belief output iter 5**: ALL 14 eval episodes emit a constant `(0.5, 0.5, 0.5)`.
The belief baseline subtraction fix removed the free-reward attractor but
didn't create an incentive for real learning; the agent emits the math-optimal
"give up" belief.

**Belief MAE iter 5**: 0.110 in-dist (looks great!) but only because the
constant (0.5, 0.5, 0.5) accidentally matches the average true profile under
the new tighter profile cap. MEASUREMENT ARTIFACT, NOT LEARNING.
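
Why a constant belief scores well under a tight cap: the closer true traits are clamped to 0.5, the smaller the MAE of a fixed 0.5 guess, with zero learning involved. A minimal illustration — the actual cap range is not stated in this doc, so the `[0.35, 0.65]` band below is an assumption for demonstration only:

```python
import random

random.seed(0)
LO, HI = 0.35, 0.65  # ASSUMED cap band, not the repo's real value

def mae_constant_belief(belief=0.5, n=10_000):
    # MAE of a fixed belief against traits sampled uniformly in the cap band.
    traits = [random.uniform(LO, HI) for _ in range(n)]
    return sum(abs(belief - t) for t in traits) / n

print(f"MAE of constant 0.5 belief: {mae_constant_belief():.3f}")
# Analytically, E|0.5 - U[LO, HI]| = (HI - LO) / 4 = 0.075 for this band.
```

So under a cap anywhere near this tight, a never-updating belief already sits in the same ballpark as the reported 0.110, which is why the metric can't distinguish learning from giving up.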

## The whack-a-mole pattern: the structural issue

| Iter | Fix | New collapse mode |
|---|---|---|
| 1 | none | EXERCISE single-action |
| 2 | repetition penalty | MEDITATE-EXERCISE 2-cycle |
| 5 | belief baseline + cycle penalty + grader bias | SLEEP-spam + neutral belief |

Each fix closes one loophole and opens another. **The unaddressed structural
issue is the per-step GRPO vs per-episode grader misalignment**, flagged by the
external agent as the "multi-step plan generation" recommendation but never
implemented. That's the single highest-leverage missing fix.
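
The misalignment is easy to state in code. This is a hypothetical sketch, not the repo's actual trainer: when one episode-level grade is broadcast onto every step, the GRPO group-relative advantage is identical across a rollout's steps, so no individual action (e.g. one more SLEEP) is ever specifically credited or blamed:

```python
def per_step_advantages(episode_grades, steps_per_episode):
    """GRPO-style group baseline: advantage = grade - group mean,
    then copied onto every step of that episode."""
    mean = sum(episode_grades) / len(episode_grades)
    return [[g - mean] * steps_per_episode for g in episode_grades]

# Group of 4 rollouts; a SLEEP-spam rollout that matches the group mean
# (hypothetical grades) receives ~zero advantage at EVERY step.
grades = [0.30, 0.32, 0.28, 0.30]
adv = per_step_advantages(grades, steps_per_episode=5)
print(adv[0])  # one flat value repeated across all 5 steps
```

A flat per-step signal is why each patch only relocates the collapse; grading a generated multi-step plan as a single unit would make the unit of generation match the unit the grader actually scores.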

## The two rounds of fixes applied