Spaces: InosLihka/rhythm_env

InosLihka Claude Opus 4.7 (1M context) committed 29 days ago

Commit

e12fc69

1 Parent(s): 64d24b3

docs: iteration journal with hypothesis/result/root-cause/fix per iter

Captures the training journey honestly:
- Iter 0 (pre-existing): single-task training regressed
- Refactor: meta-RL conversion (continuous profiles + belief + adaptation)
- Iter 1: mode collapse to single action (constant rewards = no GRPO gradient)
- Iter 2: mode collapse to 2-cycle (proxy/goal misalignment exposed)
- Iter 3: in flight with 5 architectural fixes + belief-first format

Includes 'Why we missed it' sections for each failure - useful for both
future maintenance and hackathon storytelling. Honest post-mortems are
better submission material than polished success-only writeups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)

docs/iterations.md +272 -0

docs/iterations.md ADDED

	@@ -0,0 +1,272 @@

+# RhythmEnv Training Journey — Iteration Log
+A structured record of every training iteration: what we expected, what
+happened, what broke, why we missed it, and what we changed next.
+This doubles as raw material for the hackathon blog post. The "Why we missed
+it" sections are deliberately honest — judges and future maintainers benefit
+from the failure post-mortems more than from polished success stories.
+---
+## Iter 0 (pre-existing): Original v1 single-task training
+**Date**: pre-2026-04-25
+**Config**: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded
+profiles (introvert / extrovert / workaholic), action-only output.
+**What we expected**: Trained agent should beat the heuristic baseline on at
+least 1-2 of the 3 profiles. The env exposed enough information (meter deltas,
+anomaly signals, step history) that a well-trained agent should discover the
+profile from rewards.
+**What we got**:
+| Profile | Heuristic | Trained 500-step | Δ |
+|---|---|---|---|
+| Introvert Morning | 0.765 | 0.617 | **−0.148** ❌ |
+| Extrovert Night Owl | 0.819 | 0.725 | **−0.094** ❌ |
+| Workaholic Stoic | 0.761 | 0.539 | **−0.222** ❌ |
+**Root cause** (identified in retro):
+1. Env was *designed* for meta-learning (3 hidden profiles) but *trained* as
+   single-task RL — no scaffolding to teach the inference skill.
+2. Grader had a `0.30 × meter_balance` term that rewarded random behavior
+   (random has high meter variance by chance).
+3. Only 3 profiles → memorizable, not learnable as a skill.
+4. No explicit "form a model of the user" output → no gradient pushing the
+   model toward inference.
+**The pivot**: redesign rhythm_env as a meta-RL environment.
+---
+## Refactor: meta-RL conversion (2026-04-25)
+A large but targeted refactor:
+- **Continuous profile space** via `sample_profile(seed)` — Dirichlet weights
+  + uniform-bounded modifiers. Memorization impossible.
+- **Belief output** added to action format: `ACTION_NAME S M W`.
+- **`belief_accuracy` reward**: MAE-based, range [-0.5, +0.5], compares
+  emitted belief vector to ground-truth `profile_to_belief_vector(profile)`.
+- **Grader rewrite**: dropped `meter_balance` (rewarded random), added
+  `adaptation_score` (credit for getting better mid-episode).
+- **Curriculum**: `hint_fraction=0.15` of training samples include true
+  profile vector in prompt as warmup.
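The sampler and the belief reward described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the five-entry weight vector follows the five meters named later in this log, while the modifier count and bounds, the three-component belief in [0, 1], and the `0.5 - MAE` mapping are assumptions chosen to match the stated [-0.5, +0.5] range.

```python
import random

def sample_profile(seed, n_meters=5):
    """Hypothetical sampler: Dirichlet meter weights plus bounded modifiers."""
    rng = random.Random(seed)
    # A Dirichlet(1, ..., 1) sample is a set of Gamma(1, 1) draws, normalized.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_meters)]
    total = sum(draws)
    weights = [d / total for d in draws]
    # The number of modifiers and their bounds are invented for this sketch.
    modifiers = [rng.uniform(0.5, 1.5) for _ in range(3)]
    return {"weights": weights, "modifiers": modifiers}

def belief_accuracy(belief, truth):
    """MAE-based reward in [-0.5, +0.5], assuming all entries lie in [0, 1]."""
    mae = sum(abs(b - t) for b, t in zip(belief, truth)) / len(truth)
    return 0.5 - mae

profile = sample_profile(seed=7)
print(profile["weights"])
print(belief_accuracy([0.5, 0.5, 0.5], [0.4, 0.5, 0.3]))
```

Because every seed yields a different point in a continuous space, there is no finite set of profiles for the policy to memorize, which is the property the refactor relies on.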
+Pre-training baselines (under new grader) — what trained agent must beat:
+| Condition | Heuristic | Random | Adaptation |
+|---|---|---|---|
+| discrete-3-profiles | 0.584 | 0.554 | both negative |
+| **continuous-in-distribution** | **0.587** | 0.516 | both negative |
+| **continuous-OOD** | **0.580** | 0.508 | both negative |
+---
+## Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)
+**Hypothesis**: With FAST_MODE preset (200 steps, temp 1.0, beta 0.04,
+weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least
+not regress vs random — and we'd see whether the meta-RL signal is strong
+enough to actually learn.
+**Config**: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5,
+LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.
+**What we got**:
+- final_score 0.224 in-dist, 0.219 OOD — **worse than random** (0.516, 0.508).
+- Action distribution: **99.7% `EXERCISE`** (one episode had a single `LEARN`).
+- Final beliefs all "5 5 5" — the neutral default.
+- belief_accuracy DID climb to +0.43 around step 100-150 before collapsing.
+**Root cause: catastrophic mode collapse**
+The training log told the story:
+| step | reward_std | meaning |
+|---|---|---|
+| 1 | 0.144 | Healthy diversity in 4 completions per prompt |
+| 50 | 0.056 | Diversity shrinking |
+| **100** | **0.000** | **All 4 completions identical → GRPO has zero gradient** |
+| 200 | 0.000 | Policy permanently frozen |
+`format_valid` returned +1.0 for any valid output. `action_legal` returned
++0.5 for any valid action. Both layers gave **the same constant reward
+across all 4 completions in a GRPO group**. GRPO computes advantage as
+`reward - group_mean`, so constant layers contribute exactly zero to the
+gradient. The only learning signal came from `env_reward` and
+`belief_accuracy`.
+When the policy drifted toward the shortest-token action (`EXERCISE`) +
+neutral belief (`5 5 5`), all 4 completions converged to that exact string.
+`reward_std → 0`, gradient → 0, policy frozen.
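The zero-gradient mechanism is easy to reproduce by hand. The sketch below uses the iter 1 layer weights and the advantage formula quoted above; the per-layer reward values and the order of the two variable layers are invented for illustration.

```python
from statistics import mean

def total_rewards(layer_rewards, weights):
    """Weighted sum of reward layers for each completion in one group."""
    return [sum(w * r for w, r in zip(weights, row)) for row in layer_rewards]

def group_advantages(rewards):
    """Advantage as described above: reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Assumed layer order: format_valid, action_legal, env_reward, belief_accuracy.
weights = [0.3, 0.3, 1.0, 1.0]

# Four identical valid completions: every layer returns the same value.
collapsed = [[1.0, 0.5, 0.2, 0.0]] * 4
# Four distinct valid completions: only the last two layers vary.
diverse = [
    [1.0, 0.5, 0.2, 0.0],
    [1.0, 0.5, -0.1, 0.3],
    [1.0, 0.5, 0.4, 0.1],
    [1.0, 0.5, 0.0, -0.2],
]

collapsed_adv = group_advantages(total_rewards(collapsed, weights))
diverse_adv = group_advantages(total_rewards(diverse, weights))
print(collapsed_adv)  # all zeros: nothing to learn from this group
print(diverse_adv)    # non-zero, driven only by the two variable layers
```

Implementations that additionally divide by the group standard deviation behave the same way on the collapsed group, since the numerator is already zero.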
+**Why we missed it**:
+- I launched 3 review subagents pre-training. The first (correctness/reward
+  bugs) was rejected by the user. That subagent's prompt explicitly asked
+  *"could one layer dominate the total reward and drown out the others?"* —
+  it would have caught the constant-reward issue.
+- My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage)
+  with DIFFERENT random actions per kind. It never tested the case where 4
+  completions for the same prompt are identical valid actions — exactly what
+  GRPO sees during sampling. If it had, the test would have shown
+  `format_valid_std = 0` and I'd have caught this for free.
+- "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's
+  R1-Zero paper discusses it). I should have caught it during reward design.
+**Lessons banked**:
+- Constant-output reward layers must be diagnosed during reward design, not
+  discovered through GPU spend.
+- Bug-finding subagents should be non-skippable for any RL setup change.
+- Smoke tests must include "all-identical-completions" as a case.
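A smoke test along these lines might look like the sketch below. The two reward functions are stand-ins written for this example (the real `format_valid` and `action_legal` are not shown in this log, and the penalty values for invalid output are assumptions); the point is the per-layer standard deviation check across one group.

```python
from statistics import pstdev

VALID_ACTIONS = {"EXERCISE", "LEARN", "MEDITATE"}  # only the actions named in this log

def format_valid(completion):
    """Stand-in: +1.0 for `ACTION_NAME S M W` with digit beliefs, else -1.0."""
    parts = completion.split()
    ok = len(parts) == 4 and all(p.isdigit() for p in parts[1:])
    return 1.0 if ok else -1.0

def action_legal(completion):
    """Stand-in: +0.5 for a known action name, else -0.5."""
    parts = completion.split()
    return 0.5 if parts and parts[0] in VALID_ACTIONS else -0.5

def constant_layers(group, layers, tol=1e-9):
    """Names of reward layers that score every completion in the group the same."""
    flagged = []
    for name, fn in layers.items():
        if pstdev([fn(c) for c in group]) <= tol:
            flagged.append(name)
    return flagged

layers = {"format_valid": format_valid, "action_legal": action_legal}
identical = ["EXERCISE 5 5 5"] * 4
distinct_valid = ["EXERCISE 5 5 5", "LEARN 3 7 2", "MEDITATE 6 4 5", "EXERCISE 2 2 8"]

print(constant_layers(identical, layers))
print(constant_layers(distinct_valid, layers))
```

Both groups flag both layers: these layers are constant for any group of valid completions, not only for identical ones, so they stop contributing as soon as the model learns the format.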
+---
+## Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)
+**7 fixes applied** (4 from initial diagnosis + 3 from a re-launched
+correctness review subagent that found additional bugs):
+1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
+2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress
+   saturated layers, amplify variable ones)
+3. `action_legal` returns 0 for valid (was +0.5) — pure penalty layer
+4. Explicit repetition penalty in `env_reward` (-0.3 if action would make
+   3+ in a row)
+5. **CRITICAL-2** (subagent): `_grade_episode` `late_quality` normalization
+   was using [-1, +1] but per-step rewards are clamped to [-3, +3]. Fixed.
+6. **MAJOR-3** (subagent): `hint_fraction=0.15` created train-eval
+   distribution shift (eval had no hints). Set to 0.0.
+7. **MAJOR-1** (subagent): seed fallback `i % 50` could create deterministic
+   reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.
+Plus FAST_MODE bumped: 200 → 400 steps.
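The effect of fix 7 can be checked with a few lines. The two formulas come from the list above; the sample count of 400 is an assumption for illustration only.

```python
def old_seed(i):
    """Original fallback: wraps around every 50 samples."""
    return i % 50

def new_seed(i):
    """Hardened fallback from fix 7."""
    return (i * 17) ^ 0xBEEF

n_samples = 400  # hypothetical dataset size
old_unique = len({old_seed(i) for i in range(n_samples)})
new_unique = len({new_seed(i) for i in range(n_samples)})
print(old_unique)  # 50
print(new_unique)  # 400
```

With the old fallback, samples 0, 50, 100, ... share a seed and therefore a profile. The new mapping is one-to-one for non-negative `i`, because multiplying by 17 and XOR-ing with a constant are both injective.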
+**Hypothesis**: With saturated layers suppressed and explicit anti-repetition
+penalty, the agent should escape single-action collapse and produce varied
+behavior. Belief accuracy should continue rising past iter 1's +0.43.
+**What we got**:
+- final_score: **0.224 in-dist, 0.219 OOD** — *literally identical to iter 1*.
+- Action distribution: 54.8% MEDITATE, 45.2% EXERCISE — **no other actions used**.
+- Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4) — slightly better than
+  pure neutral.
+- belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✅
+- `reward_std` collapsed to 0 at step 200 then **recovered** to 0.06+ after
+  the repetition penalty kicked in. Partial escape from collapse.
+**Root cause: 2-cycle reward hacking**
+The single-action collapse was prevented (good!) but the agent found a new
+hack: alternating MEDITATE and EXERCISE. The repetition penalty caught
+"3+ same in a row" but missed the M-E-M-E-... 2-cycle.
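The loophole is visible in a few lines. This is a sketch of a "3+ in a row" check as described in fix 4, not the actual `env_reward` code; the function names are made up for the example.

```python
def would_repeat_three(history, action):
    """True if taking `action` now makes three identical actions in a row."""
    return len(history) >= 2 and history[-1] == action and history[-2] == action

def repetition_penalty(history, action):
    """-0.3 when the action extends a run to 3+, otherwise no penalty."""
    return -0.3 if would_repeat_three(history, action) else 0.0

def total_penalty(actions):
    """Accumulated repetition penalty over a whole action sequence."""
    history, total = [], 0.0
    for action in actions:
        total += repetition_penalty(history, action)
        history.append(action)
    return total

two_cycle = ["MEDITATE", "EXERCISE"] * 5
single_action = ["EXERCISE"] * 10

print(total_penalty(two_cycle))      # 0.0: the alternation is never penalized
print(total_penalty(single_action))  # negative: penalized from the third step on
```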
+Deeper issue exposed: **proxy/goal misalignment**. The agent achieved
+high `env_reward` (+1.25 mean by step 400) but low `final_score` (0.22).
+Sample episode final state: `V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22`.
+The agent maxed Vitality / Cognition / Serenity (which the per-step
+`profile_weighted_reward` rewards via Dirichlet-sampled weights heavy on
+those meters), but ignored Progress (0.0!) and Connection (decayed to 0.22).
+The grader weights Progress 0.25 + Connection 0.15 — agent ignored 40% of
+the score.
+The fundamental issue: profile-weighted per-step reward and the grader
+optimize different things. The agent did exactly what we trained it to do —
+just not what we wanted it to do.
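As a rough worked example of how much score the sample episode left on the table, the snippet below applies the two grader weights quoted above to the sample final state. It assumes those grader terms are linear in the final meter value, which this log does not confirm; the remaining grader terms are left out because their weights are not stated here.

```python
# Final meter values from the sample episode above.
final_state = {"V": 1.0, "C": 1.0, "P": 0.0, "S": 1.0, "Cn": 0.22}

# Only the grader weights quoted in this log.
grader_weights = {"P": 0.25, "Cn": 0.15}

# Assumption: each term contributes weight * final meter value.
earned = sum(w * final_state[k] for k, w in grader_weights.items())
available = sum(grader_weights.values())
print(f"earned {earned:.3f} of {available:.2f}")  # earned 0.033 of 0.40
```

Under that assumption, the Progress and Connection terms together contribute about 0.03 of a possible 0.40, regardless of how high the other three meters are.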
+**Why we missed it**:
+- The repetition penalty was scoped too narrowly (3-in-a-row) without
+  considering N-cycles. A simple "any low-entropy window" check would have
+  covered it.
+- The proxy/goal misalignment was hidden in plain sight: per-step reward
+  shape (profile-weighted) ≠ grader shape (progress + connection +
+  adaptation). I assumed they'd correlate enough.
+- We never traced one concrete GRPO group end to end (4 completions × one
+  specific prompt → group reward → advantage) before submitting iter 2.
+**Lessons banked**:
+- Anti-repetition checks must include window-entropy tests, not just
+  immediate repetition.
+- The training reward MUST be aligned with the eval grader, or the agent
+  optimizes the wrong objective.
+- "Belief output" is useless if it doesn't influence action selection.
+  Belief was emitted as a string AFTER the action — no causal pathway from
+  belief to action.
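One way to express the window-entropy idea from the first lesson is a Shannon entropy over the most recent actions. This is a sketch only: the window size and the `min_bits` threshold are hypothetical values, not settings taken from the project.

```python
import math
from collections import Counter

def window_entropy(actions, window=6):
    """Shannon entropy (in bits) of the action distribution in the last `window` steps."""
    recent = actions[-window:]
    n = len(recent)
    if n == 0:
        return 0.0
    counts = Counter(recent)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def low_entropy(actions, window=6, min_bits=1.2):
    """True once a full window exists and its entropy is below the threshold."""
    return len(actions) >= window and window_entropy(actions, window) < min_bits

single = ["EXERCISE"] * 6
two_cycle = ["MEDITATE", "EXERCISE"] * 3
varied = ["MEDITATE", "EXERCISE", "LEARN"] * 2

for name, seq in [("single", single), ("two_cycle", two_cycle), ("varied", varied)]:
    print(name, round(window_entropy(seq), 3), low_entropy(seq))
```

A single repeated action scores 0 bits and a strict 2-cycle scores 1 bit, so one threshold covers both collapse modes seen in iters 1 and 2.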
+---
+## Iter 3: Align reward + restructure format (in flight at time of writing, ~$5 budgeted, 800 steps)
+**5 architectural fixes**:
+1. **Per-step reward grader-alignment** (`_compute_reward`): add a
+   profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`.
+   The profile-weighted reward still drives belief inference, but the agent
+   now always loses reward when progress or connection falls and gains it
+   when they rise, whatever weights the sampled profile assigns.
+2. **Belief-first output format** (`S M W ACTION_NAME`): in causal LM
+   generation, tokens generated EARLIER condition LATER tokens. With belief
+   tokens first, the action is now causally conditioned on the belief — making
+   the belief functionally useful for action selection. The previous order
+   ("ACTION S M W") made belief a post-hoc afterthought.
+3. **N-cycle penalty** (`env_reward`): if the last 6 actions have ≤2 unique
+   values, -0.4. Closes the M-E alternation loophole and single-action
+   repetition; a cycle over 3+ distinct actions would not trigger it.
+4. **New-action exploration bonus** (`env_reward`): +0.2 reward for taking
+   an action that hasn't appeared in the current episode (until 6+ unique
+   actions tried). Pushes the agent to PROBE varied actions early —
+   the canonical meta-RL exploration signal.
+5. **Sparse terminal reward** (env `step()` at done=True): add
+   `(final_score - 0.5) × 5` to the last step's reward. Direct supervision
+   on the actual grader, range [-2.5, +2.5], strong enough to dominate any
+   local reward-hack.
+Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance),
+LoRA rank 8 → 16 (more capacity).
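The four numeric shaping terms from the list above can be written down compactly. This is a sketch using the constants quoted in fixes 1, 3, 4 and 5; the function names, the choice to require a full 6-action window, and whether the current action is counted in the window are assumptions, not the actual `_compute_reward` / `env_reward` code.

```python
def grader_bias(progress_delta, connection_delta):
    """Fix 1: profile-independent term added to the per-step reward."""
    return 0.5 * progress_delta + 0.4 * connection_delta

def cycle_penalty(actions_so_far, window=6, max_unique=2, penalty=-0.4):
    """Fix 3: penalize when the last `window` actions (current one included)
    contain at most `max_unique` distinct values."""
    recent = actions_so_far[-window:]
    if len(recent) == window and len(set(recent)) <= max_unique:
        return penalty
    return 0.0

def exploration_bonus(previous_actions, action, cap=6, bonus=0.2):
    """Fix 4: reward an action not yet seen this episode, until `cap` unique
    actions have been tried."""
    seen = set(previous_actions)
    if action not in seen and len(seen) < cap:
        return bonus
    return 0.0

def terminal_bonus(final_score):
    """Fix 5: sparse reward added on the last step; [-2.5, +2.5] for scores in [0, 1]."""
    return (final_score - 0.5) * 5

print(cycle_penalty(["MEDITATE", "EXERCISE"] * 3))           # -0.4
print(cycle_penalty(["MEDITATE", "EXERCISE", "LEARN"] * 2))  # 0.0
print(terminal_bonus(0.224))  # strongly negative for the iter 1/2 final score
```

Note that a strict 3-cycle passes the cycle check, since it has three unique values in the window; the exploration bonus and the terminal term are the parts that would have to discourage it.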
+**Hypothesis**: With grader-aligned reward + belief-first format + cycle
+penalty + exploration bonus + terminal supervision, the agent should:
+- Use ≥5 unique actions per episode (varied behavior)
+- Maintain belief_accuracy > +0.30 (don't regress)
+- Beat random in 2/3 conditions on final_score
+- Show positive (or less-negative) adaptation than baselines
+**Result**: TBD when iter 3 completes (~30-40 min after submission).
+---
+## Spend tracker
+| Iter | Cost | Steps | Outcome |
+|---|---|---|---|
+| 1 | ~$0.50 | 200 | Mode collapse to single action |
+| 2 | ~$1.50 | 400 | Mode collapse to 2-cycle |
+| 3 | ~$5 (est) | 800 | TBD |
+| **Subtotal** | **~$7** | | |
+| Budget | $30 | | $23 remaining |
+---
+## What we'll write up regardless of iter 3 outcome
+The iteration journey itself is hackathon material. Even if iter 3 doesn't
+hit the "trained > heuristic" bar, we have:
+1. **Working meta-RL infrastructure** — continuous profile space + belief
+   output + adaptation grader. Novel, defensible.
+2. **Clear post-mortem of failure modes** — most teams won't have this
+   honesty in their writeup.
+3. **Belief learning evidence** — even from iter 2, belief_accuracy +0.36
+   shows the agent IS learning to model users.
+4. **Reward design lessons**: the "constant reward → mode collapse" failure
+   is a known GRPO pitfall, but a concrete trace of it is worth writing up.
+The blog post should lead with the *thesis* (meta-RL for personalization),
+include the *journey* (iter 1 collapse → iter 2 partial escape → iter 3
+fix), and frame whatever final result we get honestly.