# RhythmEnv Training Journey: Iteration Log

A structured record of every training iteration: what we expected, what
happened, what broke, why we missed it, and what we changed next.

This doubles as raw material for the hackathon blog post. The "Why we missed
it" sections are deliberately honest; judges and future maintainers benefit
from the failure post-mortems more than from polished success stories.

---
## Iter 0 (pre-existing): Original v1 single-task training

**Date**: pre-2026-04-25

**Config**: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded
profiles (introvert / extrovert / workaholic), action-only output.

**What we expected**: The trained agent should beat the heuristic baseline on at
least 1-2 of the 3 profiles. The env exposed enough information (meter deltas,
anomaly signals, step history) that a well-trained agent should discover the
profile from rewards.

**What we got**:

| Profile | Heuristic | Trained 500-step | Δ |
|---|---|---|---|
| Introvert Morning | 0.765 | 0.617 | **-0.148** ✗ |
| Extrovert Night Owl | 0.819 | 0.725 | **-0.094** ✗ |
| Workaholic Stoic | 0.761 | 0.539 | **-0.222** ✗ |
**Root cause** (identified in retro):

1. The env was *designed* for meta-learning (3 hidden profiles) but *trained* as
   single-task RL; there was no scaffolding to teach the inference skill.
2. The grader had a `0.30 × meter_balance` term that rewarded random behavior
   (random has high meter variance by chance).
3. Only 3 profiles: memorizable, not learnable as a skill.
4. No explicit "form a model of the user" output, so no gradient pushed the
   model toward inference.

**The pivot**: redesign rhythm_env as a meta-RL environment.

---
## Refactor: meta-RL conversion (2026-04-25)

Big surgical refactor:

- **Continuous profile space** via `sample_profile(seed)`: Dirichlet weights
  + uniform-bounded modifiers. Memorization becomes impossible (sketch below).
- **Belief output** added to the action format: `ACTION_NAME S M W`.
- **`belief_accuracy` reward**: MAE-based, range [-0.5, +0.5], compares the
  emitted belief vector to the ground-truth `profile_to_belief_vector(profile)`.
- **Grader rewrite**: dropped `meter_balance` (it rewarded random play), added
  `adaptation_score` (did the agent get better mid-episode?).
- **Curriculum**: `hint_fraction=0.15` of training samples include the true
  profile vector in the prompt as a warmup.
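A minimal sketch of the continuous profile space. Only `sample_profile(seed)` and the "Dirichlet weights + uniform-bounded modifiers" structure come from the refactor; the meter names, concentration, and modifier bounds below are illustrative stand-ins:

```python
# Sketch of continuous profile sampling -- numpy only. Meter names,
# Dirichlet concentration, and modifier bounds are illustrative guesses.
import numpy as np

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

def sample_profile(seed: int) -> dict:
    rng = np.random.default_rng(seed)
    # Dirichlet draw: per-meter reward weights summing to 1. A continuous
    # space means there is no finite profile set to memorize.
    weights = rng.dirichlet(np.ones(len(METERS)))
    # Uniform-bounded modifiers, e.g. time-of-day sensitivity per meter.
    modifiers = rng.uniform(-0.5, 0.5, size=len(METERS))
    return {
        "weights": dict(zip(METERS, weights)),
        "modifiers": dict(zip(METERS, modifiers)),
    }

profile = sample_profile(seed=42)
print(profile["weights"])  # e.g. {'vitality': 0.31, 'cognition': 0.05, ...}
```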
Pre-training baselines (under the new grader) that a trained agent must beat:

| Condition | Heuristic | Random | Adaptation |
|---|---|---|---|
| discrete-3-profiles | 0.584 | 0.554 | both negative |
| **continuous-in-distribution** | **0.587** | 0.516 | both negative |
| **continuous-OOD** | **0.580** | 0.508 | both negative |

---
## Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)

**Hypothesis**: With the FAST_MODE preset (200 steps, temp 1.0, beta 0.04,
weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least
not regress vs. random, and we'd see whether the meta-RL signal is strong
enough to actually learn from.

**Config**: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5,
LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.
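For reference, roughly how this maps onto TRL-style GRPO settings. This is a sketch, not our training script: field names follow recent `trl` releases and can differ across trl/Unsloth versions, and FAST_MODE is our own preset, not a library flag:

```python
# Approximate mapping of the iter-1 config onto TRL's GRPOConfig.
from trl import GRPOConfig

fast_mode = GRPOConfig(
    output_dir="runs/iter1",
    max_steps=200,
    learning_rate=5e-5,
    num_generations=4,         # completions per prompt in each GRPO group
    beta=0.04,                 # KL penalty toward the reference policy
    temperature=1.0,           # sampling temperature for rollouts
    max_completion_length=32,  # action + belief tokens only
)
# The reward weights [0.3, 0.3, 1.0, 1.0] live in our own aggregation of
# (format_valid, action_legal, env_reward, belief_accuracy), not in
# GRPOConfig itself.
```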
**What we got**:

- final_score 0.224 in-dist, 0.219 OOD: **worse than random** (0.516, 0.508).
- Action distribution: **99.7% `EXERCISE`** (one episode had a single `LEARN`).
- Final beliefs were all "5 5 5", the neutral default.
- belief_accuracy DID climb to +0.43 around steps 100-150 before collapsing.

**Root cause: catastrophic mode collapse**

The training log told the story:

| step | reward_std | meaning |
|---|---|---|
| 1 | 0.144 | Healthy diversity in 4 completions per prompt |
| 50 | 0.056 | Diversity shrinking |
| **100** | **0.000** | **All 4 completions identical → GRPO has zero gradient** |
| 200 | 0.000 | Policy permanently frozen |

`format_valid` returned +1.0 for any valid output. `action_legal` returned
+0.5 for any valid action. Both layers gave **the same constant reward
across all 4 completions in a GRPO group**. GRPO computes advantage as
`reward - group_mean`, so constant layers contribute exactly zero to the
gradient. The only learning signal came from `env_reward` and
`belief_accuracy`.

When the policy drifted toward the shortest-token action (`EXERCISE`) plus
the neutral belief (`5 5 5`), all 4 completions converged to that exact
string: `reward_std → 0`, gradient → 0, policy frozen.
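A minimal numeric illustration of why constant layers vanish under GRPO's group-relative advantage. The rewards are made up; the cancellation is the point:

```python
import numpy as np

# Per-completion rewards for one GRPO group of 4 completions.
format_valid    = np.array([1.0, 1.0, 1.0, 1.0])   # constant across group
action_legal    = np.array([0.5, 0.5, 0.5, 0.5])   # constant across group
env_reward      = np.array([0.2, -0.1, 0.4, 0.0])  # varies
belief_accuracy = np.array([0.1, 0.3, -0.2, 0.0])  # varies

total = (0.3 * format_valid + 0.3 * action_legal
         + 1.0 * env_reward + 1.0 * belief_accuracy)
advantage = total - total.mean()  # GRPO: reward minus group mean

# Constant layers shift every completion equally, so they cancel in the
# subtraction -- the advantage is identical with or without them:
varying_only = env_reward + belief_accuracy
assert np.allclose(advantage, varying_only - varying_only.mean())

# And once all 4 completions are the identical string, every layer is
# constant: the advantage is all zeros and the policy gradient is zero.
identical = np.full(4, total[0])
print(identical - identical.mean())  # [0. 0. 0. 0.]
```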
**Why we missed it**:

- I launched 3 review subagents pre-training. The first (correctness/reward
  bugs) was rejected by the user. That subagent's prompt explicitly asked
  *"could one layer dominate the total reward and drown out the others?"*;
  it would have caught the constant-reward issue.
- My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage)
  with DIFFERENT random actions per kind. It never tested the case where 4
  completions for the same prompt are identical valid actions, which is
  exactly what GRPO sees during sampling. If it had, the test would have
  shown `format_valid_std = 0` and I'd have caught this for free.
- "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's
  R1-Zero paper discusses it). I should have caught it during reward design.

**Lessons banked**:

- Constant-output reward layers must be diagnosed during reward design, not
  discovered through GPU spend.
- Bug-finding subagents should be non-skippable for any RL setup change.
- Smoke tests must include "all-identical-completions" as a case (sketch below).
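A sketch of the missing dry-run case. `compute_group_rewards` is a hypothetical helper (ours returns scores per layer for one prompt's completion group); the assertion is what would have flagged `format_valid` and `action_legal` as dead layers:

```python
import numpy as np

def test_valid_completion_group_has_per_layer_variance():
    # Hypothetical helper: scores one prompt's completion group and
    # returns {layer_name: np.ndarray of shape (num_generations,)}.
    from pipeline_dryrun import compute_group_rewards

    # Four DISTINCT valid completions -- close to what GRPO samples once
    # the output format is learned.
    completions = [
        "EXERCISE 5 5 5",
        "MEDITATE 3 6 4",
        "DEEP_WORK 7 4 5",
        "SOCIALIZE 4 5 6",
    ]
    layers = compute_group_rewards(prompt_seed=0, completions=completions)

    for name, values in layers.items():
        # A layer that is constant across a group of valid outputs cancels
        # in GRPO's (reward - group_mean) and contributes zero gradient.
        # Iter 1's format_valid and action_legal both fail this check.
        assert values.std() > 0, f"{name} is constant across the group"
```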
---

## Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)

**7 fixes applied** (4 from the initial diagnosis + 3 from a re-launched
correctness review subagent that found additional bugs):

1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress
   saturated layers, amplify variable ones)
3. `action_legal` returns 0 for valid (was +0.5): now a pure penalty layer
4. Explicit repetition penalty in `env_reward` (-0.3 if the action would make
   3+ in a row)
5. **CRITICAL-2** (subagent): `_grade_episode`'s `late_quality` normalization
   used [-1, +1], but per-step rewards are clamped to [-3, +3]. Fixed.
6. **MAJOR-3** (subagent): `hint_fraction=0.15` created a train-eval
   distribution shift (eval had no hints). Set to 0.0.
7. **MAJOR-1** (subagent): the seed fallback `i % 50` could create deterministic
   reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.

Plus FAST_MODE bumped: 200 → 400 steps.
**Hypothesis**: With the saturated layers suppressed and an explicit
anti-repetition penalty, the agent should escape single-action collapse and
produce varied behavior. Belief accuracy should continue rising past iter 1's +0.43.

**What we got**:

- final_score: **0.224 in-dist, 0.219 OOD**, *literally identical to iter 1*.
- Action distribution: 54.8% MEDITATE, 45.2% EXERCISE: **no other actions used**.
- Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4), slightly better than
  pure neutral.
- belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✓
- `reward_std` collapsed to 0 at step 200, then **recovered** to 0.06+ after
  the repetition penalty kicked in. Partial escape from collapse.

**Root cause: 2-cycle reward hacking**

The single-action collapse was prevented (good!), but the agent found a new
hack: alternating MEDITATE and EXERCISE. The repetition penalty caught
"3+ same in a row" but missed the M-E-M-E-... 2-cycle.

A deeper issue was exposed: **proxy/goal misalignment**. The agent achieved
high `env_reward` (+1.25 mean by step 400) but a low `final_score` (0.22).
A sample episode's final state: `V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22`.
The agent maxed Vitality / Cognition / Serenity (which the per-step
`profile_weighted_reward` rewards via Dirichlet-sampled weights heavy on
those meters) but ignored Progress (0.0!) and Connection (decayed to 0.22).
The grader weights Progress 0.25 + Connection 0.15, so the agent ignored 40%
of the score.

The fundamental issue: the profile-weighted per-step reward and the grader
optimize different things. The agent did exactly what we trained it to do,
just not what we wanted it to do.
**Why we missed it**:

- The repetition penalty was scoped too narrowly (3-in-a-row) without
  considering N-cycles. A simple "any low-entropy window" check would have
  covered it.
- The proxy/goal misalignment was hidden in plain sight: per-step reward
  shape (profile-weighted) vs. grader shape (progress + connection +
  adaptation). I assumed they'd correlate well enough.
- We didn't run a runtime trace exercise (4 completions × specific prompt →
  group reward → advantage) before submitting iter 2.

**Lessons banked**:

- Anti-repetition checks must include window-entropy tests, not just
  immediate repetition.
- The training reward MUST be aligned with the eval grader, or the agent
  optimizes the wrong objective.
- "Belief output" is useless if it doesn't influence action selection.
  Belief was emitted as a string AFTER the action, so there was no causal
  pathway from belief to action.

---
## Iter 3: Align reward + restructure format (CANCELLED before run: stale code, $0)

**5 architectural fixes** (fixes 3-5 are sketched after this list):

1. **Per-step reward grader-alignment** (`_compute_reward`): add a
   profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`.
   The profile-weighted reward still drives belief inference, but the agent
   now ALWAYS gets penalized for ignoring progress and connection, regardless
   of what the sampled profile weights.
2. **Belief-first output format** (`S M W ACTION_NAME`): in causal LM
   generation, tokens generated EARLIER condition LATER tokens. With the
   belief tokens first, the action is causally conditioned on the belief,
   making the belief functionally useful for action selection. The previous
   order ("ACTION S M W") made belief a post-hoc afterthought.
3. **N-cycle penalty** (`env_reward`): if the last 6 actions have ≤2 unique
   values, -0.4. Closes the M-E alternation loophole AND any longer N-cycle
   the agent might find.
4. **New-action exploration bonus** (`env_reward`): +0.2 reward for taking
   an action that hasn't appeared in the current episode (until 6+ unique
   actions have been tried). Pushes the agent to PROBE varied actions early:
   the canonical meta-RL exploration signal.
5. **Sparse terminal reward** (env `step()` at done=True): add
   `(final_score - 0.5) × 5` to the last step's reward. Direct supervision
   on the actual grader, range [-2.5, +2.5], strong enough to dominate any
   local reward-hack.

Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance),
LoRA rank 8 → 16 (more capacity).
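A sketch of fixes 3-5 as standalone functions. The constants are the ones listed above; representing episode state as a plain list of action names is a simplification (in the real env these live on the episode object):

```python
def n_cycle_penalty(action_history: list[str]) -> float:
    # Fix 3: if the last 6 actions use <= 2 unique values, penalize.
    # Catches A-A-A runs, M-E-M-E 2-cycles, and longer low-entropy loops.
    window = action_history[-6:]
    if len(window) == 6 and len(set(window)) <= 2:
        return -0.4
    return 0.0

def exploration_bonus(action: str, action_history: list[str]) -> float:
    # Fix 4: +0.2 the first time an action appears in the episode,
    # until 6+ unique actions have been tried.
    if action not in action_history and len(set(action_history)) < 6:
        return 0.2
    return 0.0

def terminal_reward(final_score: float) -> float:
    # Fix 5: applied only on the last step (done=True). Maps the grader's
    # [0, 1] final_score onto [-2.5, +2.5] so it dominates local shaping.
    return (final_score - 0.5) * 5

history = ["MEDITATE", "EXERCISE"] * 3
print(n_cycle_penalty(history))             # -0.4: the 2-cycle is caught
print(exploration_bonus("SLEEP", history))  # +0.2: probing a new action
```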
**Hypothesis**: With grader-aligned reward + belief-first format + cycle
penalty + exploration bonus + terminal supervision, the agent should:

- Use ≥5 unique actions per episode (varied behavior)
- Maintain belief_accuracy > +0.30 (not regress)
- Beat random in 2/3 conditions on final_score
- Show positive (or less-negative) adaptation than the baselines

**Result**: Iter 3 was never actually launched. Pre-flight inspection of the
HF Space confirmed the cloned snapshot still had stale code, and a re-launched
external review surfaced 7 deeper bugs (see Round 2 below) that needed to
land before any further GPU spend was justified.

---
## Round 2 fixes (applied for iter 4+, after external bug review)

An external agent surfaced 7 issues that survived all prior reviews. All
landed on the `round2` branch and on the HF Space `main` before iter 4
launched (fixes 2, 5, and 6 are sketched after this list):

1. **Anomalies surfaced in the prompt** (`StepRecord` + `format_observation_prompt`
   + `inference.py`): per-meter anomaly signals were computed each step but
   never made visible to the agent, which was supposed to learn from them.
2. **Belief baseline subtraction** in `belief_accuracy`: the reward is now
   `similarity - constant_baseline_similarity`. The constant `5 5 5` belief
   no longer earns a free +1/step floor.
3. **Profile weight cap 0.80 → 0.45** in `sample_profile`. Forces every
   sampled profile to weight 3+ meters meaningfully (originally to kill the
   "single-meter dominant → SLEEP-spam optimal" exploit).
4. **Scaled-down shaping** in `_compute_reward`: -0.10 / -0.15 / +0.07
   (was -0.30 / -0.40 / +0.20). Reduces the noise floor of shaping vs. the
   real signal layers.
5. **Step-0 belief reward = 0**: the agent has no information at step 0, so
   penalizing belief-vs-target there just punishes initialization.
6. **Belief-action coupling reward** (±0.15): rewards the chosen action
   matching the agent's emitted belief, penalizes it contradicting the
   belief. Forces the belief to be *causally useful*, not decorative.
7. **`grader_bias` moved out of `_compute_reward` into `env_reward`**:
   keeps the per-step env reward pure for inference-signal analysis. The
   progress/connection bias still lands in the GRPO advantage, just via
   the env-reward layer.
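A sketch of how fixes 2, 5, and 6 compose in the belief layer. The similarity function and the `belief_implied_actions` compatibility set are illustrative stand-ins; the baseline subtraction, step-0 zeroing, and ±0.15 coupling are the fixes listed above:

```python
import numpy as np

CONSTANT_BELIEF = np.array([0.5, 0.5, 0.5])  # the "5 5 5" neutral default

def belief_similarity(belief: np.ndarray, target: np.ndarray) -> float:
    # MAE-based similarity in [0, 1]; the real metric may differ.
    return 1.0 - float(np.abs(belief - target).mean())

def belief_reward(step: int, belief: np.ndarray, target: np.ndarray,
                  action: str, belief_implied_actions: set[str]) -> float:
    # Fix 5: no information at step 0, so no reward or penalty there.
    if step == 0:
        return 0.0
    # Fix 2: subtract what the constant belief would have scored, so the
    # lazy "5 5 5" strategy earns ~0 instead of a free floor.
    r = (belief_similarity(belief, target)
         - belief_similarity(CONSTANT_BELIEF, target))
    # Fix 6: +/-0.15 coupling. belief_implied_actions is a hypothetical
    # mapping from the emitted belief to the actions consistent with it.
    r += 0.15 if action in belief_implied_actions else -0.15
    return r
```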
---

## Iter 4: Round 2 fixes, partial run, mistakenly cancelled (2026-04-26, ~$2.10, 235/800 steps)

**Config**: a10g-large, LoRA rank 16, num_generations 8, 800 steps, all
Round 1 + Iter 3 architectural fixes + Round 2 (above).

**Hypothesis**: With anomalies in the prompt, baseline subtraction killing
the belief-spam floor, belief-action coupling forcing causal use of the
belief, and grader_bias keeping env-reward pure, the agent should show
monotonic belief_accuracy growth without hitting a 2-cycle hack.

**What we got** (from the 235-step partial; see `docs/iter4_partial_analysis.txt`):

Working:

- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10 by step 60, from an initial 36+
- No catastrophic mode collapse

Broken, the unsolved core:

- **`belief_accuracy/mean` flat at -0.10 throughout all 235 steps**
- Linear slope: +0.0007 per 100 steps (essentially zero, well under noise)
- The agent emits beliefs SLIGHTLY WORSE than the constant baseline

**Why the run ended at 235**: I cancelled the job based on stale HF API
log output that suggested the run was stuck. The HF UI showed it was
healthy. ~$2.10 wasted. Lesson banked: **trust the live UI over the
`/logs` API endpoint**, which lags severely.
**Root-cause hypothesis** (post-mortem analysis):

The profile cap (0.80 → 0.45) and the baseline subtraction interact
negatively. With weights clamped to ≤0.45, sampled profiles cluster
toward balanced; `profile_to_belief_vector` (whose `work_pref` axis is
30%-weighted on the progress reward weight) consequently lands closer to
[0.5, 0.5, 0.5]. The constant `5 5 5` belief already has high cosine
similarity with that target, so after baseline subtraction there is
almost no headroom for the agent to "win" against it.

**Why we missed it**:

- The Round 2 fixes were treated as independent, but #2 (baseline
  subtraction) and #3 (profile cap) share the same denominator: the
  spread of the belief-target distribution. An analytical check on
  belief-target stddev under the new cap would have caught it before
  spending compute (a sketch of that check follows).
- The `grader_bias` term (#7) was the original justification for
  needing a tighter profile cap (to kill the SLEEP-spam exploit). Once
  grader_bias was in env_reward, the cap could have been reverted.
  We applied both fixes simultaneously.
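The check we should have run, sketched. The rejection-sampled cap and the belief-vector mapping below are simplified stand-ins for `sample_profile` / `profile_to_belief_vector`; the point is measuring target spread around the constant belief before spending compute:

```python
import numpy as np

rng = np.random.default_rng(0)

def capped_dirichlet(cap: float, k: int = 5) -> np.ndarray:
    # Rejection-sample Dirichlet weights until every entry is <= cap.
    while True:
        w = rng.dirichlet(np.ones(k))
        if w.max() <= cap:
            return w

def headroom(cap: float, n: int = 2000) -> float:
    # Mean distance between a simplified 3-axis belief target and the
    # constant 0.5 belief. Stand-in mapping: rescale 3 of the 5 weights
    # (mean weight 0.2 -> center 0.5) and clip to [0, 1].
    targets = np.array([capped_dirichlet(cap)[:3] * 2.5 for _ in range(n)])
    targets = targets.clip(0, 1)
    return float(np.abs(targets - 0.5).mean())

print(f"cap=0.80: mean |target - const| = {headroom(0.80):.3f}")
print(f"cap=0.45: mean |target - const| = {headroom(0.45):.3f}")
# The tighter cap shrinks the target spread, so the constant belief's
# baseline similarity rises and the agent's winnable headroom shrinks.
```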
---

## Iter 5: Identical fixes, smaller config (2026-04-26, ~$2.50, 500 steps)

**Config**: a10g-large, LoRA rank 8, num_generations 4, 500 steps. The same
fix set as iter 4: Round 1 + Iter 3 architectural + Round 2.

**Result**: Worse than the iter 4 partial. 86% SLEEP; the agent never emits a
belief (`format_valid` stuck at +0.5, the action-only score, for the whole
run); `belief_accuracy` flat at -0.10 (the no-belief penalty score);
`reward_std` collapses to 0 twice during training. final_score 0.349 in-dist,
0.331 OOD. The lower capacity (LoRA 8 + num_gen 4) made GRPO too noisy to
maintain the belief format.
---

## The pivot: stop iterating GRPO, look at what we're optimizing

After iter 5, the question wasn't "what's the next reward-shaping fix";
it was "why does no GRPO config beat the heuristic?" Reading the model's
actual reasoning answered it:

> *"Last step's socialize gave V -0.12 (anomaly -0.06, much worse than
> neutral) → high social drain, suggests low S. Morning DEEP_WORK earlier
> gave bonus cognition (+0.04) → high M..."*

The model **was inferring the profile**. The inference just didn't help its
score. The grader rewarded keeping meters healthy (which a heuristic does
well by reflex) but didn't reward knowing the person. So an agent that did
real inference and an agent that played safe both got the same grade.

The fix: add `belief_accuracy` as 20% of the grade. The heuristic emits no
belief and scores 0 on this component, by design. Now the grader measures
the skill we actually want.

Under the v2 grader, the gpt-5.4 teacher (running with our existing
observation prompt) hits **0.617 vs. heuristic 0.449: a +0.168 margin and
30/30 head-to-head wins** on the same seeds.

That made the second realization unavoidable: **Algorithm Distillation is
the right recipe** ([Laskin et al. 2022](https://arxiv.org/abs/2210.14215)),
not GRPO from scratch. Small reasoning models (≤3B) need a teacher to
bootstrap. We had access to a frontier teacher; we'd just been ignoring it.
---

## Final pipeline: SFT-prime via Algorithm Distillation

1. **Stage 1: Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+
   episodes. Each step: `<reasoning>...</reasoning>` + `S M W ACTION_NAME`.
   ~$3 / 30 episodes.
2. **Stage 2: SFT prime.** Qwen 2.5-3B + Unsloth + LoRA r=16 fine-tuned on
   the teacher trajectories (sketch below). ~25 min on a10g-large, ~$2-3.
3. **Optional Stage 3: GRPO refine.** Only if SFT alone misses the bar
   (so far it doesn't).
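A sketch of how Stage 1 rollouts become Stage 2 SFT examples. The record fields and file names are illustrative; the completion uses the belief-first format from iter 3:

```python
import json

def to_sft_example(step_record: dict) -> dict:
    # One teacher step -> one supervised example. The prompt is the same
    # observation prompt the student sees at inference time; the completion
    # is the teacher's reasoning plus its belief-first action.
    return {
        "prompt": step_record["observation_prompt"],
        "completion": (
            f"<reasoning>{step_record['teacher_reasoning']}</reasoning>\n"
            f"{step_record['belief']} {step_record['action']}"  # "S M W ACTION_NAME"
        ),
    }

with open("teacher_rollouts.jsonl") as src, open("sft_dataset.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_sft_example(json.loads(line))) + "\n")
```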
See [`docs/results.md`](results.md) for headline numbers and
[`README.md`](../README.md) for the full pipeline + reproduction instructions.
---

## Spend tracker

| Stage | Cost | Outcome |
|---|---|---|
| Iters 1-2 (GRPO from scratch) | ~$2 | Mode collapses; grader-shape lessons |
| Iters 3-4 (round 2 fixes) | ~$3.60 | Inference happens but the grader doesn't reward it |
| Iter 5 (smaller config) | ~$2.50 | Confirms low capacity makes things worse |
| Algorithm Distillation pipeline | ~$5.50 | Real result, real story |
| **Total budget used** | **~$13.60** of $30 | |

The 5 GRPO-from-scratch attempts weren't waste; they're what taught us the
grader was the wrong shape. Without them we wouldn't have understood why
naive RL was failing, and we'd have skipped straight to a less defensible
fix.
---

## What we'll write up

The story of this submission is the pivot, not the iteration count.
Five rounds of GRPO patches couldn't beat the heuristic because the grader
didn't measure inference. Reading the model's reasoning surfaced the
mismatch. Fixing the grader and switching to Algorithm Distillation got
us a real result. The journey is the writeup.
## OpenEnv Rubric system (refactor complete, post-deadline)

Originally we shipped with a custom `_grade_episode` and an honestly
acknowledged gap. After the submission deadline we returned and did
the proper refactor (see `server/rubrics.py`):

- 6 `Rubric` subclasses, one per scored axis
  (`CrashFreeRubric`, `ProgressRubric`, `ConnectionRubric`,
  `AdaptationRubric`, `EfficiencyRubric`, `BeliefAccuracyRubric`)
- Composed via `openenv.core.rubrics.WeightedSum` with weights summing
  to 1.0 (matching the original 0.15 / 0.20 / 0.10 / 0.25 / 0.10 / 0.20)
- `_grade_episode` now delegates to `make_grade_rubric(self)(None, None)`

Each sub-rubric reads aggregated episode-end env state via a reference
held in `__init__`, the pattern recommended in RFC 004 for
trajectory-summary scoring on top of the per-(action, observation)
Rubric ABC.
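A condensed sketch of the shape, showing two of the six rubrics (the rest follow the same pattern). This assumes the `Rubric` ABC lives alongside `WeightedSum`, that `WeightedSum` takes (rubric, weight) pairs, and that env attributes like `final_meters` exist; treat those as our conventions and stand-ins, not a general OpenEnv reference:

```python
from openenv.core.rubrics import Rubric, WeightedSum

class ProgressRubric(Rubric):
    def __init__(self, env):
        # Hold an env reference; episode-end state is read at call time
        # (the RFC 004 trajectory-summary pattern).
        self.env = env

    def __call__(self, action, observation) -> float:
        # action/observation are unused: this rubric scores aggregated
        # episode-end state, not a single (action, observation) pair.
        return self.env.final_meters["progress"]

class BeliefAccuracyRubric(Rubric):
    def __init__(self, env):
        self.env = env

    def __call__(self, action, observation) -> float:
        return self.env.episode_belief_accuracy  # 0 for belief-free agents

def make_grade_rubric(env):
    # Weights match the hand-rolled grader and sum to 1.0 in the real
    # version, which also composes CrashFree (0.15), Connection (0.10),
    # Adaptation (0.25), and Efficiency (0.10).
    return WeightedSum([
        (ProgressRubric(env), 0.20),
        (BeliefAccuracyRubric(env), 0.20),
    ])
```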
Two new tests in `tests/test_rhythm_env.py` verify that the grader
literally uses `WeightedSum` and that the 6 child rubrics are present
with the expected names (not just functionally equivalent, but actually
using the framework primitive). All 52 tests pass.