
RhythmEnv Training Journey: Iteration Log

A structured record of every training iteration: what we expected, what happened, what broke, why we missed it, and what we changed next.

This doubles as raw material for the hackathon blog post. The "Why we missed it" sections are deliberately honest; judges and future maintainers benefit more from the failure post-mortems than from polished success stories.


Iter 0 (pre-existing): Original v1 single-task training

Date: pre-2026-04-25. Config: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded profiles (introvert / extrovert / workaholic), action-only output.

What we expected: Trained agent should beat the heuristic baseline on at least 1-2 of the 3 profiles. The env exposed enough information (meter deltas, anomaly signals, step history) that a well-trained agent should discover the profile from rewards.

What we got:

Profile              Heuristic  Trained (500 steps)  Δ
Introvert Morning    0.765      0.617                -0.148 ❌
Extrovert Night Owl  0.819      0.725                -0.094 ❌
Workaholic Stoic     0.761      0.539                -0.222 ❌

Root cause (identified in retro):

  1. Env was designed for meta-learning (3 hidden profiles) but trained as single-task RL; no scaffolding to teach the inference skill.
  2. Grader had a 0.30 × meter_balance term that rewarded random behavior (random has high meter variance by chance).
  3. Only 3 profiles → memorizable, not learnable as a skill.
  4. No explicit "form a model of the user" output → no gradient pushing the model toward inference.

The pivot: redesign rhythm_env as a meta-RL environment.


Refactor: meta-RL conversion (2026-04-25)

Big surgical refactor:

  • Continuous profile space via sample_profile(seed): Dirichlet weights plus uniform-bounded modifiers. Memorization impossible (a sketch follows this list).
  • Belief output added to action format: ACTION_NAME S M W.
  • belief_accuracy reward: MAE-based, range [-0.5, +0.5], compares emitted belief vector to ground-truth profile_to_belief_vector(profile).
  • Grader rewrite: dropped meter_balance (it rewarded random behavior), added adaptation_score (rewards getting better mid-episode).
  • Curriculum: hint_fraction=0.15 of training samples include true profile vector in prompt as warmup.
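
To make the memorization claim concrete, here's a minimal sketch of what sample_profile could look like, assuming numpy. The five meter names match the grader axes from this log, but the Dirichlet concentration (all ones) and the modifier bounds (0.8-1.2) are assumptions, not the project's actual values:

```python
# Minimal sketch of the continuous profile sampler described above.
# Dirichlet + uniform-modifier design is from this log; the concentration
# and modifier bounds are assumptions.
import numpy as np

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

def sample_profile(seed: int) -> dict:
    rng = np.random.default_rng(seed)
    # Dirichlet sample = random per-meter reward weights on the simplex:
    # a continuum of profiles, so no finite set of personas to memorize.
    weights = rng.dirichlet(np.ones(len(METERS)))
    # Uniform-bounded behavioral modifiers (e.g. decay multipliers).
    modifiers = rng.uniform(0.8, 1.2, size=len(METERS))
    return {"weights": dict(zip(METERS, weights)),
            "modifiers": dict(zip(METERS, modifiers))}
```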

Pre-training baselines under the new grader (what the trained agent must beat):

Condition                   Heuristic  Random  Adaptation
discrete-3-profiles         0.584      0.554   both negative
continuous-in-distribution  0.587      0.516   both negative
continuous-OOD              0.580      0.508   both negative

Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)

Hypothesis: With the FAST_MODE preset (200 steps, temp 1.0, beta 0.04, weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least not regress vs. random, and we'd see whether the meta-RL signal is strong enough to actually learn.

Config: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5, LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.

What we got:

  • final_score 0.224 in-dist, 0.219 OOD: worse than random (0.516, 0.508).
  • Action distribution: 99.7% EXERCISE (one episode had a single LEARN).
  • Final beliefs all "5 5 5", the neutral default.
  • belief_accuracy DID climb to +0.43 around step 100-150 before collapsing.

Root cause: catastrophic mode collapse

The training log told the story:

step  reward_std  meaning
1     0.144       Healthy diversity in 4 completions per prompt
50    0.056       Diversity shrinking
100   0.000       All 4 completions identical → GRPO has zero gradient
200   0.000       Policy permanently frozen

format_valid returned +1.0 for any valid output. action_legal returned +0.5 for any valid action. Both layers gave the same constant reward across all 4 completions in a GRPO group. GRPO computes advantage as reward - group_mean, so constant layers contribute exactly zero to the gradient. The only learning signal came from env_reward and belief_accuracy.

When the policy drifted toward the shortest-token action (EXERCISE) + neutral belief (5 5 5), all 4 completions converged to that exact string. reward_std → 0, gradient → 0, policy frozen.
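
A numeric illustration of why the constant layers vanish from the GRPO advantage; the per-layer numbers below are made up for illustration, not taken from the training logs:

```python
# GRPO's advantage is reward minus the group mean, so any layer that is
# constant across the group cancels out exactly. Illustrative numbers.
rewards_per_layer = {
    "format_valid": [1.0, 1.0, 1.0, 1.0],   # constant -> cancels
    "action_legal": [0.5, 0.5, 0.5, 0.5],   # constant -> cancels
    "env_reward":   [0.2, -0.1, 0.4, 0.1],  # varies -> carries all the signal
}
totals = [sum(layer[i] for layer in rewards_per_layer.values()) for i in range(4)]
group_mean = sum(totals) / len(totals)
advantages = [r - group_mean for r in totals]
print(advantages)  # [0.05, -0.25, 0.25, -0.05] -- identical to env_reward alone

# Once all 4 completions collapse to the same string, all totals are equal,
# every advantage is 0.0, and the policy gradient is exactly zero.
```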

Why we missed it:

  • I launched 3 review subagents pre-training. The first (correctness/reward bugs) was rejected by the user. That subagent's prompt explicitly asked "could one layer dominate the total reward and drown out the others?"; it would have caught the constant-reward issue.
  • My own pipeline_dryrun.py tested completion KINDS (perfect/good/garbage) with DIFFERENT random actions per kind. It never tested the case where 4 completions for the same prompt are identical valid actions, which is exactly what GRPO sees during sampling. If it had (see the sketch after this list), the test would have shown format_valid_std = 0 and I'd have caught this for free.
  • "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's R1-Zero paper discusses it). I should have caught it during reward design.
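
Here's roughly the dry-run case that was missing, sketched with a hypothetical score_layers helper standing in for the project's reward pipeline:

```python
# Sketch of the missing dry-run case: one prompt, 4 identical valid
# completions -- exactly what GRPO sees at collapse. `score_layers` is a
# hypothetical stand-in returning {layer_name: [r1, r2, r3, r4]}.
import statistics

def dryrun_identical_group(score_layers):
    group = ["EXERCISE 5 5 5"] * 4
    for name, rewards in score_layers(group).items():
        print(f"{name}_std = {statistics.pstdev(rewards):.3f}")
    # In iter 1 this would have printed 0.000 for every layer: zero spread
    # means every advantage (reward - group_mean) is zero, i.e. the
    # frozen-policy failure, caught for free before any GPU spend.
```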

Lessons banked:

  • Constant-output reward layers must be diagnosed during reward design, not discovered through GPU spend.
  • Bug-finding subagents should be non-skippable for any RL setup change.
  • Smoke tests must include "all-identical-completions" as a case.

Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)

7 fixes applied (4 from initial diagnosis + 3 from a re-launched correctness review subagent that found additional bugs):

  1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
  2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress saturated layers, amplify variable ones)
  3. action_legal returns 0 for valid (was +0.5): a pure penalty layer
  4. Explicit repetition penalty in env_reward (-0.3 if the action would make 3+ in a row)
  5. CRITICAL-2 (subagent): _grade_episode late_quality normalization used [-1, +1] but per-step rewards are clamped to [-3, +3]. Fixed.
  6. MAJOR-3 (subagent): hint_fraction=0.15 created a train-eval distribution shift (eval had no hints). Set to 0.0.
  7. MAJOR-1 (subagent): seed fallback i % 50 could create deterministic reward clusters. Hardened to (i * 17) ^ 0xBEEF.

Plus FAST_MODE bumped: 200 → 400 steps.

Hypothesis: With saturated layers suppressed and an explicit anti-repetition penalty, the agent should escape single-action collapse and produce varied behavior. Belief accuracy should continue rising past iter 1's +0.43.

What we got:

  • final_score: 0.224 in-dist, 0.219 OOD, literally identical to iter 1.
  • Action distribution: 54.8% MEDITATE, 45.2% EXERCISE; no other actions used.
  • Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4): slightly better than pure neutral.
  • belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✅
  • reward_std collapsed to 0 at step 200 then recovered to 0.06+ after the repetition penalty kicked in. Partial escape from collapse.

Root cause: 2-cycle reward hacking

The single-action collapse was prevented (good!) but the agent found a new hack: alternating MEDITATE and EXERCISE. The repetition penalty caught "3+ same in a row" but missed the M-E-M-E-... 2-cycle.
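
To see why, here's the 3-in-a-row check in sketch form (the -0.3 value and threshold come from this log; the function name and history representation are assumptions). A strict 2-cycle never trips it:

```python
# Sketch of fix #4's repetition penalty and the gap iter 2 exposed.
def repetition_penalty(history: list[str], action: str) -> float:
    # Fires only when this action would make 3+ identical in a row.
    if len(history) >= 2 and history[-1] == history[-2] == action:
        return -0.3
    return 0.0

# The M-E-M-E-... alternation sails through untouched:
history: list[str] = []
for action in ["MEDITATE", "EXERCISE"] * 4:
    assert repetition_penalty(history, action) == 0.0
    history.append(action)
```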

Deeper issue exposed: proxy/goal misalignment. The agent achieved high env_reward (+1.25 mean by step 400) but low final_score (0.22).

Sample episode final state: V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22.

The agent maxed Vitality / Cognition / Serenity (which the per-step profile_weighted_reward rewards via Dirichlet-sampled weights heavy on those meters), but ignored Progress (0.0!) and Connection (decayed to 0.22). The grader weights Progress 0.25 + Connection 0.15, so the agent ignored 40% of the score.

The fundamental issue: the profile-weighted per-step reward and the grader optimize different things. The agent did exactly what we trained it to do, just not what we wanted it to do.
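
The size of that mismatch is easy to quantify. The profile weights below are illustrative, not an actual sample; the grader shares (Progress 0.25 + Connection 0.15) are from this log:

```python
# How much weight the per-step proxy vs. the grader put on the two meters
# the agent ignored, under an illustrative sampled profile.
profile_w = {"V": 0.40, "C": 0.35, "P": 0.02, "S": 0.20, "Cn": 0.03}

proxy_share_on_ignored  = profile_w["P"] + profile_w["Cn"]   # 0.05
grader_share_on_ignored = 0.25 + 0.15                        # 0.40

# A profile like this makes ignoring Progress/Connection nearly free at
# training time while forfeiting 40% of the eval grade.
print(proxy_share_on_ignored, grader_share_on_ignored)
```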

Why we missed it:

  • The repetition penalty was scoped too narrowly (3-in-a-row) without considering N-cycles. A simple "any low-entropy window" check would have covered it.
  • The proxy/goal misalignment was hidden in plain sight: per-step reward shape (profile-weighted) ≠ grader shape (progress + connection + adaptation). I assumed they'd correlate enough.
  • We didn't have a runtime trace exercise (4 completions × specific prompt → group reward → advantage) before submitting iter 2.

Lessons banked:

  • Anti-repetition checks must include window-entropy tests, not just immediate repetition.
  • The training reward MUST be aligned with the eval grader, or the agent optimizes the wrong objective.
  • "Belief output" is useless if it doesn't influence action selection. Belief was emitted as a string AFTER the action, so there was no causal pathway from belief to action.

Iter 3: Align reward + restructure format (CANCELLED before run due to stale code, $0)

5 architectural fixes:

  1. Per-step reward grader-alignment (_compute_reward): add a profile-INDEPENDENT bias +0.5 × progress_delta + 0.4 × connection_delta. The profile-weighted reward still drives belief inference, but the agent now ALWAYS gets penalized for ignoring progress and connection, regardless of what the sampled profile weights. A sketch of these shaping terms follows the list.

  2. Belief-first output format (S M W ACTION_NAME): in causal LM generation, tokens generated EARLIER condition LATER tokens. With belief tokens first, the action is now causally conditioned on the belief, making the belief functionally useful for action selection. The previous order ("ACTION S M W") made belief a post-hoc afterthought.

  3. N-cycle penalty (env_reward): if the last 6 actions have ≤2 unique values, -0.4. Closes the M-E alternation loophole AND any longer N-cycle the agent might find.

  4. New-action exploration bonus (env_reward): +0.2 reward for taking an action that hasn't appeared in the current episode (until 6+ unique actions tried). Pushes the agent to PROBE varied actions early, the canonical meta-RL exploration signal.

  5. Sparse terminal reward (env step() at done=True): add (final_score - 0.5) × 5 to the last step's reward. Direct supervision on the actual grader, range [-2.5, +2.5], strong enough to dominate any local reward-hack.

Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance), LoRA rank 8 → 16 (more capacity).
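
A consolidated sketch of the shaping terms from fixes 1 and 3-5. Constants come from this log; the function signature and state bookkeeping are assumptions, not the project's actual code:

```python
# Sketch of the iter-3 shaping terms (fixes 1, 3, 4, 5 above).
def shaped_step_reward(base: float, progress_delta: float, connection_delta: float,
                       recent_actions: list[str], episode_actions: set[str],
                       action: str, done: bool = False,
                       final_score: float | None = None) -> float:
    r = base
    # Fix 1: profile-INDEPENDENT grader-alignment bias.
    r += 0.5 * progress_delta + 0.4 * connection_delta
    # Fix 3: N-cycle penalty -- <=2 unique actions over the last 6 steps.
    window = (recent_actions + [action])[-6:]
    if len(window) == 6 and len(set(window)) <= 2:
        r -= 0.4
    # Fix 4: exploration bonus for a first-time action (until 6 uniques tried).
    if action not in episode_actions and len(episode_actions) < 6:
        r += 0.2
    # Fix 5: sparse terminal reward on the real grader, range [-2.5, +2.5].
    if done and final_score is not None:
        r += (final_score - 0.5) * 5
    return r
```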

Hypothesis: With grader-aligned reward + belief-first format + cycle penalty + exploration bonus + terminal supervision, the agent should:

  • Use ≥5 unique actions per episode (varied behavior)
  • Maintain belief_accuracy > +0.30 (don't regress)
  • Beat random in 2/3 conditions on final_score
  • Show positive (or less-negative) adaptation than baselines

Result: Iter 3 was never actually launched. Pre-flight inspection of the HF Space confirmed the cloned snapshot still had stale code, and a re-launched external review surfaced 7 deeper bugs (see Round 2 below) that needed to land before any further GPU spend was justified.


Round 2 fixes (applied for iter 4+, after external bug review)

An external agent surfaced 7 issues that survived all prior reviews. All landed on the round2 branch and on the HF Space main before iter 4 launched:

  1. Anomalies surfaced in the prompt (StepRecord + format_observation_prompt + inference.py): per-meter anomaly signals were computed each step but never made visible to the agent. The agent was supposed to learn from them.
  2. Belief baseline subtraction in belief_accuracy: the reward is now similarity − constant_baseline_similarity, so the constant 5 5 5 belief no longer earns a free +1/step floor (a sketch follows this list).
  3. Profile weight cap 0.80 → 0.45 in sample_profile. Forces every sampled profile to weight 3+ meters meaningfully (originally to kill the "single-meter dominant → SLEEP-spam optimal" exploit).
  4. Scaled-down shaping in _compute_reward: -0.10 / -0.15 / +0.07 (was -0.30 / -0.40 / +0.20). Reduces the noise floor of shaping relative to the real signal layers.
  5. Step-0 belief reward = 0: the agent has no information at step 0, so penalizing belief-vs-target there just punishes initialization.
  6. Belief-action coupling reward (±0.15): rewards the agent if the chosen action matches its emitted belief, penalizes it if it contradicts. Forces the belief to be causally useful, not decorative.
  7. grader_bias moved out of _compute_reward into env_reward: keeps per-step env reward pure for inference-signal analysis. The progress/connection bias still lands in the GRPO advantage, just via the env-reward layer.
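
Fix #2 in sketch form. The baseline-subtraction idea and the neutral "5 5 5" belief come from this log; the similarity function (1 - MAE) and the 0-1 vector encoding are assumptions:

```python
# Sketch of Round 2 fix #2: baseline-subtracted belief reward.
import numpy as np

NEUTRAL = [0.5, 0.5, 0.5]   # the constant "5 5 5" belief

def sim(a, b) -> float:
    # Assumed similarity metric: 1 - mean absolute error.
    return 1.0 - float(np.abs(np.asarray(a, float) - np.asarray(b, float)).mean())

def belief_reward(belief, target) -> float:
    # Subtract what the constant neutral belief would score, so belief
    # spam earns exactly 0 instead of a free per-step floor.
    return sim(belief, target) - sim(NEUTRAL, target)

print(belief_reward(NEUTRAL, [0.7, 0.4, 0.3]))          # 0.0: spam earns nothing
print(belief_reward([0.7, 0.4, 0.3], [0.7, 0.4, 0.3]))  # ~0.17: inference pays
```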

Iter 4: Round 2 fixes, partial run mistakenly cancelled (2026-04-26, ~$2.10, 235/800 steps)

Config: a10g-large, LoRA rank 16, num_generations 8, 800 steps, all Round 1 + Iter 3 architectural fixes + Round 2 (above).

Hypothesis: With anomalies in the prompt, baseline subtraction killing the belief-spam floor, belief-action coupling forcing causal use of belief, and grader_bias keeping env-reward pure, the agent should show monotonic belief_accuracy growth without hitting a 2-cycle hack.

What we got (from the 235-step partial; see docs/iter4_partial_analysis.txt):

Working:

  • Total reward: -3.4 → +0.39 (climbing)
  • format_valid: -1.20 → +0.44 (slow but climbing)
  • env_reward: -2.01 → +0.44 (climbing)
  • grad_norm normalized to ~10 by step 60 from initial 36+
  • No catastrophic mode collapse

Broken (the unsolved core):

  • belief_accuracy/mean flat at -0.10 throughout 235 steps
  • Linear slope: +0.0007 per 100 steps (essentially zero, well under noise)
  • Agent emits beliefs SLIGHTLY WORSE than constant baseline

Why the run ended at 235: I cancelled the job based on stale HF API log output that suggested the run was stuck. The HF UI showed it was healthy. ~$2.10 wasted. Lesson banked: trust the live UI over the /logs API endpoint, which lags severely.

Root-cause hypothesis (post-mortem analysis):

The profile cap (0.80 → 0.45) and the baseline subtraction interact negatively. With weights clamped to ≤0.45, sampled profiles cluster toward balanced; profile_to_belief_vector (whose work_pref axis is 30%-weighted on the progress reward weight) consequently lands closer to [0.5, 0.5, 0.5]. The constant 5 5 5 belief already has high cosine similarity with that target, so after baseline subtraction there is almost no headroom for the agent to "win" against it.

Why we missed it:

  • The Round 2 fixes were treated as independent, but #2 (baseline subtraction) and #3 (profile cap) share the same denominator: the spread of the belief-target distribution. An analytical check on belief-target stddev under the new cap would have caught it before spending compute (a sketch follows this list).
  • The grader_bias term (#7) was the original justification for needing a tighter profile cap (kill the SLEEP-spam exploit). Once grader_bias was in env_reward, the cap could have been reverted. We applied both fixes simultaneously.
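
The check itself is a few lines. This assumes rejection sampling for the cap (the real sample_profile may renormalize instead) and uses the raw Dirichlet weights as a proxy for the belief target:

```python
# Analytical pre-flight check: belief-target spread with vs. without the
# 0.45 cap. Rejection sampling and the raw-weight proxy are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def mean_weight_spread(cap=None, n=5, trials=2000):
    samples = []
    while len(samples) < trials:
        w = rng.dirichlet(np.ones(n))
        if cap is None or w.max() <= cap:
            samples.append(w)
    return np.asarray(samples).std(axis=0).mean()

print(mean_weight_spread())          # uncapped Dirichlet(1,...,1): roughly 0.16
print(mean_weight_spread(cap=0.45))  # capped: visibly tighter around 1/5
# A tight spread means the belief targets sit near neutral, so the
# baseline-subtracted belief reward has almost no headroom to learn.
```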

Iter 5: Identical fixes, smaller config (2026-04-26, ~$2.50, 500 steps)

Config: a10g-large, LoRA rank 8, num_generations 4, 500 steps. Same fix set as iter 4: Round 1 + Iter 3 architectural + Round 2.

Result: Worse than the iter 4 partial. 86% SLEEP; the agent never emits a belief (format_valid stuck at +0.5, i.e. action-only, the whole run); belief_accuracy flat at -0.10 (the no-belief penalty score); reward_std collapses to 0 twice during training. final_score 0.349 in-dist, 0.331 OOD. Lower capacity (LoRA 8 + num_gen 4) made GRPO too noisy to maintain the belief format.


The pivot: stop iterating GRPO, look at what we're optimizing

After iter 5, the question wasn't "what's the next reward shaping fix" but "why does no GRPO config beat the heuristic?" Reading the model's actual reasoning answered it:

"Last step's socialize gave V -0.12 (anomaly -0.06, much worse than neutral); high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus cognition (+0.04) → high M..."

The model was inferring the profile. The inference just didn't help its score. The grader rewarded keeping meters healthy (which a heuristic does well by reflex) but didn't reward knowing the person. So an agent that did real inference and an agent that played safe both got the same grade.

The fix: add belief_accuracy as 20% of the grade. Heuristic emits no belief and scores 0 on this component, by design. Now the grader measures the skill we actually want.

Under the v2 grader, the gpt-5.4 teacher (running with our existing observation prompt) hits 0.617 vs. the heuristic's 0.449: a +0.168 margin, with 30/30 head-to-head wins on the same seeds.

That made the second realization unavoidable: Algorithm Distillation is the right recipe (Laskin et al. 2022), not GRPO from scratch. Small reasoning models (≤3B) need a teacher to bootstrap. We had access to a frontier teacher; we'd just been ignoring it.


Final pipeline: SFT-prime via Algorithm Distillation

  1. Stage 1: Teacher rollouts. gpt-5.4 (Azure AI Foundry) plays 30+ episodes. Each step emits <reasoning>...</reasoning> + S M W ACTION_NAME. ~$3 / 30 episodes.
  2. Stage 2: SFT prime. Qwen 2.5-3B + Unsloth + LoRA r=16, fine-tuned on the teacher trajectories (a data-prep sketch follows this list). ~25 min on a10g-large, ~$2-3.
  3. Stage 3 (optional): GRPO refine. Only if SFT alone misses the bar (so far it doesn't).
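
For Stage 2, the data prep is just flattening teacher rollouts into chat-format SFT examples. The step format comes from Stage 1 above; the JSONL layout and field names here are assumptions, not the project's actual schema:

```python
# Sketch of Stage 2 data prep: teacher rollouts -> SFT examples.
import json

def rollouts_to_sft(rollout_path: str) -> list[dict]:
    examples = []
    with open(rollout_path) as f:
        for line in f:                       # assumed: one JSON step per line
            step = json.loads(line)
            examples.append({"messages": [
                {"role": "user", "content": step["observation_prompt"]},
                {"role": "assistant",
                 # Teacher's reasoning trace + belief-first action line.
                 "content": f"<reasoning>{step['reasoning']}</reasoning>\n"
                            f"{step['belief']} {step['action']}"},
            ]})
    return examples
```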

See docs/results.md for headline numbers and README.md for the full pipeline + reproduce instructions.


Spend tracker

Stage                            Cost     Outcome
Iters 1-2 (GRPO from scratch)    ~$2.00   Mode collapses; grader-shape lessons
Iters 3-4 (Round 2 fixes)        ~$3.60   Inference happens but the grader doesn't reward it
Iter 5 (smaller config)          ~$2.50   Confirms low capacity makes things worse
Algorithm Distillation pipeline  ~$5.50   Real result, real story
Total budget used                ~$13.60  of $30

The 5 GRPO-from-scratch attempts weren't waste; they're what taught us the grader was the wrong shape. Without them we wouldn't have understood why naive RL was failing, and we'd have skipped straight to a less defensible fix.


What we'll write up

The story of this submission is the pivot, not the iteration count. Five rounds of GRPO patches couldn't beat heuristic because the grader didn't measure inference. Reading the model's reasoning surfaced the mismatch. Fixing the grader and switching to Algorithm Distillation got us a real result. The journey is the writeup.

OpenEnv Rubric system (refactor complete, post-deadline)

Originally we ran with a custom _grade_episode and an honestly acknowledged gap. After the submission deadline we returned and did the proper refactor (see server/rubrics.py):

  • 6 Rubric subclasses, one per scored axis (CrashFreeRubric, ProgressRubric, ConnectionRubric, AdaptationRubric, EfficiencyRubric, BeliefAccuracyRubric)
  • Composed via openenv.core.rubrics.WeightedSum with weights summing to 1.0 (matching the original 0.15 / 0.20 / 0.10 / 0.25 / 0.10 / 0.20)
  • _grade_episode now delegates to make_grade_rubric(self)(None, None)

Each sub-rubric reads aggregated episode-end env state via a reference held in __init__, the recommended pattern from RFC 004 for trajectory-summary scoring on top of the per-(action, observation) Rubric ABC. A framework-agnostic stand-in sketch of the pattern follows.
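
The sketch below deliberately avoids guessing openenv's actual signatures (the real classes subclass openenv.core.rubrics.Rubric and compose via WeightedSum; see server/rubrics.py), so treat it as the shape of the pattern, not the API:

```python
# Illustrative stand-ins only -- NOT openenv's API.
class BeliefAccuracyStandIn:
    def __init__(self, env):
        self.env = env                            # episode-end state held by reference
    def __call__(self, action, observation):      # per-(action, observation) ABC shape
        return self.env.mean_belief_accuracy      # assumed attribute name

class WeightedSumStandIn:
    def __init__(self, children: dict):           # {rubric: weight}, summing to 1.0
        assert abs(sum(children.values()) - 1.0) < 1e-9
        self.children = children
    def __call__(self, action, observation):
        # Weighted sum of the child rubrics' scores.
        return sum(w * r(action, observation) for r, w in self.children.items())
```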

Two new tests in tests/test_rhythm_env.py verify that the grader literally uses WeightedSum and that the 6 child rubrics are present with the expected names (not just functionally equivalent β€” actually using the framework primitive). All 52 tests pass.