RhythmEnv Training Journey: Iteration Log
A structured record of every training iteration: what we expected, what happened, what broke, why we missed it, and what we changed next.
This doubles as raw material for the hackathon blog post. The "Why we missed it" sections are deliberately honest; judges and future maintainers benefit from the failure post-mortems more than from polished success stories.
Iter 0 (pre-existing): Original v1 single-task training
Date: pre-2026-04-25. Config: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded profiles (introvert / extrovert / workaholic), action-only output.
What we expected: Trained agent should beat the heuristic baseline on at least 1-2 of the 3 profiles. The env exposed enough information (meter deltas, anomaly signals, step history) that a well-trained agent should discover the profile from rewards.
What we got:
| Profile | Heuristic | Trained 500-step | Δ |
|---|---|---|---|
| Introvert Morning | 0.765 | 0.617 | -0.148 ❌ |
| Extrovert Night Owl | 0.819 | 0.725 | -0.094 ❌ |
| Workaholic Stoic | 0.761 | 0.539 | -0.222 ❌ |
Root cause (identified in retro):
- Env was designed for meta-learning (3 hidden profiles) but trained as single-task RL: no scaffolding to teach the inference skill.
- Grader had a `0.30 × meter_balance` term that rewarded random behavior (random has high meter variance by chance).
- Only 3 profiles: memorizable, not learnable as a skill.
- No explicit "form a model of the user" output, so no gradient pushing the model toward inference.
The pivot: redesign rhythm_env as a meta-RL environment.
Refactor: meta-RL conversion (2026-04-25)
Big surgical refactor:
- Continuous profile space via `sample_profile(seed)`: Dirichlet weights + uniform-bounded modifiers. Memorization impossible.
- Belief output added to the action format: `ACTION_NAME S M W`. `belief_accuracy` reward: MAE-based, range [-0.5, +0.5], compares the emitted belief vector to the ground-truth `profile_to_belief_vector(profile)`.
- Grader rewrite: dropped `meter_balance` (rewarded random), added `adaptation_score` (got better mid-episode).
- Curriculum: `hint_fraction=0.15` of training samples include the true profile vector in the prompt as warmup.
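The continuous sampler is the load-bearing piece of the refactor. A minimal sketch of what `sample_profile` might look like, assuming Dirichlet weights over five meters plus multiplicative modifiers (the meter names, `alpha`, and the modifier range are illustrative guesses, not the repo's actual values):

```python
import random

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

def sample_profile(seed, alpha=1.0, modifier_range=(0.8, 1.2)):
    # Hypothetical sketch, not the repo's exact code: Dirichlet weights via
    # normalized Gamma draws, plus uniform-bounded per-meter modifiers.
    rng = random.Random(seed)
    gammas = [rng.gammavariate(alpha, 1.0) for _ in METERS]
    total = sum(gammas)
    weights = [g / total for g in gammas]          # Dirichlet(alpha, ..., alpha)
    modifiers = [rng.uniform(*modifier_range) for _ in METERS]
    return {"weights": dict(zip(METERS, weights)),
            "modifiers": dict(zip(METERS, modifiers))}
```

Two calls with the same seed reproduce the profile exactly, which is what makes seeded eval comparisons possible, while the continuous space leaves nothing to memorize.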
Pre-training baselines under the new grader (what the trained agent must beat):
| Condition | Heuristic | Random | Adaptation |
|---|---|---|---|
| discrete-3-profiles | 0.584 | 0.554 | both negative |
| continuous-in-distribution | 0.587 | 0.516 | both negative |
| continuous-OOD | 0.580 | 0.508 | both negative |
Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)
Hypothesis: With the FAST_MODE preset (200 steps, temp 1.0, beta 0.04, weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least not regress vs random, and we'd see whether the meta-RL signal is strong enough to actually learn.
Config: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5, LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.
What we got:
- final_score 0.224 in-dist, 0.219 OOD: worse than random (0.516, 0.508).
- Action distribution: 99.7% `EXERCISE` (one episode had a single `LEARN`).
- Final beliefs all "5 5 5", the neutral default.
- belief_accuracy DID climb to +0.43 around steps 100-150 before collapsing.
Root cause: catastrophic mode collapse
The training log told the story:
| step | reward_std | meaning |
|---|---|---|
| 1 | 0.144 | Healthy diversity in 4 completions per prompt |
| 50 | 0.056 | Diversity shrinking |
| 100 | 0.000 | All 4 completions identical; GRPO has zero gradient |
| 200 | 0.000 | Policy permanently frozen |
format_valid returned +1.0 for any valid output. action_legal returned
+0.5 for any valid action. Both layers gave the same constant reward
across all 4 completions in a GRPO group. GRPO computes advantage as
reward - group_mean, so constant layers contribute exactly zero to the
gradient. The only learning signal came from env_reward and
belief_accuracy.
When the policy drifted toward the shortest-token action (EXERCISE) +
neutral belief (5 5 5), all 4 completions converged to that exact string.
reward_std → 0, gradient → 0, policy frozen.
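The mechanics above fit in a few lines. This is a toy stand-in for GRPO's group-relative advantage step, not the actual trainer code, but it shows both failure modes: constant layers cancel out, and identical completions yield zero advantage everywhere.

```python
def grpo_advantages(group_rewards):
    # GRPO advantage: each completion's reward minus its group's mean reward.
    mean = sum(group_rewards) / len(group_rewards)
    return [r - mean for r in group_rewards]

# A constant layer (e.g. format_valid paying +1.0 to every valid completion)
# shifts all rewards in a group equally, so it cancels out of the advantage:
base = [0.2, -0.1, 0.4, 0.1]
shifted = [r + 1.0 for r in base]
assert all(abs(a - b) < 1e-12
           for a, b in zip(grpo_advantages(base), grpo_advantages(shifted)))

# And once all 4 completions are identical, every advantage is exactly zero:
assert grpo_advantages([1.5] * 4) == [0.0] * 4
```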
Why we missed it:
- I launched 3 review subagents pre-training. The first (correctness/reward bugs) was rejected by the user. That subagent's prompt explicitly asked "could one layer dominate the total reward and drown out the others?"; it would have caught the constant-reward issue.
- My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage) with DIFFERENT random actions per kind. It never tested the case where 4 completions for the same prompt are identical valid actions, which is exactly what GRPO sees during sampling. If it had, the test would have shown `format_valid_std = 0` and I'd have caught this for free.
- "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's R1-Zero paper discusses it). I should have caught it during reward design.
Lessons banked:
- Constant-output reward layers must be diagnosed during reward design, not discovered through GPU spend.
- Bug-finding subagents should be non-skippable for any RL setup change.
- Smoke tests must include "all-identical-completions" as a case.
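The third lesson is cheap to implement. A hedged sketch of the missing smoke-test case, with a hypothetical stand-in for the real reward layer (the function names here are illustrative, not the repo's):

```python
import statistics

def format_valid(completion):
    # Hypothetical stand-in for the real layer: +1.0 for the 4-token
    # "ACTION S M W" shape, -1.0 otherwise.
    return 1.0 if len(completion.split()) == 4 else -1.0

def group_reward_std(completions, layer):
    # Population stddev of one reward layer across a GRPO sampling group.
    return statistics.pstdev(layer(c) for c in completions)

def test_identical_completions_have_zero_std():
    # The case the original dry-run never covered: GRPO routinely samples
    # N identical valid completions for one prompt. Any layer with zero
    # std across the group contributes zero advantage.
    group = ["EXERCISE 5 5 5"] * 4
    assert group_reward_std(group, format_valid) == 0.0

test_identical_completions_have_zero_std()
```

Had this existed, the `format_valid_std = 0` symptom would have surfaced in CI instead of after $0.50 of GPU time.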
Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)
7 fixes applied (4 from the initial diagnosis + 3 from a re-launched correctness review subagent that found additional bugs):
- Sampling temperature 1.0 → 1.5 (force diverse rollouts)
- Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress saturated layers, amplify variable ones)
- `action_legal` returns 0 for valid (was +0.5): pure penalty layer
- Explicit repetition penalty in `env_reward` (-0.3 if the action would make 3+ in a row)
- CRITICAL-2 (subagent): `_grade_episode` `late_quality` normalization was using [-1, +1] but per-step rewards are clamped to [-3, +3]. Fixed.
- MAJOR-3 (subagent): `hint_fraction=0.15` created a train-eval distribution shift (eval had no hints). Set to 0.0.
- MAJOR-1 (subagent): seed fallback `i % 50` could create deterministic reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.
Plus FAST_MODE bumped: 200 → 400 steps.
Hypothesis: With saturated layers suppressed and explicit anti-repetition penalty, the agent should escape single-action collapse and produce varied behavior. Belief accuracy should continue rising past iter 1's +0.43.
What we got:
- final_score: 0.224 in-dist, 0.219 OOD, literally identical to iter 1.
- Action distribution: 54.8% MEDITATE, 45.2% EXERCISE; no other actions used.
- Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4), slightly better than pure neutral.
- belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✅
- `reward_std` collapsed to 0 at step 200, then recovered to 0.06+ after the repetition penalty kicked in. Partial escape from collapse.
Root cause: 2-cycle reward hacking
The single-action collapse was prevented (good!) but the agent found a new hack: alternating MEDITATE and EXERCISE. The repetition penalty caught "3+ same in a row" but missed the M-E-M-E-... 2-cycle.
Deeper issue exposed: proxy/goal misalignment. The agent achieved
high env_reward (+1.25 mean by step 400) but low final_score (0.22).
Sample episode final state: V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22.
The agent maxed Vitality / Cognition / Serenity (which the per-step
profile_weighted_reward rewards via Dirichlet-sampled weights heavy on
those meters), but ignored Progress (0.0!) and Connection (decayed to 0.22).
The grader weights Progress 0.25 + Connection 0.15, so the agent ignored 40% of
the score.
The fundamental issue: profile-weighted per-step reward and the grader optimize different things. The agent did exactly what we trained it to do, just not what we wanted it to do.
Why we missed it:
- The repetition penalty was scoped too narrowly (3-in-a-row) without considering N-cycles. A simple "any low-entropy window" check would have covered it.
- The proxy/goal misalignment was hidden in plain sight: per-step reward shape (profile-weighted) ≠ grader shape (progress + connection + adaptation). I assumed they'd correlate enough.
- We didn't have a runtime trace exercise (4 completions × specific prompt → group reward → advantage) before submitting iter 2.
Lessons banked:
- Anti-repetition checks must include window-entropy tests, not just immediate repetition.
- The training reward MUST be aligned with the eval grader, or the agent optimizes the wrong objective.
- "Belief output" is useless if it doesn't influence action selection. Belief was emitted as a string AFTER the action, so there was no causal pathway from belief to action.
Iter 3: Align reward + restructure format (CANCELLED before run: stale code, $0)
5 architectural fixes:
1. Per-step reward grader-alignment (`_compute_reward`): add a profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`. The profile-weighted reward still drives belief inference, but the agent now ALWAYS gets penalized for ignoring progress and connection, regardless of the sampled profile weights.
2. Belief-first output format (`S M W ACTION_NAME`): in causal LM generation, tokens generated EARLIER condition LATER tokens. With belief tokens first, the action is causally conditioned on the belief, making the belief functionally useful for action selection. The previous order (`ACTION S M W`) made belief a post-hoc afterthought.
3. N-cycle penalty (`env_reward`): if the last 6 actions have ≤2 unique values, -0.4. Closes the M-E alternation loophole AND any longer N-cycle the agent might find.
4. New-action exploration bonus (`env_reward`): +0.2 reward for taking an action that hasn't appeared in the current episode (until 6+ unique actions have been tried). Pushes the agent to PROBE varied actions early, the canonical meta-RL exploration signal.
5. Sparse terminal reward (env `step()` at done=True): add `(final_score - 0.5) × 5` to the last step's reward. Direct supervision on the actual grader, range [-2.5, +2.5], strong enough to dominate any local reward-hack.
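Taken together, fixes 3-5 can be sketched as a single shaping helper. The constants come from the log above; the function boundaries and names are illustrative, not the repo's exact code:

```python
def shaping_terms(action_history, final_score=None):
    # Illustrative combination of iter-3 fixes 3-5 (constants from the log).
    bonus = 0.0
    # 3. N-cycle penalty: last 6 actions drawn from <=2 unique values -> -0.4.
    #    Catches MEDITATE/EXERCISE alternation, not just 3-in-a-row repeats.
    if len(action_history) >= 6 and len(set(action_history[-6:])) <= 2:
        bonus -= 0.4
    # 4. Exploration bonus: +0.2 for a first-in-episode action, active until
    #    6+ unique actions have been tried.
    if (len(set(action_history)) < 6
            and action_history[-1] not in action_history[:-1]):
        bonus += 0.2
    # 5. Sparse terminal reward on the final step: (final_score - 0.5) * 5,
    #    range [-2.5, +2.5] since final_score lies in [0, 1].
    if final_score is not None:
        bonus += (final_score - 0.5) * 5.0
    return bonus

# The M-E 2-cycle that hacked iter 2 is now explicitly penalized:
assert shaping_terms(["MEDITATE", "EXERCISE"] * 3) == -0.4
```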
Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance), LoRA rank 8 → 16 (more capacity).
Hypothesis: With grader-aligned reward + belief-first format + cycle penalty + exploration bonus + terminal supervision, the agent should:
- Use ≥5 unique actions per episode (varied behavior)
- Maintain belief_accuracy > +0.30 (don't regress)
- Beat random in 2/3 conditions on final_score
- Show positive (or less-negative) adaptation than baselines
Result: Iter 3 was never actually launched. Pre-flight inspection of the HF Space confirmed the cloned snapshot still had stale code, and a re-launched external review surfaced 7 deeper bugs (see Round 2 below) that needed to land before any further GPU spend was justified.
Round 2 fixes (applied for iter 4+, after external bug review)
External agent surfaced 7 issues that survived all prior reviews. All landed
on round2 branch and on the HF Space main before iter 4 launched:
1. Anomalies surfaced in the prompt (`StepRecord` + `format_observation_prompt` in inference.py): per-meter anomaly signals were computed each step but never made visible to the agent, which was supposed to learn from them.
2. Belief baseline subtraction in `belief_accuracy`: reward is now `similarity - constant_baseline_similarity`. The constant `5 5 5` belief no longer earns a free +1/step floor.
3. Profile weight cap 0.80 → 0.45 in `sample_profile`. Forces every sampled profile to weight 3+ meters meaningfully (originally to kill the "single-meter dominant → SLEEP-spam optimal" exploit).
4. Scaled-down shaping in `_compute_reward`: -0.10 / -0.15 / +0.07 (was -0.30 / -0.40 / +0.20). Reduces the noise floor of shaping relative to the real signal layers.
5. Step-0 belief reward = 0: the agent has no information at step 0, so penalizing belief-vs-target there just punishes initialization.
6. Belief-action coupling reward (±0.15): rewards the chosen action matching the agent's emitted belief, penalizes it contradicting the belief. Forces the belief to be causally useful, not decorative.
7. `grader_bias` moved out of `_compute_reward` into `env_reward`: keeps the per-step env reward pure for inference-signal analysis. The progress/connection bias still lands in the GRPO advantage, just via the env-reward layer.
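Fix #2 is easiest to see in code. A sketch of the baseline-subtracted belief reward; cosine similarity is an assumption taken from the iter-4 post-mortem below, and the real implementation may differ:

```python
def _norm(v):
    return sum(x * x for x in v) ** 0.5

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (_norm(a) * _norm(b))

NEUTRAL = [0.5, 0.5, 0.5]  # the constant "5 5 5" belief on a 0-1 scale

def belief_accuracy(emitted, target):
    # Baseline subtraction: the neutral belief scores exactly 0 by
    # construction, instead of collecting a free positive floor every step.
    # Beating the baseline requires a belief genuinely closer to the target.
    return cosine(emitted, target) - cosine(NEUTRAL, target)
```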
Iter 4: Round 2 fixes; partial run, mistakenly cancelled (2026-04-26, ~$2.10, 235/800 steps)
Config: a10g-large, LoRA rank 16, num_generations 8, 800 steps, all Round 1 + Iter 3 architectural fixes + Round 2 (above).
Hypothesis: With anomalies in the prompt, baseline subtraction killing the belief-spam floor, belief-action coupling forcing causal use of belief, and grader_bias keeping env-reward pure, the agent should show monotonic belief_accuracy growth without hitting a 2-cycle hack.
What we got (from the 235-step partial; see docs/iter4_partial_analysis.txt):
Working:
- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10 by step 60 from initial 36+
- No catastrophic mode collapse
Broken (the unsolved core):
- `belief_accuracy/mean` flat at -0.10 throughout the 235 steps
- Linear slope: +0.0007 per 100 steps (essentially zero, well under noise)
- Agent emits beliefs SLIGHTLY WORSE than the constant baseline
Why the run ended at 235: I cancelled the job based on stale HF API
log output that suggested the run was stuck. The HF UI showed it was
healthy. ~$2.10 wasted. Lesson banked: trust the live UI over the
/logs API endpoint, which lags severely.
Root-cause hypothesis (post-mortem analysis):
The profile cap (0.80 → 0.45) and the baseline subtraction interact
negatively. With weights clamped to ≤0.45, sampled profiles cluster
toward balanced; profile_to_belief_vector (whose work_pref axis is
30%-weighted on the progress reward weight) consequently lands closer to
[0.5, 0.5, 0.5]. The constant 5 5 5 belief already has high cosine
similarity with that target, so after baseline subtraction there is
almost no headroom for the agent to "win" against it.
Why we missed it:
- The Round 2 fixes were treated as independent, but #2 (baseline subtraction) and #3 (profile cap) share the same denominator: the spread of the belief-target distribution. An analytical check on belief-target stddev under the new cap would have caught it before spending compute.
- The `grader_bias` term (#7) addressed the same SLEEP-spam exploit that was the original justification for the tighter profile cap. Once grader_bias landed in env_reward, the cap could have been reverted. We applied both fixes simultaneously.
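That pre-flight check is cheap to write. A hedged sketch under stated assumptions: rejection sampling imposes the cap, and the spread of the dominant meter weight serves as a stand-in for belief-target spread (the real `profile_to_belief_vector` mapping is not reproduced here):

```python
import random
import statistics

def capped_dirichlet(rng, n=5, alpha=1.0, cap=1.0):
    # Rejection-sample flat-Dirichlet weights with every component <= cap.
    while True:
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
        total = sum(gammas)
        weights = [g / total for g in gammas]
        if max(weights) <= cap:
            return weights

def weight_spread(cap, trials=3000, seed=0):
    # Stddev of the dominant meter weight across sampled profiles: a cheap
    # proxy for how far belief targets can sit from the neutral belief.
    rng = random.Random(seed)
    return statistics.pstdev(max(capped_dirichlet(rng, cap=cap))
                             for _ in range(trials))

# Tightening the cap from 0.80 to 0.45 visibly compresses the profile
# distribution toward "balanced" -- the headroom belief_accuracy lost:
assert weight_spread(0.45) < weight_spread(0.80)
```

Running a check like this before iter 4 would have flagged the shrinking headroom for roughly zero cost.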
Iter 5: Identical fixes, smaller config (2026-04-26, ~$2.50, 500 steps)
Config: a10g-large, LoRA rank 8, num_generations 4, 500 steps. Same fix set as iter 4: Round 1 + Iter 3 architectural + Round 2.
Result: Worse than iter 4 partial. 86% SLEEP, agent never emits belief
(format_valid stuck at +0.5 = action-only the whole run), belief_accuracy
flat at -0.10 (the no-belief penalty score), reward_std collapses to 0
twice during training. final_score 0.349 in-dist, 0.331 OOD. Lower capacity
(LoRA 8 + num_gen 4) made GRPO too noisy to maintain the belief format.
The pivot: stop iterating GRPO, look at what we're optimizing
After iter 5, the question wasn't "what's the next reward-shaping fix"; it was "why does no GRPO config beat the heuristic?" Reading the model's actual reasoning answered it:
"Last step's socialize gave V -0.12 (anomaly -0.06, much worse than neutral) → high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus cognition (+0.04) → high M..."
The model was inferring the profile. The inference just didn't help its score. The grader rewarded keeping meters healthy (which a heuristic does well by reflex) but didn't reward knowing the person. So an agent that did real inference and an agent that played safe both got the same grade.
The fix: add belief_accuracy as 20% of the grade. Heuristic emits no
belief and scores 0 on this component, by design. Now the grader measures
the skill we actually want.
Under the v2 grader, the gpt-5.4 teacher (running with our existing observation prompt) hits 0.617 vs heuristic 0.449, a +0.168 margin with 30/30 head-to-head wins on the same seeds.
That made the second realization unavoidable: Algorithm Distillation is the right recipe (Laskin et al. 2022), not GRPO from scratch. Small reasoning models (β€3B) need a teacher to bootstrap. We had access to a frontier teacher; we'd just been ignoring it.
Final pipeline: SFT-prime via Algorithm Distillation
- Stage 1, Teacher rollouts: gpt-5.4 (Azure AI Foundry) plays 30+ episodes. Each step emits `<reasoning>...</reasoning>` + `S M W ACTION_NAME`. ~$3 / 30 episodes.
- Stage 2, SFT prime: Qwen 2.5-3B + Unsloth + LoRA r=16 fine-tuned on teacher trajectories. ~25 min on a10g-large, ~$2-3.
- Stage 3 (optional), GRPO refine: only if SFT alone misses the bar (so far it doesn't).
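Stage 1 output feeds Stage 2 directly. A sketch of the trajectory-to-SFT conversion; the field names and chat layout are assumptions, and only the completion format (`<reasoning>...</reasoning>` followed by `S M W ACTION_NAME`) comes from the pipeline description above:

```python
def to_sft_example(observation_prompt, reasoning, belief, action):
    # Hypothetical converter: one teacher step -> one chat-format SFT sample.
    # The assistant target is the exact string the student must learn to emit.
    completion = f"<reasoning>{reasoning}</reasoning>\n{belief} {action}"
    return {"messages": [
        {"role": "user", "content": observation_prompt},
        {"role": "assistant", "content": completion},
    ]}
```

Each teacher step becomes one sample, so the student learns belief-first output conditioned on the same observation prompt the teacher saw.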
See docs/results.md for headline numbers and
README.md for the full pipeline + reproduce instructions.
Spend tracker
| Stage | Cost | Outcome |
|---|---|---|
| Iters 1-2 (GRPO from scratch) | ~$2 | Mode collapses; grader-shape lessons |
| Iters 3-4 (round 2 fixes) | ~$3.60 | Inference happens but grader doesn't reward it |
| Iter 5 (smaller config) | ~$2.50 | Confirms low capacity makes things worse |
| Algorithm Distillation pipeline | ~$5.50 | Real result, real story |
| Total budget used | ~$13.60 of $30 | |
The 5 GRPO-from-scratch attempts weren't waste β they're what taught us the grader was the wrong shape. Without them we wouldn't have understood why naive RL was failing, and we'd have skipped straight to a less defensible fix.
What we'll write up
The story of this submission is the pivot, not the iteration count. Five rounds of GRPO patches couldn't beat heuristic because the grader didn't measure inference. Reading the model's reasoning surfaced the mismatch. Fixing the grader and switching to Algorithm Distillation got us a real result. The journey is the writeup.
OpenEnv Rubric system (refactor complete, post-deadline)
Originally we ran with a custom _grade_episode and an honestly
acknowledged gap. After the submission deadline we returned and did
the proper refactor (see server/rubrics.py):
- 6 `Rubric` subclasses, one per scored axis (`CrashFreeRubric`, `ProgressRubric`, `ConnectionRubric`, `AdaptationRubric`, `EfficiencyRubric`, `BeliefAccuracyRubric`)
- Composed via `openenv.core.rubrics.WeightedSum` with weights summing to 1.0 (matching the original 0.15 / 0.20 / 0.10 / 0.25 / 0.10 / 0.20)
- `_grade_episode` now delegates to `make_grade_rubric(self)(None, None)`
Each sub-rubric reads aggregated episode-end env state via a reference
held in __init__ β the recommended pattern from RFC 004 for
trajectory-summary scoring on top of the per-(action, observation)
Rubric ABC.
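For readers unfamiliar with the pattern, here is a self-contained toy version of the weighted-sum composition. This is NOT the openenv API, just the shape of the idea: each child rubric reads episode-end state through a reference held at construction time, and the parent mixes their scores with weights that sum to 1.0.

```python
class Rubric:
    """Toy base class standing in for the framework's Rubric ABC."""
    def score(self):
        raise NotImplementedError

class MeterRubric(Rubric):
    # Scores one episode-end value through a reference held in __init__,
    # mirroring the trajectory-summary pattern described above.
    def __init__(self, state, key):
        self.state, self.key = state, key
    def score(self):
        return self.state[self.key]

class WeightedSum(Rubric):
    def __init__(self, children):  # children: [(weight, rubric), ...]
        assert abs(sum(w for w, _ in children) - 1.0) < 1e-9
        self.children = children
    def score(self):
        return sum(w * r.score() for w, r in self.children)

# State is shared by reference, so rubrics built at episode start still see
# the final values when score() runs at episode end:
state = {"progress": 0.0, "connection": 0.0}
grade = WeightedSum([(0.625, MeterRubric(state, "progress")),
                     (0.375, MeterRubric(state, "connection"))])
state.update(progress=0.8, connection=0.4)
```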
Two new tests in tests/test_rhythm_env.py verify that the grader
literally uses WeightedSum and that the 6 child rubrics are present
with the expected names (not just functionally equivalent β actually
using the framework primitive). All 52 tests pass.