# RhythmEnv Training Journey: Iteration Log

A structured record of every training iteration: what we expected, what
happened, what broke, why we missed it, and what we changed next.

This doubles as raw material for the hackathon blog post. The "Why we missed
it" sections are deliberately honest; judges and future maintainers benefit
from the failure post-mortems more than from polished success stories.

---
## Iter 0 (pre-existing): Original v1 single-task training

**Date**: pre-2026-04-25

**Config**: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded
profiles (introvert / extrovert / workaholic), action-only output.

**What we expected**: The trained agent should beat the heuristic baseline on at
least 1-2 of the 3 profiles. The env exposed enough information (meter deltas,
anomaly signals, step history) that a well-trained agent should discover the
profile from rewards.

**What we got**:

| Profile | Heuristic | Trained 500-step | Δ |
|---|---|---|---|
| Introvert Morning | 0.765 | 0.617 | **-0.148** ✗ |
| Extrovert Night Owl | 0.819 | 0.725 | **-0.094** ✗ |
| Workaholic Stoic | 0.761 | 0.539 | **-0.222** ✗ |
**Root cause** (identified in retro):

1. The env was *designed* for meta-learning (3 hidden profiles) but *trained* as
   single-task RL; there was no scaffolding to teach the inference skill.
2. The grader had a `0.30 × meter_balance` term that rewarded random behavior
   (random has high meter variance by chance).
3. Only 3 profiles: memorizable, not learnable as a skill.
4. No explicit "form a model of the user" output, so no gradient pushed the
   model toward inference.

**The pivot**: redesign rhythm_env as a meta-RL environment.

---
## Refactor: meta-RL conversion (2026-04-25)

Big surgical refactor:

- **Continuous profile space** via `sample_profile(seed)`: Dirichlet weights
  + uniform-bounded modifiers. Memorization becomes impossible (sketch below).
- **Belief output** added to the action format: `ACTION_NAME S M W`.
- **`belief_accuracy` reward**: MAE-based, range [-0.5, +0.5], compares the
  emitted belief vector to the ground-truth `profile_to_belief_vector(profile)`.
- **Grader rewrite**: dropped `meter_balance` (it rewarded random play), added
  `adaptation_score` (did the agent get better mid-episode?).
- **Curriculum**: `hint_fraction=0.15` of training samples include the true
  profile vector in the prompt as a warmup.
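A minimal sketch of the continuous profile space. Only `sample_profile(seed)` and the "Dirichlet weights + uniform-bounded modifiers" structure come from the refactor; the meter names, concentration, and modifier bounds below are illustrative stand-ins:

```python
# Sketch of continuous profile sampling -- numpy only. Meter names,
# Dirichlet concentration, and modifier bounds are illustrative guesses.
import numpy as np

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

def sample_profile(seed: int) -> dict:
    rng = np.random.default_rng(seed)
    # Dirichlet draw: per-meter reward weights summing to 1. A continuous
    # space means there is no finite profile set to memorize.
    weights = rng.dirichlet(np.ones(len(METERS)))
    # Uniform-bounded modifiers, e.g. time-of-day sensitivity per meter.
    modifiers = rng.uniform(-0.5, 0.5, size=len(METERS))
    return {
        "weights": dict(zip(METERS, weights)),
        "modifiers": dict(zip(METERS, modifiers)),
    }

profile = sample_profile(seed=42)
print(profile["weights"])  # e.g. {'vitality': 0.31, 'cognition': 0.05, ...}
```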
Pre-training baselines (under the new grader) that a trained agent must beat:

| Condition | Heuristic | Random | Adaptation |
|---|---|---|---|
| discrete-3-profiles | 0.584 | 0.554 | both negative |
| **continuous-in-distribution** | **0.587** | 0.516 | both negative |
| **continuous-OOD** | **0.580** | 0.508 | both negative |

---
## Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)

**Hypothesis**: With the FAST_MODE preset (200 steps, temp 1.0, beta 0.04,
weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least
not regress vs. random, and we'd see whether the meta-RL signal is strong
enough to actually learn from.

**Config**: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5,
LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.
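For reference, roughly how this maps onto TRL-style GRPO settings. This is a sketch, not our training script: field names follow recent `trl` releases and can differ across trl/Unsloth versions, and FAST_MODE is our own preset, not a library flag:

```python
# Approximate mapping of the iter-1 config onto TRL's GRPOConfig.
from trl import GRPOConfig

fast_mode = GRPOConfig(
    output_dir="runs/iter1",
    max_steps=200,
    learning_rate=5e-5,
    num_generations=4,         # completions per prompt in each GRPO group
    beta=0.04,                 # KL penalty toward the reference policy
    temperature=1.0,           # sampling temperature for rollouts
    max_completion_length=32,  # action + belief tokens only
)
# The reward weights [0.3, 0.3, 1.0, 1.0] live in our own aggregation of
# (format_valid, action_legal, env_reward, belief_accuracy), not in
# GRPOConfig itself.
```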
**What we got**:

- final_score 0.224 in-dist, 0.219 OOD: **worse than random** (0.516, 0.508).
- Action distribution: **99.7% `EXERCISE`** (one episode had a single `LEARN`).
- Final beliefs were all "5 5 5", the neutral default.
- belief_accuracy DID climb to +0.43 around steps 100-150 before collapsing.

**Root cause: catastrophic mode collapse**

The training log told the story:

| step | reward_std | meaning |
|---|---|---|
| 1 | 0.144 | Healthy diversity in 4 completions per prompt |
| 50 | 0.056 | Diversity shrinking |
| **100** | **0.000** | **All 4 completions identical → GRPO has zero gradient** |
| 200 | 0.000 | Policy permanently frozen |

`format_valid` returned +1.0 for any valid output. `action_legal` returned
+0.5 for any valid action. Both layers gave **the same constant reward
across all 4 completions in a GRPO group**. GRPO computes advantage as
`reward - group_mean`, so constant layers contribute exactly zero to the
gradient. The only learning signal came from `env_reward` and
`belief_accuracy`.

When the policy drifted toward the shortest-token action (`EXERCISE`) plus
the neutral belief (`5 5 5`), all 4 completions converged to that exact
string: `reward_std → 0`, gradient → 0, policy frozen.
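A minimal numeric illustration of why constant layers vanish under GRPO's group-relative advantage. The rewards are made up; the cancellation is the point:

```python
import numpy as np

# Per-completion rewards for one GRPO group of 4 completions.
format_valid    = np.array([1.0, 1.0, 1.0, 1.0])   # constant across group
action_legal    = np.array([0.5, 0.5, 0.5, 0.5])   # constant across group
env_reward      = np.array([0.2, -0.1, 0.4, 0.0])  # varies
belief_accuracy = np.array([0.1, 0.3, -0.2, 0.0])  # varies

total = (0.3 * format_valid + 0.3 * action_legal
         + 1.0 * env_reward + 1.0 * belief_accuracy)
advantage = total - total.mean()  # GRPO: reward minus group mean

# Constant layers shift every completion equally, so they cancel in the
# subtraction -- the advantage is identical with or without them:
varying_only = env_reward + belief_accuracy
assert np.allclose(advantage, varying_only - varying_only.mean())

# And once all 4 completions are the identical string, every layer is
# constant: the advantage is all zeros and the policy gradient is zero.
identical = np.full(4, total[0])
print(identical - identical.mean())  # [0. 0. 0. 0.]
```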
**Why we missed it**:

- I launched 3 review subagents pre-training. The first (correctness/reward
  bugs) was rejected by the user. That subagent's prompt explicitly asked
  *"could one layer dominate the total reward and drown out the others?"*;
  it would have caught the constant-reward issue.
- My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage)
  with DIFFERENT random actions per kind. It never tested the case where 4
  completions for the same prompt are identical valid actions, which is
  exactly what GRPO sees during sampling. If it had, the test would have
  shown `format_valid_std = 0` and I'd have caught this for free.
- "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's
  R1-Zero paper discusses it). I should have caught it during reward design.

**Lessons banked**:

- Constant-output reward layers must be diagnosed during reward design, not
  discovered through GPU spend.
- Bug-finding subagents should be non-skippable for any RL setup change.
- Smoke tests must include "all-identical-completions" as a case (sketch below).
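A sketch of the missing dry-run case. `compute_group_rewards` is a hypothetical helper (ours returns scores per layer for one prompt's completion group); the assertion is what would have flagged `format_valid` and `action_legal` as dead layers:

```python
import numpy as np

def test_valid_completion_group_has_per_layer_variance():
    # Hypothetical helper: scores one prompt's completion group and
    # returns {layer_name: np.ndarray of shape (num_generations,)}.
    from pipeline_dryrun import compute_group_rewards

    # Four DISTINCT valid completions -- close to what GRPO samples once
    # the output format is learned.
    completions = [
        "EXERCISE 5 5 5",
        "MEDITATE 3 6 4",
        "DEEP_WORK 7 4 5",
        "SOCIALIZE 4 5 6",
    ]
    layers = compute_group_rewards(prompt_seed=0, completions=completions)

    for name, values in layers.items():
        # A layer that is constant across a group of valid outputs cancels
        # in GRPO's (reward - group_mean) and contributes zero gradient.
        # Iter 1's format_valid and action_legal both fail this check.
        assert values.std() > 0, f"{name} is constant across the group"
```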
---

## Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)

**7 fixes applied** (4 from the initial diagnosis + 3 from a re-launched
correctness review subagent that found additional bugs):

1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress
   saturated layers, amplify variable ones)
3. `action_legal` returns 0 for valid (was +0.5): now a pure penalty layer
4. Explicit repetition penalty in `env_reward` (-0.3 if the action would make
   3+ in a row)
5. **CRITICAL-2** (subagent): `_grade_episode`'s `late_quality` normalization
   used [-1, +1], but per-step rewards are clamped to [-3, +3]. Fixed.
6. **MAJOR-3** (subagent): `hint_fraction=0.15` created a train-eval
   distribution shift (eval had no hints). Set to 0.0.
7. **MAJOR-1** (subagent): the seed fallback `i % 50` could create deterministic
   reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.

Plus FAST_MODE bumped: 200 → 400 steps.
**Hypothesis**: With the saturated layers suppressed and an explicit
anti-repetition penalty, the agent should escape single-action collapse and
produce varied behavior. Belief accuracy should continue rising past iter 1's +0.43.

**What we got**:

- final_score: **0.224 in-dist, 0.219 OOD**, *literally identical to iter 1*.
- Action distribution: 54.8% MEDITATE, 45.2% EXERCISE: **no other actions used**.
- Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4), slightly better than
  pure neutral.
- belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✓
- `reward_std` collapsed to 0 at step 200, then **recovered** to 0.06+ after
  the repetition penalty kicked in. Partial escape from collapse.

**Root cause: 2-cycle reward hacking**

The single-action collapse was prevented (good!), but the agent found a new
hack: alternating MEDITATE and EXERCISE. The repetition penalty caught
"3+ same in a row" but missed the M-E-M-E-... 2-cycle.

A deeper issue was exposed: **proxy/goal misalignment**. The agent achieved
high `env_reward` (+1.25 mean by step 400) but a low `final_score` (0.22).
A sample episode's final state: `V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22`.
The agent maxed Vitality / Cognition / Serenity (which the per-step
`profile_weighted_reward` rewards via Dirichlet-sampled weights heavy on
those meters) but ignored Progress (0.0!) and Connection (decayed to 0.22).
The grader weights Progress 0.25 + Connection 0.15, so the agent ignored 40%
of the score.

The fundamental issue: the profile-weighted per-step reward and the grader
optimize different things. The agent did exactly what we trained it to do,
just not what we wanted it to do.
**Why we missed it**:

- The repetition penalty was scoped too narrowly (3-in-a-row) without
  considering N-cycles. A simple "any low-entropy window" check would have
  covered it.
- The proxy/goal misalignment was hidden in plain sight: per-step reward
  shape (profile-weighted) vs. grader shape (progress + connection +
  adaptation). I assumed they'd correlate well enough.
- We didn't run a runtime trace exercise (4 completions × specific prompt →
  group reward → advantage) before submitting iter 2.

**Lessons banked**:

- Anti-repetition checks must include window-entropy tests, not just
  immediate repetition.
- The training reward MUST be aligned with the eval grader, or the agent
  optimizes the wrong objective.
- "Belief output" is useless if it doesn't influence action selection.
  Belief was emitted as a string AFTER the action, so there was no causal
  pathway from belief to action.

---
## Iter 3: Align reward + restructure format (CANCELLED before run: stale code, $0)

**5 architectural fixes** (fixes 3-5 are sketched after this list):

1. **Per-step reward grader-alignment** (`_compute_reward`): add a
   profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`.
   The profile-weighted reward still drives belief inference, but the agent
   now ALWAYS gets penalized for ignoring progress and connection, regardless
   of what the sampled profile weights.
2. **Belief-first output format** (`S M W ACTION_NAME`): in causal LM
   generation, tokens generated EARLIER condition LATER tokens. With the
   belief tokens first, the action is causally conditioned on the belief,
   making the belief functionally useful for action selection. The previous
   order ("ACTION S M W") made belief a post-hoc afterthought.
3. **N-cycle penalty** (`env_reward`): if the last 6 actions have ≤2 unique
   values, -0.4. Closes the M-E alternation loophole AND any longer N-cycle
   the agent might find.
4. **New-action exploration bonus** (`env_reward`): +0.2 reward for taking
   an action that hasn't appeared in the current episode (until 6+ unique
   actions have been tried). Pushes the agent to PROBE varied actions early:
   the canonical meta-RL exploration signal.
5. **Sparse terminal reward** (env `step()` at done=True): add
   `(final_score - 0.5) × 5` to the last step's reward. Direct supervision
   on the actual grader, range [-2.5, +2.5], strong enough to dominate any
   local reward-hack.

Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance),
LoRA rank 8 → 16 (more capacity).
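A sketch of fixes 3-5 as standalone functions. The constants are the ones listed above; representing episode state as a plain list of action names is a simplification (in the real env these live on the episode object):

```python
def n_cycle_penalty(action_history: list[str]) -> float:
    # Fix 3: if the last 6 actions use <= 2 unique values, penalize.
    # Catches A-A-A runs, M-E-M-E 2-cycles, and longer low-entropy loops.
    window = action_history[-6:]
    if len(window) == 6 and len(set(window)) <= 2:
        return -0.4
    return 0.0

def exploration_bonus(action: str, action_history: list[str]) -> float:
    # Fix 4: +0.2 the first time an action appears in the episode,
    # until 6+ unique actions have been tried.
    if action not in action_history and len(set(action_history)) < 6:
        return 0.2
    return 0.0

def terminal_reward(final_score: float) -> float:
    # Fix 5: applied only on the last step (done=True). Maps the grader's
    # [0, 1] final_score onto [-2.5, +2.5] so it dominates local shaping.
    return (final_score - 0.5) * 5

history = ["MEDITATE", "EXERCISE"] * 3
print(n_cycle_penalty(history))             # -0.4: the 2-cycle is caught
print(exploration_bonus("SLEEP", history))  # +0.2: probing a new action
```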
**Hypothesis**: With grader-aligned reward + belief-first format + cycle
penalty + exploration bonus + terminal supervision, the agent should:

- Use ≥5 unique actions per episode (varied behavior)
- Maintain belief_accuracy > +0.30 (not regress)
- Beat random in 2/3 conditions on final_score
- Show positive (or less-negative) adaptation than the baselines

**Result**: Iter 3 was never actually launched. Pre-flight inspection of the
HF Space confirmed the cloned snapshot still had stale code, and a re-launched
external review surfaced 7 deeper bugs (see Round 2 below) that needed to
land before any further GPU spend was justified.

---
## Round 2 fixes (applied for iter 4+, after external bug review)

An external agent surfaced 7 issues that survived all prior reviews. All
landed on the `round2` branch and on the HF Space `main` before iter 4
launched (fixes 2, 5, and 6 are sketched after this list):

1. **Anomalies surfaced in the prompt** (`StepRecord` + `format_observation_prompt`
   + `inference.py`): per-meter anomaly signals were computed each step but
   never made visible to the agent, which was supposed to learn from them.
2. **Belief baseline subtraction** in `belief_accuracy`: the reward is now
   `similarity - constant_baseline_similarity`. The constant `5 5 5` belief
   no longer earns a free +1/step floor.
3. **Profile weight cap 0.80 → 0.45** in `sample_profile`. Forces every
   sampled profile to weight 3+ meters meaningfully (originally to kill the
   "single-meter dominant → SLEEP-spam optimal" exploit).
4. **Scaled-down shaping** in `_compute_reward`: -0.10 / -0.15 / +0.07
   (was -0.30 / -0.40 / +0.20). Reduces the noise floor of shaping vs. the
   real signal layers.
5. **Step-0 belief reward = 0**: the agent has no information at step 0, so
   penalizing belief-vs-target there just punishes initialization.
6. **Belief-action coupling reward** (±0.15): rewards the chosen action
   matching the agent's emitted belief, penalizes it contradicting the
   belief. Forces the belief to be *causally useful*, not decorative.
7. **`grader_bias` moved out of `_compute_reward` into `env_reward`**:
   keeps the per-step env reward pure for inference-signal analysis. The
   progress/connection bias still lands in the GRPO advantage, just via
   the env-reward layer.
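A sketch of how fixes 2, 5, and 6 compose in the belief layer. The similarity function and the `belief_implied_actions` compatibility set are illustrative stand-ins; the baseline subtraction, step-0 zeroing, and ±0.15 coupling are the fixes listed above:

```python
import numpy as np

CONSTANT_BELIEF = np.array([0.5, 0.5, 0.5])  # the "5 5 5" neutral default

def belief_similarity(belief: np.ndarray, target: np.ndarray) -> float:
    # MAE-based similarity in [0, 1]; the real metric may differ.
    return 1.0 - float(np.abs(belief - target).mean())

def belief_reward(step: int, belief: np.ndarray, target: np.ndarray,
                  action: str, belief_implied_actions: set[str]) -> float:
    # Fix 5: no information at step 0, so no reward or penalty there.
    if step == 0:
        return 0.0
    # Fix 2: subtract what the constant belief would have scored, so the
    # lazy "5 5 5" strategy earns ~0 instead of a free floor.
    r = (belief_similarity(belief, target)
         - belief_similarity(CONSTANT_BELIEF, target))
    # Fix 6: +/-0.15 coupling. belief_implied_actions is a hypothetical
    # mapping from the emitted belief to the actions consistent with it.
    r += 0.15 if action in belief_implied_actions else -0.15
    return r
```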
---

## Iter 4: Round 2 fixes, partial run, mistakenly cancelled (2026-04-26, ~$2.10, 235/800 steps)

**Config**: a10g-large, LoRA rank 16, num_generations 8, 800 steps, all
Round 1 + Iter 3 architectural fixes + Round 2 (above).

**Hypothesis**: With anomalies in the prompt, baseline subtraction killing
the belief-spam floor, belief-action coupling forcing causal use of the
belief, and grader_bias keeping env-reward pure, the agent should show
monotonic belief_accuracy growth without hitting a 2-cycle hack.

**What we got** (from the 235-step partial; see `docs/iter4_partial_analysis.txt`):

Working:

- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10 by step 60, from an initial 36+
- No catastrophic mode collapse

Broken, the unsolved core:

- **`belief_accuracy/mean` flat at -0.10 throughout all 235 steps**
- Linear slope: +0.0007 per 100 steps (essentially zero, well under noise)
- The agent emits beliefs SLIGHTLY WORSE than the constant baseline

**Why the run ended at 235**: I cancelled the job based on stale HF API
log output that suggested the run was stuck. The HF UI showed it was
healthy. ~$2.10 wasted. Lesson banked: **trust the live UI over the
`/logs` API endpoint**, which lags severely.
**Root-cause hypothesis** (post-mortem analysis):

The profile cap (0.80 → 0.45) and the baseline subtraction interact
negatively. With weights clamped to ≤0.45, sampled profiles cluster
toward balanced; `profile_to_belief_vector` (whose `work_pref` axis is
30%-weighted on the progress reward weight) consequently lands closer to
[0.5, 0.5, 0.5]. The constant `5 5 5` belief already has high cosine
similarity with that target, so after baseline subtraction there is
almost no headroom for the agent to "win" against it.

**Why we missed it**:

- The Round 2 fixes were treated as independent, but #2 (baseline
  subtraction) and #3 (profile cap) share the same denominator: the
  spread of the belief-target distribution. An analytical check on
  belief-target stddev under the new cap would have caught it before
  spending compute (a sketch of that check follows).
- The `grader_bias` term (#7) was the original justification for
  needing a tighter profile cap (to kill the SLEEP-spam exploit). Once
  grader_bias was in env_reward, the cap could have been reverted.
  We applied both fixes simultaneously.
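The check we should have run, sketched. The rejection-sampled cap and the belief-vector mapping below are simplified stand-ins for `sample_profile` / `profile_to_belief_vector`; the point is measuring target spread around the constant belief before spending compute:

```python
import numpy as np

rng = np.random.default_rng(0)

def capped_dirichlet(cap: float, k: int = 5) -> np.ndarray:
    # Rejection-sample Dirichlet weights until every entry is <= cap.
    while True:
        w = rng.dirichlet(np.ones(k))
        if w.max() <= cap:
            return w

def headroom(cap: float, n: int = 2000) -> float:
    # Mean distance between a simplified 3-axis belief target and the
    # constant 0.5 belief. Stand-in mapping: rescale 3 of the 5 weights
    # (mean weight 0.2 -> center 0.5) and clip to [0, 1].
    targets = np.array([capped_dirichlet(cap)[:3] * 2.5 for _ in range(n)])
    targets = targets.clip(0, 1)
    return float(np.abs(targets - 0.5).mean())

print(f"cap=0.80: mean |target - const| = {headroom(0.80):.3f}")
print(f"cap=0.45: mean |target - const| = {headroom(0.45):.3f}")
# The tighter cap shrinks the target spread, so the constant belief's
# baseline similarity rises and the agent's winnable headroom shrinks.
```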
---

## Iter 5: Identical fixes, smaller config (2026-04-26, ~$2.50, 500 steps)

**Config**: a10g-large, LoRA rank 8, num_generations 4, 500 steps. The same
fix set as iter 4: Round 1 + Iter 3 architectural + Round 2.

**Result**: Worse than the iter 4 partial. 86% SLEEP; the agent never emits a
belief (`format_valid` stuck at +0.5, the action-only score, for the whole
run); `belief_accuracy` flat at -0.10 (the no-belief penalty score);
`reward_std` collapses to 0 twice during training. final_score 0.349 in-dist,
0.331 OOD. The lower capacity (LoRA 8 + num_gen 4) made GRPO too noisy to
maintain the belief format.
---

## The pivot: stop iterating GRPO, look at what we're optimizing

After iter 5, the question wasn't "what's the next reward-shaping fix";
it was "why does no GRPO config beat the heuristic?" Reading the model's
actual reasoning answered it:

> *"Last step's socialize gave V -0.12 (anomaly -0.06, much worse than
> neutral) → high social drain, suggests low S. Morning DEEP_WORK earlier
> gave bonus cognition (+0.04) → high M..."*

The model **was inferring the profile**. The inference just didn't help its
score. The grader rewarded keeping meters healthy (which a heuristic does
well by reflex) but didn't reward knowing the person. So an agent that did
real inference and an agent that played safe both got the same grade.

The fix: add `belief_accuracy` as 20% of the grade. The heuristic emits no
belief and scores 0 on this component, by design. Now the grader measures
the skill we actually want.

Under the v2 grader, the gpt-5.4 teacher (running with our existing
observation prompt) hits **0.617 vs. heuristic 0.449: a +0.168 margin and
30/30 head-to-head wins** on the same seeds.

That made the second realization unavoidable: **Algorithm Distillation is
the right recipe** ([Laskin et al. 2022](https://arxiv.org/abs/2210.14215)),
not GRPO from scratch. Small reasoning models (≤3B) need a teacher to
bootstrap. We had access to a frontier teacher; we'd just been ignoring it.
---

## Final pipeline: SFT-prime via Algorithm Distillation

1. **Stage 1: Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+
   episodes. Each step: `<reasoning>...</reasoning>` + `S M W ACTION_NAME`.
   ~$3 / 30 episodes.
2. **Stage 2: SFT prime.** Qwen 2.5-3B + Unsloth + LoRA r=16 fine-tuned on
   the teacher trajectories (sketch below). ~25 min on a10g-large, ~$2-3.
3. **Optional Stage 3: GRPO refine.** Only if SFT alone misses the bar
   (so far it doesn't).
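A sketch of how Stage 1 rollouts become Stage 2 SFT examples. The record fields and file names are illustrative; the completion uses the belief-first format from iter 3:

```python
import json

def to_sft_example(step_record: dict) -> dict:
    # One teacher step -> one supervised example. The prompt is the same
    # observation prompt the student sees at inference time; the completion
    # is the teacher's reasoning plus its belief-first action.
    return {
        "prompt": step_record["observation_prompt"],
        "completion": (
            f"<reasoning>{step_record['teacher_reasoning']}</reasoning>\n"
            f"{step_record['belief']} {step_record['action']}"  # "S M W ACTION_NAME"
        ),
    }

with open("teacher_rollouts.jsonl") as src, open("sft_dataset.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_sft_example(json.loads(line))) + "\n")
```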
See [`docs/results.md`](results.md) for headline numbers and
[`README.md`](../README.md) for the full pipeline + reproduction instructions.
---

## Spend tracker

| Stage | Cost | Outcome |
|---|---|---|
| Iters 1-2 (GRPO from scratch) | ~$2 | Mode collapses; grader-shape lessons |
| Iters 3-4 (round 2 fixes) | ~$3.60 | Inference happens but the grader doesn't reward it |
| Iter 5 (smaller config) | ~$2.50 | Confirms low capacity makes things worse |
| Algorithm Distillation pipeline | ~$5.50 | Real result, real story |
| **Total budget used** | **~$13.60** of $30 | |

The 5 GRPO-from-scratch attempts weren't waste; they're what taught us the
grader was the wrong shape. Without them we wouldn't have understood why
naive RL was failing, and we'd have skipped straight to a less defensible
fix.
---

## What we'll write up

The story of this submission is the pivot, not the iteration count.
Five rounds of GRPO patches couldn't beat the heuristic because the grader
didn't measure inference. Reading the model's reasoning surfaced the
mismatch. Fixing the grader and switching to Algorithm Distillation got
us a real result. The journey is the writeup.
## OpenEnv Rubric system (refactor complete, post-deadline)

Originally we shipped with a custom `_grade_episode` and an honestly
acknowledged gap. After the submission deadline we returned and did
the proper refactor (see `server/rubrics.py`):

- 6 `Rubric` subclasses, one per scored axis
  (`CrashFreeRubric`, `ProgressRubric`, `ConnectionRubric`,
  `AdaptationRubric`, `EfficiencyRubric`, `BeliefAccuracyRubric`)
- Composed via `openenv.core.rubrics.WeightedSum` with weights summing
  to 1.0 (matching the original 0.15 / 0.20 / 0.10 / 0.25 / 0.10 / 0.20)
- `_grade_episode` now delegates to `make_grade_rubric(self)(None, None)`

Each sub-rubric reads aggregated episode-end env state via a reference
held in `__init__`, the pattern recommended in RFC 004 for
trajectory-summary scoring on top of the per-(action, observation)
Rubric ABC.
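A condensed sketch of the shape, showing two of the six rubrics (the rest follow the same pattern). This assumes the `Rubric` ABC lives alongside `WeightedSum`, that `WeightedSum` takes (rubric, weight) pairs, and that env attributes like `final_meters` exist; treat those as our conventions and stand-ins, not a general OpenEnv reference:

```python
from openenv.core.rubrics import Rubric, WeightedSum

class ProgressRubric(Rubric):
    def __init__(self, env):
        # Hold an env reference; episode-end state is read at call time
        # (the RFC 004 trajectory-summary pattern).
        self.env = env

    def __call__(self, action, observation) -> float:
        # action/observation are unused: this rubric scores aggregated
        # episode-end state, not a single (action, observation) pair.
        return self.env.final_meters["progress"]

class BeliefAccuracyRubric(Rubric):
    def __init__(self, env):
        self.env = env

    def __call__(self, action, observation) -> float:
        return self.env.episode_belief_accuracy  # 0 for belief-free agents

def make_grade_rubric(env):
    # Weights match the hand-rolled grader and sum to 1.0 in the real
    # version, which also composes CrashFree (0.15), Connection (0.10),
    # Adaptation (0.25), and Efficiency (0.10).
    return WeightedSum([
        (ProgressRubric(env), 0.20),
        (BeliefAccuracyRubric(env), 0.20),
    ])
```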
Two new tests in `tests/test_rhythm_env.py` verify that the grader
literally uses `WeightedSum` and that the 6 child rubrics are present
with the expected names (not just functionally equivalent, but actually
using the framework primitive). All 52 tests pass.