# RhythmEnv Training Journey – Iteration Log

A structured record of every training iteration: what we expected, what
happened, what broke, why we missed it, and what we changed next.

This doubles as raw material for the hackathon blog post. The "Why we missed
it" sections are deliberately honest β€” judges and future maintainers benefit
from the failure post-mortems more than from polished success stories.

---

## Iter 0 (pre-existing): Original v1 single-task training

**Date**: pre-2026-04-25
**Config**: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded
profiles (introvert / extrovert / workaholic), action-only output.

**What we expected**: Trained agent should beat the heuristic baseline on at
least 1-2 of the 3 profiles. The env exposed enough information (meter deltas,
anomaly signals, step history) that a well-trained agent should discover the
profile from rewards.

**What we got**:
| Profile | Heuristic | Trained 500-step | Δ |
|---|---|---|---|
| Introvert Morning | 0.765 | 0.617 | **-0.148** ❌ |
| Extrovert Night Owl | 0.819 | 0.725 | **-0.094** ❌ |
| Workaholic Stoic | 0.761 | 0.539 | **-0.222** ❌ |

**Root cause** (identified in retro):
1. Env was *designed* for meta-learning (3 hidden profiles) but *trained* as
   single-task RL – no scaffolding to teach the inference skill.
2. Grader had a `0.30 × meter_balance` term that rewarded random behavior
   (random has high meter variance by chance).
3. Only 3 profiles → memorizable, not learnable as a skill.
4. No explicit "form a model of the user" output → no gradient pushing the
   model toward inference.

**The pivot**: redesign rhythm_env as a meta-RL environment.

---

## Refactor: meta-RL conversion (2026-04-25)

Big surgical refactor:
- **Continuous profile space** via `sample_profile(seed)` – Dirichlet weights
  + uniform-bounded modifiers. Memorization impossible.
- **Belief output** added to action format: `ACTION_NAME S M W`.
- **`belief_accuracy` reward**: MAE-based, range [-0.5, +0.5], compares the
  emitted belief vector to the ground-truth `profile_to_belief_vector(profile)`
  (see the sketch after this list).
- **Grader rewrite**: dropped `meter_balance` (rewarded random), added
  `adaptation_score` (got better mid-episode).
- **Curriculum**: `hint_fraction=0.15` of training samples include true
  profile vector in prompt as warmup.
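
To make the two new pieces concrete, here is a minimal sketch of the profile
sampler and the MAE-based belief reward. The function names mirror the ones
above; the meter list, modifier names, bounds, and the belief mapping are
illustrative assumptions, not the env's actual code.

```python
import numpy as np

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

def sample_profile(seed: int) -> dict:
    """Continuous hidden profile: Dirichlet reward weights over the five meters
    plus uniform-bounded behavioral modifiers. Bounds here are illustrative."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(len(METERS)))
    modifiers = {
        "social_drain": rng.uniform(0.0, 1.0),    # how costly SOCIALIZE feels
        "morning_bonus": rng.uniform(0.0, 1.0),   # early-day cognition boost
        "work_pref": rng.uniform(0.0, 1.0),       # appetite for DEEP_WORK
    }
    return {"weights": dict(zip(METERS, weights)), "modifiers": modifiers}

def profile_to_belief_vector(profile: dict) -> np.ndarray:
    """Ground-truth 3-axis belief target (S, M, W) in [0, 1]. The real mapping is
    richer (e.g. work_pref also leans on the progress reward weight)."""
    m = profile["modifiers"]
    return np.array([1.0 - m["social_drain"], m["morning_bonus"], m["work_pref"]])

def belief_accuracy(emitted_belief: np.ndarray, profile: dict) -> float:
    """MAE-based reward in [-0.5, +0.5]: perfect belief earns +0.5, worst case -0.5.
    Assumes the emitted "S M W" digits are already normalized to [0, 1]."""
    target = profile_to_belief_vector(profile)
    mae = float(np.abs(emitted_belief - target).mean())
    return 0.5 - mae
```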

Pre-training baselines (under the new grader) – what the trained agent must beat:

| Condition | Heuristic | Random | Adaptation |
|---|---|---|---|
| discrete-3-profiles | 0.584 | 0.554 | both negative |
| **continuous-in-distribution** | **0.587** | 0.516 | both negative |
| **continuous-OOD** | **0.580** | 0.508 | both negative |

---

## Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)

**Hypothesis**: With FAST_MODE preset (200 steps, temp 1.0, beta 0.04,
weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least
not regress vs random – and we'd see whether the meta-RL signal is strong
enough to actually learn.

**Config**: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5,
LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.
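
For reference, the whole FAST_MODE preset fits in a dict. A minimal sketch with
illustrative key names (the real preset lives in the training script; the
reward-weight ordering is an assumption inferred from the layer list used in
this log):

```python
# Illustrative FAST_MODE preset for iter 1; key names are ours, values are the settings above.
FAST_MODE = {
    "max_steps": 200,
    "num_generations": 4,            # completions per prompt in each GRPO group
    "beta": 0.04,                    # KL penalty toward the reference policy
    "learning_rate": 5e-5,
    "lora_rank": 8,
    "temperature": 1.0,              # rollout sampling temperature
    "max_completion_length": 32,     # tokens; enough for "ACTION_NAME S M W"
    "hint_fraction": 0.15,           # share of training prompts that reveal the true profile
    # reward layer weights (assumed order): format_valid, action_legal, env_reward, belief_accuracy
    "reward_weights": [0.3, 0.3, 1.0, 1.0],
}
```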

**What we got**:
- final_score 0.224 in-dist, 0.219 OOD – **worse than random** (0.516, 0.508).
- Action distribution: **99.7% `EXERCISE`** (one episode had a single `LEARN`).
- Final beliefs all "5 5 5" – the neutral default.
- belief_accuracy DID climb to +0.43 around step 100-150 before collapsing.

**Root cause: catastrophic mode collapse**

The training log told the story:

| step | reward_std | meaning |
|---|---|---|
| 1 | 0.144 | Healthy diversity in 4 completions per prompt |
| 50 | 0.056 | Diversity shrinking |
| **100** | **0.000** | **All 4 completions identical → GRPO has zero gradient** |
| 200 | 0.000 | Policy permanently frozen |

`format_valid` returned +1.0 for any valid output. `action_legal` returned
+0.5 for any valid action. Both layers gave **the same constant reward
across all 4 completions in a GRPO group**. GRPO computes advantage as
`reward - group_mean`, so constant layers contribute exactly zero to the
gradient. The only learning signal came from `env_reward` and
`belief_accuracy`.

When the policy drifted toward the shortest-token action (`EXERCISE`) +
neutral belief (`5 5 5`), all 4 completions converged to that exact string.
`reward_std → 0`, gradient → 0, policy frozen.
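
The arithmetic of the failure is worth spelling out. A minimal sketch of the
group-relative advantage (std-normalization omitted), showing that a layer
that is constant within a group adds nothing, and that four identical
completions zero the gradient entirely; the numbers are made up:

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: each reward minus the group mean (std-normalization omitted)."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Layer rewards for the 4 completions of one prompt (illustrative values).
format_valid = [1.0, 1.0, 1.0, 1.0]     # constant: every valid output gets +1.0
action_legal = [0.5, 0.5, 0.5, 0.5]     # constant: every legal action gets +0.5
env_reward   = [0.2, -0.1, 0.4, 0.0]    # varies across completions
belief_acc   = [0.1, 0.3, -0.2, 0.0]    # varies across completions

total = np.sum([format_valid, action_legal, env_reward, belief_acc], axis=0)
print(group_advantages(total))                           # same numbers as below:
print(group_advantages(np.add(env_reward, belief_acc)))  # constant layers cancel out of the advantage

# Mode collapse: all 4 completions are the same string, so every layer is constant.
print(group_advantages([1.7, 1.7, 1.7, 1.7]))            # [0. 0. 0. 0.] -> zero gradient, policy frozen
```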

**Why we missed it**:
- I launched 3 review subagents pre-training. The first (correctness/reward
  bugs) was rejected by the user. That subagent's prompt explicitly asked
  *"could one layer dominate the total reward and drown out the others?"* β€”
  it would have caught the constant-reward issue.
- My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage)
  with DIFFERENT random actions per kind. It never tested the case where 4
  completions for the same prompt are identical valid actions – exactly what
  GRPO sees during sampling. If it had, the test would have shown
  `format_valid_std = 0` and I'd have caught this for free.
- "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's
  R1-Zero paper discusses it). I should have caught it during reward design.

**Lessons banked**:
- Constant-output reward layers must be diagnosed during reward design, not
  discovered through GPU spend.
- Bug-finding subagents should be non-skippable for any RL setup change.
- Smoke tests must include "all-identical-completions" as a case (sketched below).
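
The missing dry-run case is cheap to add. A sketch of the check
`pipeline_dryrun.py` should have run; `reward_layers` as a name-to-function
dict is a hypothetical interface, not the real script's:

```python
import numpy as np

def report_group_reward_spread(reward_layers, prompt, completions):
    """For one prompt and one GRPO-sized group of completions, report each layer's
    reward std. A layer with zero spread contributes nothing to the group-relative
    advantage for that prompt."""
    return {
        name: float(np.std([fn(prompt, c) for c in completions]))
        for name, fn in reward_layers.items()
    }

# Case we tested: 4 DIFFERENT valid completions. This alone would have shown
# format_valid and action_legal coming back with std 0.0 for any all-valid group.
# Case we missed: 4 IDENTICAL valid completions (what GRPO actually samples once the
# policy narrows). Every layer has std 0.0, total reward_std is 0.0, and the GRPO
# gradient for that prompt is exactly zero.
```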

---

## Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)

**7 fixes applied** (4 from initial diagnosis + 3 from a re-launched
correctness review subagent that found additional bugs):

1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress
   saturated layers, amplify variable ones)
3. `action_legal` returns 0 for valid (was +0.5) – pure penalty layer
4. Explicit repetition penalty in `env_reward` (-0.3 if the action would make
   3+ in a row; sketched after this list)
5. **CRITICAL-2** (subagent): `_grade_episode` `late_quality` normalization
   was using [-1, +1] but per-step rewards are clamped to [-3, +3]. Fixed.
6. **MAJOR-3** (subagent): `hint_fraction=0.15` created train-eval
   distribution shift (eval had no hints). Set to 0.0.
7. **MAJOR-1** (subagent): seed fallback `i % 50` could create deterministic
   reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.

Plus FAST_MODE bumped: 200 → 400 steps.
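
A minimal sketch of the fix-4 shaping term, assuming the env tracks this
episode's action names in a list (the real `_compute_reward` differs). The
closing comment notes the loophole iter 2 went on to find:

```python
def repetition_penalty(action: str, history: list[str]) -> float:
    """Iter 2, fix 4: -0.3 if taking `action` would make it 3+ in a row."""
    if len(history) >= 2 and history[-1] == history[-2] == action:
        return -0.3
    return 0.0

# Loophole: MEDITATE, EXERCISE, MEDITATE, EXERCISE, ... never trips the check,
# so a 2-cycle collects full reward while ignoring the rest of the action space.
```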

**Hypothesis**: With saturated layers suppressed and explicit anti-repetition
penalty, the agent should escape single-action collapse and produce varied
behavior. Belief accuracy should continue rising past iter 1's +0.43.

**What we got**:
- final_score: **0.224 in-dist, 0.219 OOD** – *literally identical to iter 1*.
- Action distribution: 54.8% MEDITATE, 45.2% EXERCISE – **no other actions used**.
- Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4) – slightly better than
  pure neutral.
- belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✅
- `reward_std` collapsed to 0 at step 200 then **recovered** to 0.06+ after
  the repetition penalty kicked in. Partial escape from collapse.

**Root cause: 2-cycle reward hacking**

The single-action collapse was prevented (good!) but the agent found a new
hack: alternating MEDITATE and EXERCISE. The repetition penalty caught
"3+ same in a row" but missed the M-E-M-E-... 2-cycle.

Deeper issue exposed: **proxy/goal misalignment**. The agent achieved
high `env_reward` (+1.25 mean by step 400) but low `final_score` (0.22).

Sample episode final state: `V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22`.

The agent maxed Vitality / Cognition / Serenity (which the per-step
`profile_weighted_reward` rewards via Dirichlet-sampled weights heavy on
those meters), but ignored Progress (0.0!) and Connection (decayed to 0.22).
The grader weights Progress 0.25 + Connection 0.15 – agent ignored 40% of
the score.

The fundamental issue: profile-weighted per-step reward and the grader
optimize different things. The agent did exactly what we trained it to do –
just not what we wanted it to do.

**Why we missed it**:
- The repetition penalty was scoped too narrowly (3-in-a-row) without
  considering N-cycles. A simple "any low-entropy window" check would have
  covered it.
- The proxy/goal misalignment was hidden in plain sight: per-step reward
  shape (profile-weighted) ≠ grader shape (progress + connection +
  adaptation). I assumed they'd correlate enough.
- We didn't have a runtime trace exercise (4 completions × specific prompt →
  group reward → advantage) before submitting iter 2.

**Lessons banked**:
- Anti-repetition checks must include window-entropy tests, not just
  immediate repetition.
- The training reward MUST be aligned with the eval grader, or the agent
  optimizes the wrong objective.
- "Belief output" is useless if it doesn't influence action selection.
  Belief was emitted as a string AFTER the action – no causal pathway from
  belief to action.

---

## Iter 3: Align reward + restructure format (CANCELLED before run – stale code, $0)

**5 architectural fixes** (the new shaping terms are sketched below):

1. **Per-step reward grader-alignment** (`_compute_reward`): add
   profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`.
   The profile-weighted reward still drives belief inference, but the agent
   now ALWAYS gets penalized for ignoring progress and connection regardless
   of what the sampled profile weights.

2. **Belief-first output format** (`S M W ACTION_NAME`): in causal LM
   generation, tokens generated EARLIER condition LATER tokens. With belief
   tokens first, the action is now causally conditioned on the belief – making
   the belief functionally useful for action selection. The previous order
   ("ACTION S M W") made belief a post-hoc afterthought.

3. **N-cycle penalty** (`env_reward`): if last 6 actions have ≤2 unique
   values, -0.4. Closes the M-E alternation loophole AND any longer N-cycle
   the agent might find.

4. **New-action exploration bonus** (`env_reward`): +0.2 reward for taking
   an action that hasn't appeared in the current episode (until 6+ unique
   actions tried). Pushes the agent to PROBE varied actions early –
   the canonical meta-RL exploration signal.

5. **Sparse terminal reward** (env `step()` at done=True): add
   `(final_score - 0.5) × 5` to the last step's reward. Direct supervision
   on the actual grader, range [-2.5, +2.5], strong enough to dominate any
   local reward-hack.

Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance),
LoRA rank 8 → 16 (more capacity).
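
A combined sketch of the three shaping terms from fixes 3, 4, and 5. The
constants match the numbers above; the function boundaries and episode-state
plumbing are assumptions:

```python
def n_cycle_penalty(history: list[str], window: int = 6) -> float:
    """Fix 3: -0.4 if the last `window` actions use at most 2 unique values.
    Catches 3-in-a-row, the MEDITATE/EXERCISE 2-cycle, and longer loops."""
    recent = history[-window:]
    if len(recent) == window and len(set(recent)) <= 2:
        return -0.4
    return 0.0

def exploration_bonus(action: str, history: list[str], cap: int = 6) -> float:
    """Fix 4: +0.2 for the first use of an action, until `cap` unique actions tried."""
    tried = set(history)
    if action not in tried and len(tried) < cap:
        return 0.2
    return 0.0

def terminal_reward(final_score: float, done: bool) -> float:
    """Fix 5: sparse grader-aligned reward on the last step, range [-2.5, +2.5]."""
    return (final_score - 0.5) * 5.0 if done else 0.0
```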

**Hypothesis**: With grader-aligned reward + belief-first format + cycle
penalty + exploration bonus + terminal supervision, the agent should:
- Use ≥5 unique actions per episode (varied behavior)
- Maintain belief_accuracy > +0.30 (don't regress)
- Beat random in 2/3 conditions on final_score
- Show positive (or less-negative) adaptation than baselines

**Result**: Iter 3 was never actually launched. Pre-flight inspection of the
HF Space confirmed the cloned snapshot still had stale code, and a re-launched
external review surfaced 7 deeper bugs (see Round 2 below) that needed to
land before any further GPU spend was justified.

---

## Round 2 fixes (applied for iter 4+, after external bug review)

External agent surfaced 7 issues that survived all prior reviews. All landed
on `round2` branch and on the HF Space `main` before iter 4 launched:

1. **Anomalies surfaced in prompt** (`StepRecord` + `format_observation_prompt`
   + `inference.py`): per-meter anomaly signals were computed each step but
   never made visible to the agent. Agent was supposed to learn from them.
2. **Belief baseline subtraction** in `belief_accuracy`: reward is now
   `similarity - constant_baseline_similarity`. The constant `5 5 5` belief
   no longer earns a free +1/step floor (see the sketch after this list).
3. **Profile weight cap 0.80 → 0.45** in `sample_profile`. Forces every
   sampled profile to weight 3+ meters meaningfully (originally to kill the
   "single-meter dominant → SLEEP-spam optimal" exploit).
4. **Scaled-down shaping** in `_compute_reward`: -0.10 / -0.15 / +0.07
   (was -0.30 / -0.40 / +0.20). Reduces noise-floor of shaping vs. the
   real signal layers.
5. **Step-0 belief reward = 0**: agent has no information at step 0, so
   penalizing belief-vs-target there just punishes initialization.
6. **Belief-action coupling reward** (±0.15): rewards if the chosen action
   matches the agent's emitted belief, penalizes if it contradicts. Forces
   the belief to be *causally useful*, not decorative.
7. **`grader_bias` moved out of `_compute_reward` into `env_reward`**:
   keeps per-step env reward pure for inference-signal analysis. The
   progress/connection bias still lands in the GRPO advantage, just via
   the env-reward layer.
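
A sketch of the belief-side changes (fixes 2, 5, and 6). The similarity
measure and the action-belief consistency lookup are simplified stand-ins for
the real code:

```python
import numpy as np

NEUTRAL_BELIEF = np.array([0.5, 0.5, 0.5])   # the "5 5 5" default, normalized to [0, 1]

def belief_reward(emitted: np.ndarray, target: np.ndarray, step: int) -> float:
    """Fixes 2 and 5: similarity minus what the constant baseline would earn,
    and no belief reward at step 0 when the agent has no information yet."""
    if step == 0:
        return 0.0
    sim = 1.0 - float(np.abs(emitted - target).mean())               # MAE-style similarity
    baseline = 1.0 - float(np.abs(NEUTRAL_BELIEF - target).mean())   # what "5 5 5" earns
    return sim - baseline                                            # > 0 only if better than spam

def consistent_with_belief(action: str, emitted: np.ndarray) -> bool:
    """Hypothetical consistency check: a couple of illustrative rules only."""
    s, _, w = emitted
    if action == "SOCIALIZE":
        return s >= 0.5            # socializing 'matches' a high-sociability belief
    if action == "DEEP_WORK":
        return w >= 0.5            # deep work 'matches' a high work-preference belief
    return True                    # other actions treated as belief-neutral in this sketch

def belief_action_coupling(action: str, emitted: np.ndarray) -> float:
    """Fix 6: +0.15 if the chosen action is consistent with the emitted belief, else -0.15."""
    return 0.15 if consistent_with_belief(action, emitted) else -0.15
```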

---

## Iter 4: Round 2 fixes – partial run, mistakenly cancelled (2026-04-26, ~$2.10, 235/800 steps)

**Config**: a10g-large, LoRA rank 16, num_generations 8, 800 steps, all
Round 1 + Iter 3 architectural fixes + Round 2 (above).

**Hypothesis**: With anomalies in the prompt, baseline subtraction killing
the belief-spam floor, belief-action coupling forcing causal use of belief,
and grader_bias keeping env-reward pure, the agent should show monotonic
belief_accuracy growth without hitting a 2-cycle hack.

**What we got** (from 235-step partial – see `docs/iter4_partial_analysis.txt`):

Working:
- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10 by step 60 from initial 36+
- No catastrophic mode collapse

Broken – the unsolved core:
- **`belief_accuracy/mean` flat at -0.10 throughout 235 steps**
- Linear slope: +0.0007 per 100 steps (essentially zero, well under noise)
- Agent emits beliefs SLIGHTLY WORSE than constant baseline

**Why the run ended at 235**: I cancelled the job based on stale HF API
log output that suggested the run was stuck. The HF UI showed it was
healthy. ~$2.10 wasted. Lesson banked: **trust the live UI over the
`/logs` API endpoint**, which lags severely.

**Root-cause hypothesis** (post-mortem analysis):

The profile cap (0.80 → 0.45) and the baseline subtraction interact
negatively. With weights clamped to ≤0.45, sampled profiles cluster
toward balanced; `profile_to_belief_vector` (whose `work_pref` axis is
30%-weighted on the progress reward weight) consequently lands closer to
[0.5, 0.5, 0.5]. The constant `5 5 5` belief already has high cosine
similarity with that target, so after baseline subtraction there is
almost no headroom for the agent to "win" against it.

**Why we missed it**:
- The Round 2 fixes were treated as independent, but #2 (baseline
  subtraction) and #3 (profile cap) share the same denominator – the
  spread of the belief target distribution. An analytical check on
  belief-target stddev under the new cap would have caught it before
  spending compute (sketched below).
- The SLEEP-spam exploit was the original justification for the tighter
  profile cap, and the `grader_bias` term (#7) addresses that same exploit.
  Once grader_bias was in env_reward, the cap could have been reverted.
  We applied both fixes simultaneously instead.
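
The analytical check itself is a few lines. A sketch: sample profiles under a
given weight cap, build belief targets, and measure both their spread and how
close the constant baseline already sits to them. The target construction
below is an illustrative stand-in for the real `profile_to_belief_vector`:

```python
import numpy as np

def belief_target_headroom(cap: float, n: int = 2000, seed: int = 0):
    """Returns (mean per-axis stddev of belief targets, mean similarity that the
    constant '5 5 5' belief earns). Higher baseline similarity plus lower spread
    means less headroom for the baseline-subtracted belief reward."""
    rng = np.random.default_rng(seed)
    targets = []
    for _ in range(n):
        w = rng.dirichlet(np.ones(5))
        while w.max() > cap:                 # one way to enforce the per-meter cap: resample
            w = rng.dirichlet(np.ones(5))
        mods = rng.uniform(0.0, 1.0, size=3)
        # Illustrative mapping: third axis leans 30% on the (normalized) progress weight.
        target = np.array([mods[0], mods[1], 0.7 * mods[2] + 0.3 * min(w[2] / cap, 1.0)])
        targets.append(np.clip(target, 0.0, 1.0))
    targets = np.array(targets)
    neutral_sim = 1.0 - np.abs(targets - 0.5).mean(axis=1)
    return float(targets.std(axis=0).mean()), float(neutral_sim.mean())

print(belief_target_headroom(cap=0.80))   # old cap
print(belief_target_headroom(cap=0.45))   # new cap: the hypothesis above predicts less
                                          # spread and higher baseline similarity here
```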

---

## Iter 5: Identical fixes, smaller config (2026-04-26, ~$2.50, 500 steps)

**Config**: a10g-large, LoRA rank 8, num_generations 4, 500 steps. Same fix
set as iter 4 – Round 1 + Iter 3 architectural + Round 2.

**Result**: Worse than iter 4 partial. 86% SLEEP, agent never emits belief
(`format_valid` stuck at +0.5 = action-only the whole run), `belief_accuracy`
flat at -0.10 (the no-belief penalty score), `reward_std` collapses to 0
twice during training. final_score 0.349 in-dist, 0.331 OOD. Lower capacity
(LoRA 8 + num_gen 4) made GRPO too noisy to maintain the belief format.

---

## The pivot: stop iterating GRPO, look at what we're optimizing

After iter 5, the question wasn't "what's the next reward shaping fix" –
it was "why does no GRPO config beat heuristic?" Reading the model's actual
reasoning answered it:

> *"Last step's socialize gave Vβˆ’0.12 (anomaly βˆ’0.06, much worse than
> neutral) β€” high social drain, suggests low S. Morning DEEP_WORK earlier
> gave bonus cognition (+0.04) β†’ high M..."*

The model **was inferring the profile**. The inference just didn't help its
score. The grader rewarded keeping meters healthy (which a heuristic does
well by reflex) but didn't reward knowing the person. So an agent that did
real inference and an agent that played safe both got the same grade.

The fix: add `belief_accuracy` as 20% of the grade. Heuristic emits no
belief and scores 0 on this component, by design. Now the grader measures
the skill we actually want.

Under the v2 grader, the gpt-5.4 teacher (running with our existing
observation prompt) hits **0.617 vs heuristic 0.449 – a +0.168 margin,
30/30 head-to-head wins** on the same seeds.

That made the second realization unavoidable: **Algorithm Distillation is
the right recipe** ([Laskin et al. 2022](https://arxiv.org/abs/2210.14215)),
not GRPO from scratch. Small reasoning models (≤3B) need a teacher to
bootstrap. We had access to a frontier teacher; we'd just been ignoring it.

---

## Final pipeline: SFT-prime via Algorithm Distillation

1. **Stage 1 – Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+
   episodes. Each step: `<reasoning>...</reasoning>` + `S M W ACTION_NAME`.
   ~$3 / 30 episodes.
2. **Stage 2 – SFT prime.** Qwen 2.5-3B + Unsloth + LoRA r=16 fine-tuned on
   teacher trajectories (data format sketched below). ~25 min on a10g-large,
   ~$2-3.
3. **Optional Stage 3 – GRPO refine.** Only if SFT alone misses the bar
   (so far it doesn't).
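
A sketch of how one teacher step becomes one SFT example (the chat-message
shape and field names are assumptions; the real formatting lives in the
Stage 2 script):

```python
def teacher_step_to_sft_example(observation_prompt: str, reasoning: str,
                                belief: str, action: str) -> dict:
    """One teacher env-step -> one supervised example. The student learns to emit the
    teacher's reasoning, then the belief-first 'S M W ACTION_NAME' line."""
    completion = f"<reasoning>{reasoning}</reasoning>\n{belief} {action}"
    return {
        "messages": [
            {"role": "user", "content": observation_prompt},   # same prompt the student sees at eval
            {"role": "assistant", "content": completion},
        ]
    }

# e.g. teacher_step_to_sft_example(obs, "socialize drained V, so low S...", "3 6 7", "DEEP_WORK")
```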

See [`docs/results.md`](results.md) for headline numbers and
[`README.md`](../README.md) for the full pipeline + reproduce instructions.

---

## Spend tracker

| Stage | Cost | Outcome |
|---|---|---|
| Iters 1-2 (GRPO from scratch) | ~$2 | Mode collapses; grader-shape lessons |
| Iters 3-4 (round 2 fixes) | ~$3.60 | Inference happens but grader doesn't reward it |
| Iter 5 (smaller config) | ~$2.50 | Confirms low capacity makes things worse |
| Algorithm Distillation pipeline | ~$5.50 | Real result, real story |
| **Total budget used** | **~$13.60** of $30 | |

The 5 GRPO-from-scratch attempts weren't waste – they're what taught us the
grader was the wrong shape. Without them we wouldn't have understood why
naive RL was failing, and we'd have skipped straight to a less defensible
fix.

---

## What we'll write up

The story of this submission is the pivot, not the iteration count.
Five rounds of GRPO patches couldn't beat heuristic because the grader
didn't measure inference. Reading the model's reasoning surfaced the
mismatch. Fixing the grader and switching to Algorithm Distillation got
us a real result. The journey is the writeup.

## OpenEnv Rubric system (refactor complete, post-deadline)

Originally we ran with a custom `_grade_episode` and an honestly
acknowledged gap. After the submission deadline we returned and did
the proper refactor (see `server/rubrics.py`):

- 6 `Rubric` subclasses, one per scored axis
  (`CrashFreeRubric`, `ProgressRubric`, `ConnectionRubric`,
  `AdaptationRubric`, `EfficiencyRubric`, `BeliefAccuracyRubric`)
- Composed via `openenv.core.rubrics.WeightedSum` with weights summing
  to 1.0 (matching the original 0.15 / 0.20 / 0.10 / 0.25 / 0.10 / 0.20)
- `_grade_episode` now delegates to `make_grade_rubric(self)(None, None)`

Each sub-rubric reads aggregated episode-end env state via a reference
held in `__init__` – the recommended pattern from RFC 004 for
trajectory-summary scoring on top of the per-(action, observation)
Rubric ABC.
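
A structural sketch of that pattern. Everything below is a placeholder: the
real `Rubric` ABC and `WeightedSum` come from `openenv.core.rubrics`, and
their actual signatures may differ.

```python
class WeightedSum:
    """Placeholder stand-in for openenv.core.rubrics.WeightedSum."""
    def __init__(self, children: dict):
        self.children = children                       # rubric instance -> weight

    def __call__(self, action, observation) -> float:
        return sum(w * rubric(action, observation) for rubric, w in self.children.items())

class BeliefAccuracyRubric:
    """One of the six axis rubrics: holds an env reference from __init__ and reads
    aggregated episode-end state when called (the RFC 004 trajectory-summary pattern)."""
    def __init__(self, env):
        self.env = env

    def __call__(self, action=None, observation=None) -> float:
        # The per-(action, observation) signature is satisfied but unused; the score
        # comes from episode-end aggregates (field name here is hypothetical).
        return self.env.episode_summary["belief_accuracy"]

def make_grade_rubric(env):
    """Compose the axis rubrics; in the real code all six are present and the weights sum to 1.0."""
    return WeightedSum({BeliefAccuracyRubric(env): 0.20})
```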

Two new tests in `tests/test_rhythm_env.py` verify that the grader
literally uses `WeightedSum` and that the 6 child rubrics are present
with the expected names (not just functionally equivalent – actually
using the framework primitive). All 52 tests pass.