docs: iteration journal with hypothesis/result/root-cause/fix per iter
Captures the training journey honestly:
- Iter 0 (pre-existing): single-task training regressed
- Refactor: meta-RL conversion (continuous profiles + belief + adaptation)
- Iter 1: mode collapse to single action (constant rewards = no GRPO gradient)
- Iter 2: mode collapse to 2-cycle (proxy/goal misalignment exposed)
- Iter 3: in flight with 5 architectural fixes + belief-first format
Includes 'Why we missed it' sections for each failure - useful for both
future maintenance and hackathon storytelling. Honest post-mortems are
better submission material than polished success-only writeups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/iterations.md +272 -0
docs/iterations.md
ADDED
@@ -0,0 +1,272 @@
# RhythmEnv Training Journey: Iteration Log

A structured record of every training iteration: what we expected, what
happened, what broke, why we missed it, and what we changed next.

This doubles as raw material for the hackathon blog post. The "Why we missed
it" sections are deliberately honest: judges and future maintainers benefit
from the failure post-mortems more than from polished success stories.

---

## Iter 0 (pre-existing): Original v1 single-task training

**Date**: pre-2026-04-25
**Config**: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded
profiles (introvert / extrovert / workaholic), action-only output.

**What we expected**: Trained agent should beat the heuristic baseline on at
least 1-2 of the 3 profiles. The env exposed enough information (meter deltas,
anomaly signals, step history) that a well-trained agent should discover the
profile from rewards.

**What we got**:
| Profile | Heuristic | Trained 500-step | Δ |
|---|---|---|---|
| Introvert Morning | 0.765 | 0.617 | **−0.148** ✗ |
| Extrovert Night Owl | 0.819 | 0.725 | **−0.094** ✗ |
| Workaholic Stoic | 0.761 | 0.539 | **−0.222** ✗ |

**Root cause** (identified in retro):
1. Env was *designed* for meta-learning (3 hidden profiles) but *trained* as
   single-task RL, with no scaffolding to teach the inference skill.
2. Grader had a `0.30 × meter_balance` term that rewarded random behavior
   (random has high meter variance by chance).
3. Only 3 profiles: memorizable, not learnable as a skill.
4. No explicit "form a model of the user" output, so no gradient pushing the
   model toward inference.

**The pivot**: redesign rhythm_env as a meta-RL environment.

---

## Refactor: meta-RL conversion (2026-04-25)

Big surgical refactor:
- **Continuous profile space** via `sample_profile(seed)`: Dirichlet weights
  plus uniform-bounded modifiers. Memorization impossible.
- **Belief output** added to action format: `ACTION_NAME S M W`.
- **`belief_accuracy` reward**: MAE-based, range [-0.5, +0.5], compares
  emitted belief vector to ground-truth `profile_to_belief_vector(profile)`.
- **Grader rewrite**: dropped `meter_balance` (rewarded random), added
  `adaptation_score` (rewards getting better mid-episode).
- **Curriculum**: `hint_fraction=0.15`, i.e. 15% of training samples include
  the true profile vector in the prompt as warmup.
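
A minimal sketch of the two central pieces above, under loudly stated assumptions: `sample_profile_sketch` and `belief_accuracy_sketch` are hypothetical stand-ins, not the repo's `sample_profile` or `belief_accuracy`. The sketch assumes the Dirichlet weights are over the five meters, that belief entries are integers on a 0-10 scale, and that the reward is `0.5 - MAE`, which is one simple way to land in [-0.5, +0.5]. The modifier name and bounds are invented for illustration.

```python
import random

METERS = ("vitality", "cognition", "progress", "serenity", "connection")


def sample_profile_sketch(seed, alpha=1.0):
    """Hypothetical stand-in for sample_profile(seed)."""
    rng = random.Random(seed)
    # A Dirichlet(alpha) draw: normalize independent Gamma(alpha, 1) samples.
    draws = [rng.gammavariate(alpha, 1.0) for _ in METERS]
    total = sum(draws)
    weights = {m: d / total for m, d in zip(METERS, draws)}
    # One uniform-bounded modifier; the name and the bounds are made up.
    modifiers = {"decay_scale": rng.uniform(0.5, 1.5)}
    return {"weights": weights, "modifiers": modifiers}


def belief_accuracy_sketch(belief, truth, scale=10.0):
    """0.5 - MAE, with belief entries on 0..scale and truth entries in [0, 1]."""
    if len(belief) != len(truth):
        raise ValueError("belief and truth must have the same length")
    mae = sum(abs(b / scale - t) for b, t in zip(belief, truth)) / len(truth)
    return 0.5 - mae


profile = sample_profile_sketch(seed=7)
perfect = belief_accuracy_sketch([5, 5, 5], [0.5, 0.5, 0.5])  # MAE = 0
worst = belief_accuracy_sketch([0, 0, 0], [1.0, 1.0, 1.0])    # MAE = 1
```

Because every seed yields a different point on the weight simplex, there is no finite set of profiles to memorize, which is the property the refactor was after.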

Pre-training baselines (under new grader), i.e. what the trained agent must beat:

| Condition | Heuristic | Random | Adaptation |
|---|---|---|---|
| discrete-3-profiles | 0.584 | 0.554 | both negative |
| **continuous-in-distribution** | **0.587** | 0.516 | both negative |
| **continuous-OOD** | **0.580** | 0.508 | both negative |

---

## Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)

**Hypothesis**: With FAST_MODE preset (200 steps, temp 1.0, beta 0.04,
weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least
not regress vs random, and we'd see whether the meta-RL signal is strong
enough to actually learn.

**Config**: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5,
LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.

**What we got**:
- final_score 0.224 in-dist, 0.219 OOD: **worse than random** (0.516, 0.508).
- Action distribution: **99.7% `EXERCISE`** (one episode had a single `LEARN`).
- Final beliefs all "5 5 5", the neutral default.
- belief_accuracy DID climb to +0.43 around step 100-150 before collapsing.

**Root cause: catastrophic mode collapse**

The training log told the story:

| step | reward_std | meaning |
|---|---|---|
| 1 | 0.144 | Healthy diversity in 4 completions per prompt |
| 50 | 0.056 | Diversity shrinking |
| **100** | **0.000** | **All 4 completions identical → GRPO has zero gradient** |
| 200 | 0.000 | Policy permanently frozen |

`format_valid` returned +1.0 for any valid output. `action_legal` returned
+0.5 for any valid action. Both layers gave **the same constant reward
across all 4 completions in a GRPO group**. GRPO computes advantage as
`reward - group_mean`, so constant layers contribute exactly zero to the
gradient. The only learning signal came from `env_reward` and
`belief_accuracy`.
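
The cancellation can be checked numerically. The sketch below uses the mean-centered advantage exactly as written above (any further scaling is omitted), assumes the weight list maps to layers in the order `format_valid`, `action_legal`, `env_reward`, `belief_accuracy`, and uses made-up values for the two variable layers.

```python
from statistics import mean

# Assumed mapping of the iter 1 weights [0.3, 0.3, 1.0, 1.0] to layers.
WEIGHTS = {
    "format_valid": 0.3,
    "action_legal": 0.3,
    "env_reward": 1.0,
    "belief_accuracy": 1.0,
}


def weighted_totals(layers, weights):
    """Weighted sum across reward layers, one total per completion."""
    n = len(next(iter(layers.values())))
    return [
        sum(weights[name] * values[i] for name, values in layers.items())
        for i in range(n)
    ]


def group_advantages(totals):
    """Mean-centered advantages for one group of completions."""
    mu = mean(totals)
    return [t - mu for t in totals]


group = {
    "format_valid": [1.0] * 4,                 # constant across the group
    "action_legal": [0.5] * 4,                 # constant across the group
    "env_reward": [0.2, -0.1, 0.4, 0.0],       # made-up values
    "belief_accuracy": [0.1, 0.3, -0.2, 0.0],  # made-up values
}
variable_only = {k: group[k] for k in ("env_reward", "belief_accuracy")}

adv_all = group_advantages(weighted_totals(group, WEIGHTS))
adv_variable = group_advantages(weighted_totals(variable_only, WEIGHTS))

# Collapsed group: all 4 completions are the same string, so every layer
# returns the same value for each of them.
collapsed = {name: [values[0]] * 4 for name, values in group.items()}
adv_collapsed = group_advantages(weighted_totals(collapsed, WEIGHTS))
```

`adv_all` and `adv_variable` agree up to floating-point error, so the two constant layers add nothing, and `adv_collapsed` is all zeros, which is the frozen state the table shows from step 100 onward.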

When the policy drifted toward the shortest-token action (`EXERCISE`) plus
neutral belief (`5 5 5`), all 4 completions converged to that exact string.
`reward_std → 0`, gradient → 0, policy frozen.

**Why we missed it**:
- I launched 3 review subagents pre-training. The first (correctness/reward
  bugs) was rejected by the user. That subagent's prompt explicitly asked
  *"could one layer dominate the total reward and drown out the others?"*;
  it would have caught the constant-reward issue.
- My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage)
  with DIFFERENT random actions per kind. It never tested the case where 4
  completions for the same prompt are identical valid actions, which is
  exactly what GRPO sees during sampling. If it had, the test would have shown
  `format_valid_std = 0` and I'd have caught this for free.
- "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's
  R1-Zero paper discusses it). I should have caught it during reward design.

**Lessons banked**:
- Constant-output reward layers must be diagnosed during reward design, not
  discovered through GPU spend.
- Bug-finding subagents should be non-skippable for any RL setup change.
- Smoke tests must include "all-identical-completions" as a case.
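
A rough sketch of what such a smoke test could look like. Everything here is a toy: the three layer functions, the action subset, and the reward lookup are hypothetical stand-ins rather than the real reward code. The idea is to score one group of distinct valid completions and one group of identical completions, then list every layer whose within-group standard deviation is zero. A layer that is dead even on the distinct group carries no gradient at all; on the identical group every deterministic layer is dead, which is the collapsed state.

```python
from statistics import pstdev

ACTIONS = {"EXERCISE", "MEDITATE", "LEARN"}  # hypothetical subset
TOY_ENV_REWARD = {"EXERCISE": 0.2, "MEDITATE": 0.1, "LEARN": 0.4}  # made up


def parse(completion):
    """Parse 'ACTION_NAME S M W'; return (action, beliefs) or None."""
    parts = completion.split()
    if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
        return None
    return parts[0], [int(p) for p in parts[1:]]


def format_valid(completion):
    return 1.0 if parse(completion) is not None else -1.0


def action_legal(completion):
    parsed = parse(completion)
    return 0.5 if parsed and parsed[0] in ACTIONS else -0.5


def env_reward(completion):
    parsed = parse(completion)
    return TOY_ENV_REWARD.get(parsed[0], -1.0) if parsed else -1.0


LAYERS = {
    "format_valid": format_valid,
    "action_legal": action_legal,
    "env_reward": env_reward,
}


def dead_layers(group):
    """Names of layers whose reward has zero spread within the group."""
    stds = {name: pstdev([fn(c) for c in group]) for name, fn in LAYERS.items()}
    return sorted(name for name, s in stds.items() if s == 0.0)


distinct_valid = ["EXERCISE 5 5 5", "MEDITATE 4 6 5", "LEARN 3 3 7", "EXERCISE 2 8 5"]
all_identical = ["EXERCISE 5 5 5"] * 4

dead_on_distinct = dead_layers(distinct_valid)
dead_on_identical = dead_layers(all_identical)
```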

---

## Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)

**7 fixes applied** (4 from initial diagnosis + 3 from a re-launched
correctness review subagent that found additional bugs):

1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress
   saturated layers, amplify variable ones)
3. `action_legal` returns 0 for valid (was +0.5) → pure penalty layer
4. Explicit repetition penalty in `env_reward` (-0.3 if action would make
   3+ in a row)
5. **CRITICAL-2** (subagent): `_grade_episode` `late_quality` normalization
   was using [-1, +1] but per-step rewards are clamped to [-3, +3]. Fixed.
6. **MAJOR-3** (subagent): `hint_fraction=0.15` created train-eval
   distribution shift (eval had no hints). Set to 0.0.
7. **MAJOR-1** (subagent): seed fallback `i % 50` could create deterministic
   reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.

Plus FAST_MODE bumped: 200 → 400 steps.
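
Fixes 5 and 7 are easy to illustrate in isolation. The sketch below is an assumption about how the bugs could manifest, not the actual `_grade_episode` code: `normalize_late_quality` is a hypothetical min-max normalizer, and the two seed functions only reproduce the expressions quoted in the list.

```python
def normalize_late_quality(mean_step_reward, lo=-3.0, hi=3.0):
    """Map a mean per-step reward from [lo, hi] onto [0, 1], clamping first."""
    clamped = max(lo, min(hi, mean_step_reward))
    return (clamped - lo) / (hi - lo)


def fallback_seed_old(i):
    # Only 50 distinct seeds, so sample i and sample i + 50 share a profile.
    return i % 50


def fallback_seed_new(i):
    # i * 17 is injective and XOR with a constant is a bijection,
    # so distinct indices always give distinct seeds.
    return (i * 17) ^ 0xBEEF


# With the old [-1, +1] bounds, rewards of 1.0 and 2.0 both saturate at 1.0.
old_a = normalize_late_quality(1.0, lo=-1.0, hi=1.0)
old_b = normalize_late_quality(2.0, lo=-1.0, hi=1.0)
# With [-3, +3] bounds they stay distinguishable.
new_a = normalize_late_quality(1.0)
new_b = normalize_late_quality(2.0)

old_seed_count = len({fallback_seed_old(i) for i in range(200)})
new_seed_count = len({fallback_seed_new(i) for i in range(200)})
```

Under these assumptions the old bounds throw away any distinction between a good and a very good late-episode reward, and the old seed fallback reuses each profile four times across 200 samples.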

**Hypothesis**: With saturated layers suppressed and explicit anti-repetition
penalty, the agent should escape single-action collapse and produce varied
behavior. Belief accuracy should continue rising past iter 1's +0.43.

**What we got**:
- final_score: **0.224 in-dist, 0.219 OOD**, *literally identical to iter 1*.
- Action distribution: 54.8% MEDITATE, 45.2% EXERCISE, with **no other actions used**.
- Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4), slightly better than
  pure neutral.
- belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✓
- `reward_std` collapsed to 0 at step 200 then **recovered** to 0.06+ after
  the repetition penalty kicked in. Partial escape from collapse.

**Root cause: 2-cycle reward hacking**

The single-action collapse was prevented (good!) but the agent found a new
hack: alternating MEDITATE and EXERCISE. The repetition penalty caught
"3+ same in a row" but missed the M-E-M-E-... 2-cycle.
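
The loophole is easy to reproduce. This is a sketch of a streak-only penalty using the -0.3 and 3-in-a-row values from the fix list; the function names and the exact bookkeeping are assumptions, not the real `env_reward` code.

```python
def repetition_penalty(history, action, streak=3, penalty=-0.3):
    """Penalize `action` if it would make `streak` identical actions in a row."""
    recent = history[-(streak - 1):]
    if len(recent) == streak - 1 and all(a == action for a in recent):
        return penalty
    return 0.0


def episode_penalty(actions):
    """Total repetition penalty accumulated over a sequence of actions."""
    total, history = 0.0, []
    for action in actions:
        total += repetition_penalty(history, action)
        history.append(action)
    return total


single_action = episode_penalty(["EXERCISE"] * 6)            # penalized 4 times
two_cycle = episode_penalty(["MEDITATE", "EXERCISE"] * 3)    # never penalized
```

The alternating sequence never has two identical actions in a row, so a streak-based check cannot see it no matter how long the episode runs.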

Deeper issue exposed: **proxy/goal misalignment**. The agent achieved
high `env_reward` (+1.25 mean by step 400) but low `final_score` (0.22).

Sample episode final state: `V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22`.

The agent maxed Vitality / Cognition / Serenity (which the per-step
`profile_weighted_reward` rewards via Dirichlet-sampled weights heavy on
those meters), but ignored Progress (0.0!) and Connection (decayed to 0.22).
The grader weights Progress 0.25 + Connection 0.15, so the agent ignored 40%
of the score.

The fundamental issue: profile-weighted per-step reward and the grader
optimize different things. The agent did exactly what we trained it to do,
just not what we wanted it to do.

**Why we missed it**:
- The repetition penalty was scoped too narrowly (3-in-a-row) without
  considering N-cycles. A simple "any low-entropy window" check would have
  covered it.
- The proxy/goal misalignment was hidden in plain sight: per-step reward
  shape (profile-weighted) ≠ grader shape (progress + connection +
  adaptation). I assumed they'd correlate enough.
- We didn't have a runtime trace exercise (4 completions × specific prompt →
  group reward → advantage) before submitting iter 2.

**Lessons banked**:
- Anti-repetition checks must include window-entropy tests, not just
  immediate repetition.
- The training reward MUST be aligned with the eval grader, or the agent
  optimizes the wrong objective.
- "Belief output" is useless if it doesn't influence action selection.
  Belief was emitted as a string AFTER the action, leaving no causal pathway
  from belief to action.
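
One possible shape for the "low-entropy window" check mentioned above: compute the Shannon entropy of the action counts over the last few steps and penalize when it falls below a threshold. The window size, the 1.5-bit threshold, and the penalty value are hypothetical choices for illustration. A single repeated action scores 0 bits and a strict two-action alternation scores 1 bit, so both fall under the threshold.

```python
import math
from collections import Counter


def window_entropy(actions, window=6):
    """Shannon entropy (bits) of the action distribution in the last `window` steps."""
    recent = actions[-window:]
    n = len(recent)
    if n == 0:
        return 0.0
    counts = Counter(recent)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def low_entropy_penalty(actions, window=6, min_bits=1.5, penalty=-0.3):
    """Return `penalty` once a full window is available and its entropy is low."""
    if len(actions) < window:
        return 0.0
    return penalty if window_entropy(actions, window) < min_bits else 0.0


two_cycle_bits = window_entropy(["MEDITATE", "EXERCISE"] * 3)
two_cycle_penalty = low_entropy_penalty(["MEDITATE", "EXERCISE"] * 3)
single_penalty = low_entropy_penalty(["EXERCISE"] * 6)
varied_penalty = low_entropy_penalty(["A", "B", "C", "D", "E", "F"])
```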

---

## Iter 3: Align reward + restructure format (in flight at time of writing, ~$5 budgeted, 800 steps)

**5 architectural fixes**:

1. **Per-step reward grader-alignment** (`_compute_reward`): add
   profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`.
   The profile-weighted reward still drives belief inference, but the agent
   now ALWAYS gets penalized for ignoring progress and connection regardless
   of how the sampled profile weights them.

2. **Belief-first output format** (`S M W ACTION_NAME`): in causal LM
   generation, tokens generated EARLIER condition LATER tokens. With belief
   tokens first, the action is now causally conditioned on the belief, making
   the belief functionally useful for action selection. The previous order
   ("ACTION S M W") made belief a post-hoc afterthought.

3. **N-cycle penalty** (`env_reward`): if last 6 actions have ≤2 unique
   values, -0.4. Closes the M-E alternation loophole and any longer cycle
   built from two actions (a cycle over 3+ distinct actions would still pass
   this check).

4. **New-action exploration bonus** (`env_reward`): +0.2 reward for taking
   an action that hasn't appeared in the current episode (until 6+ unique
   actions tried). Pushes the agent to PROBE varied actions early,
   the canonical meta-RL exploration signal.

5. **Sparse terminal reward** (env `step()` at done=True): add
   `(final_score - 0.5) × 5` to the last step's reward. Direct supervision
   on the actual grader, range [-2.5, +2.5], strong enough to dominate any
   local reward-hack.

Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance),
LoRA rank 8 → 16 (more capacity).
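
The four reward-shaping terms (fixes 1, 3, 4 and 5) can be sketched as small pure functions. The constants come from the list above; the function names, the choice to require a full 6-action window, and the reading of "until 6+ unique actions tried" as a cap on the number of distinct actions seen are assumptions, not the actual `_compute_reward` or `env_reward` code.

```python
def alignment_bias(progress_delta, connection_delta):
    """Profile-independent term added to every per-step reward (fix 1)."""
    return 0.5 * progress_delta + 0.4 * connection_delta


def n_cycle_penalty(history, window=6, max_unique=2, penalty=-0.4):
    """Penalty when the last `window` actions use at most `max_unique` values (fix 3)."""
    recent = history[-window:]
    if len(recent) == window and len(set(recent)) <= max_unique:
        return penalty
    return 0.0


def exploration_bonus(history, action, bonus=0.2, cap=6):
    """Bonus for an action not yet seen this episode, until `cap` are seen (fix 4)."""
    seen = set(history)
    if action not in seen and len(seen) < cap:
        return bonus
    return 0.0


def terminal_bonus(final_score):
    """Sparse reward added on the last step; final_score is assumed in [0, 1] (fix 5)."""
    return (final_score - 0.5) * 5.0


two_cycle = n_cycle_penalty(["MEDITATE", "EXERCISE"] * 3)   # caught
three_cycle = n_cycle_penalty(["A", "B", "C"] * 2)           # 3 unique values: passes
first_try = exploration_bonus(["MEDITATE"], "EXERCISE")
repeat_try = exploration_bonus(["MEDITATE", "EXERCISE"], "EXERCISE")
iter2_like_terminal = terminal_bonus(0.22)                    # about -1.4
```

At the iter 2 score of roughly 0.22, the terminal term alone would be about -1.4, which is comparable in size to the +1.25 mean `env_reward` that the 2-cycle was earning.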

**Hypothesis**: With grader-aligned reward + belief-first format + cycle
penalty + exploration bonus + terminal supervision, the agent should:
- Use ≥5 unique actions per episode (varied behavior)
- Maintain belief_accuracy > +0.30 (don't regress)
- Beat random in 2/3 conditions on final_score
- Show positive (or at least less negative) adaptation compared with the baselines

**Result**: TBD when iter 3 completes (~30-40 min after submission).

---

## Spend tracker

| Iter | Cost | Steps | Outcome |
|---|---|---|---|
| 1 | ~$0.50 | 200 | Mode collapse to single action |
| 2 | ~$1.50 | 400 | Mode collapse to 2-cycle |
| 3 | ~$5 (est) | 800 | TBD |
| **Subtotal** | **~$7** | | |
| Budget | $30 | | $23 remaining |

---

## What we'll write up regardless of iter 3 outcome

The iteration journey itself is hackathon material. Even if iter 3 doesn't
hit the "trained > heuristic" bar, we have:

1. **Working meta-RL infrastructure**: continuous profile space + belief
   output + adaptation grader. Novel, defensible.
2. **Clear post-mortem of failure modes**: most teams won't have this
   honesty in their writeup.
3. **Belief learning evidence**: even from iter 2, belief_accuracy +0.36
   shows the agent IS learning to model users.
4. **Reward design lessons**: the "constant reward → mode collapse" insight
   is publishable in itself.

The blog post should lead with the *thesis* (meta-RL for personalization),
include the *journey* (iter 1 collapse → iter 2 partial escape → iter 3
fix), and frame whatever final result we get honestly.