InosLihka Claude Opus 4.7 (1M context) committed on
Commit e12fc69 · 1 Parent(s): 64d24b3

docs: iteration journal with hypothesis/result/root-cause/fix per iter

Captures the training journey honestly:
- Iter 0 (pre-existing): single-task training regressed
- Refactor: meta-RL conversion (continuous profiles + belief + adaptation)
- Iter 1: mode collapse to single action (constant rewards = no GRPO gradient)
- Iter 2: mode collapse to 2-cycle (proxy/goal misalignment exposed)
- Iter 3: in flight with 5 architectural fixes + belief-first format

Includes 'Why we missed it' sections for each failure - useful for both
future maintenance and hackathon storytelling. Honest post-mortems are
better submission material than polished success-only writeups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. docs/iterations.md +272 -0
docs/iterations.md ADDED
@@ -0,0 +1,272 @@
+ # RhythmEnv Training Journey: Iteration Log
+
+ A structured record of every training iteration: what we expected, what
+ happened, what broke, why we missed it, and what we changed next.
+
+ This doubles as raw material for the hackathon blog post. The "Why we missed
+ it" sections are deliberately honest: judges and future maintainers benefit
+ from the failure post-mortems more than from polished success stories.
+
+ ---
+
+ ## Iter 0 (pre-existing): Original v1 single-task training
+
+ **Date**: pre-2026-04-25
+ **Config**: Qwen 2.5-3B + LoRA r=4, 500 steps, GRPO via Unsloth, 3 hardcoded
+ profiles (introvert / extrovert / workaholic), action-only output.
+
+ **What we expected**: The trained agent should beat the heuristic baseline on
+ at least 1-2 of the 3 profiles. The env exposed enough information (meter
+ deltas, anomaly signals, step history) that a well-trained agent should
+ discover the profile from rewards.
+
+ **What we got**:
+ | Profile | Heuristic | Trained 500-step | Δ |
+ |---|---|---|---|
+ | Introvert Morning | 0.765 | 0.617 | **−0.148** ❌ |
+ | Extrovert Night Owl | 0.819 | 0.725 | **−0.094** ❌ |
+ | Workaholic Stoic | 0.761 | 0.539 | **−0.222** ❌ |
+
+ **Root cause** (identified in retro):
+ 1. Env was *designed* for meta-learning (3 hidden profiles) but *trained* as
+ single-task RL, with no scaffolding to teach the inference skill.
+ 2. Grader had a `0.30 × meter_balance` term that rewarded random behavior
+ (random has high meter variance by chance).
+ 3. Only 3 profiles → memorizable, not learnable as a skill.
+ 4. No explicit "form a model of the user" output → no gradient pushing the
+ model toward inference.
+
+ **The pivot**: redesign rhythm_env as a meta-RL environment.
+
+ ---
+
+ ## Refactor: meta-RL conversion (2026-04-25)
+
+ Big surgical refactor:
+ - **Continuous profile space** via `sample_profile(seed)`: Dirichlet weights
+ plus uniform-bounded modifiers. Memorization impossible.
+ - **Belief output** added to action format: `ACTION_NAME S M W`.
+ - **`belief_accuracy` reward**: MAE-based, range [-0.5, +0.5], compares
+ emitted belief vector to ground-truth `profile_to_belief_vector(profile)`.
+ - **Grader rewrite**: dropped `meter_balance` (rewarded random), added
+ `adaptation_score` (rewards getting better mid-episode).
+ - **Curriculum**: `hint_fraction=0.15` of training samples include the true
+ profile vector in the prompt as warmup.
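A minimal sketch of the two core pieces above, under stated assumptions. The function names `sample_profile` and `belief_accuracy` come from this journal, but the bodies are illustrative: the modifier keys (`s`, `m`, `w`), their [0, 1] bounds, and the `0.5 - MAE` form of the reward are guesses that merely reproduce the documented [-0.5, +0.5] range for vectors with entries in [0, 1].

```python
import random

# Meter names as used elsewhere in this journal.
METERS = ("vitality", "cognition", "progress", "serenity", "connection")


def sample_profile(seed: int) -> dict:
    """Hypothetical sampler: Dirichlet meter weights plus bounded modifiers."""
    rng = random.Random(seed)
    # Dirichlet(1, ..., 1) sample via normalized Gamma(1, 1) draws.
    draws = [rng.gammavariate(1.0, 1.0) for _ in METERS]
    total = sum(draws)
    weights = {m: d / total for m, d in zip(METERS, draws)}
    # Modifier names and bounds are assumptions for illustration only.
    modifiers = {k: rng.uniform(0.0, 1.0) for k in ("s", "m", "w")}
    return {"weights": weights, "modifiers": modifiers}


def belief_accuracy(belief, truth) -> float:
    """Assumed MAE-based reward: +0.5 for a perfect belief, -0.5 at worst."""
    mae = sum(abs(b - t) for b, t in zip(belief, truth)) / len(truth)
    return 0.5 - mae
```

Because the profile is a function of the seed alone, the same seed always reproduces the same hidden profile, which keeps evaluation repeatable while still making memorization of a fixed profile list impossible.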
+
+ Pre-training baselines (under the new grader), i.e. what the trained agent
+ must beat:
+
+ | Condition | Heuristic | Random | Adaptation |
+ |---|---|---|---|
+ | discrete-3-profiles | 0.584 | 0.554 | both negative |
+ | **continuous-in-distribution** | **0.587** | 0.516 | both negative |
+ | **continuous-OOD** | **0.580** | 0.508 | both negative |
+
+ ---
+
+ ## Iter 1: First meta-RL training (2026-04-25, $0.50, 200 steps)
+
+ **Hypothesis**: With the FAST_MODE preset (200 steps, temp 1.0, beta 0.04,
+ weights [0.3, 0.3, 1.0, 1.0], num_generations 4), the agent should at least
+ not regress vs random, and we'd see whether the meta-RL signal is strong
+ enough to actually learn.
+
+ **Config**: A100 large, 200 steps, num_gen 4, beta 0.04, lr 5e-5,
+ LoRA rank 8, hint_fraction 0.15, temp 1.0, max_completion 32.
+
+ **What we got**:
+ - final_score 0.224 in-dist, 0.219 OOD: **worse than random** (0.516, 0.508).
+ - Action distribution: **99.7% `EXERCISE`** (one episode had a single `LEARN`).
+ - Final beliefs all "5 5 5", the neutral default.
+ - belief_accuracy DID climb to +0.43 around step 100-150 before collapsing.
+
+ **Root cause: catastrophic mode collapse**
+
+ The training log told the story:
+
+ | step | reward_std | meaning |
+ |---|---|---|
+ | 1 | 0.144 | Healthy diversity in 4 completions per prompt |
+ | 50 | 0.056 | Diversity shrinking |
+ | **100** | **0.000** | **All 4 completions identical → GRPO has zero gradient** |
+ | 200 | 0.000 | Policy permanently frozen |
+
+ `format_valid` returned +1.0 for any valid output. `action_legal` returned
+ +0.5 for any valid action. Both layers gave **the same constant reward
+ across all 4 completions in a GRPO group**. GRPO computes advantage as
+ `reward - group_mean` (then scales it by the group standard deviation), so
+ constant layers contribute exactly zero to the gradient. The only learning
+ signal came from `env_reward` and `belief_accuracy`.
+
+ When the policy drifted toward the shortest-token action (`EXERCISE`) +
+ neutral belief (`5 5 5`), all 4 completions converged to that exact string.
+ `reward_std → 0`, gradient → 0, policy frozen.
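The collapse mechanism can be reproduced in a few lines. This is a sketch, not the trainer's code: the layer order inside `WEIGHTS`, the example reward values, and the `eps` constant are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Iter 1 weights; the layer order (format_valid, action_legal, env_reward,
# belief_accuracy) is an assumption.
WEIGHTS = (0.3, 0.3, 1.0, 1.0)


def total_reward(layers, weights=WEIGHTS):
    """Weighted sum of per-layer rewards for one completion."""
    return sum(w * r for w, r in zip(weights, layers))


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: center on the group mean, scale by group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four valid completions that differ only in env_reward / belief_accuracy.
diverse = [
    (1.0, 0.5, 0.2, 0.1),
    (1.0, 0.5, -0.1, 0.3),
    (1.0, 0.5, 0.4, -0.2),
    (1.0, 0.5, 0.0, 0.0),
]
# A collapsed group: four identical completions, hence identical rewards.
collapsed = [(1.0, 0.5, 0.1, 0.0)] * 4

diverse_adv = group_advantages([total_reward(c) for c in diverse])
collapsed_adv = group_advantages([total_reward(c) for c in collapsed])

# Zeroing the two constant layers leaves the advantages unchanged:
# they shift every reward by the same amount, which the mean removes.
no_constants = group_advantages(
    [total_reward(c, (0.0, 0.0, 1.0, 1.0)) for c in diverse]
)
```

In the collapsed group every advantage is zero, so no update can move the policy back out; in the diverse group the constant layers are along for the ride and only the variable layers shape the advantages.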
+
+ **Why we missed it**:
+ - I launched 3 review subagents pre-training. The first (correctness/reward
+ bugs) was rejected by the user. That subagent's prompt explicitly asked
+ *"could one layer dominate the total reward and drown out the others?"*;
+ it would have caught the constant-reward issue.
+ - My own `pipeline_dryrun.py` tested completion KINDS (perfect/good/garbage)
+ with DIFFERENT random actions per kind. It never tested the case where 4
+ completions for the same prompt are identical valid actions, which is
+ exactly what GRPO sees during sampling. If it had, the test would have shown
+ `format_valid_std = 0` and I'd have caught this for free.
+ - "Constant rewards = no gradient" is a textbook GRPO problem (DeepSeek's
+ R1-Zero paper discusses it). I should have caught it during reward design.
+
+ **Lessons banked**:
+ - Constant-output reward layers must be diagnosed during reward design, not
+ discovered through GPU spend.
+ - Bug-finding subagents should be non-skippable for any RL setup change.
+ - Smoke tests must include "all-identical-completions" as a case.
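A smoke test along these lines might look like the sketch below. Everything here is hypothetical stand-in code rather than the contents of `pipeline_dryrun.py`: the two reward functions only mimic the documented +1.0 / +0.5 values for valid output, the penalty values for invalid output are invented, and `VALID_ACTIONS` lists only the three action names mentioned in this journal.

```python
from statistics import pstdev

VALID_ACTIONS = {"EXERCISE", "LEARN", "MEDITATE"}  # illustrative subset


def format_valid(completion: str) -> float:
    """Stand-in: +1.0 for 'ACTION d d d', an assumed -1.0 otherwise."""
    parts = completion.split()
    ok = len(parts) == 4 and all(p.isdigit() for p in parts[1:])
    return 1.0 if ok else -1.0


def action_legal(completion: str) -> float:
    """Stand-in: +0.5 for a known action, an assumed -0.5 otherwise."""
    parts = completion.split()
    return 0.5 if parts and parts[0] in VALID_ACTIONS else -0.5


def zero_variance_layers(layers, completions, tol=1e-9):
    """Names of reward layers that give every completion the same value."""
    return [
        name
        for name, fn in layers.items()
        if pstdev([fn(c) for c in completions]) <= tol
    ]


LAYERS = {"format_valid": format_valid, "action_legal": action_legal}

# Distinct but all-valid completions, as GRPO samples once formatting is learned.
all_valid = ["EXERCISE 5 5 5", "LEARN 3 7 2", "MEDITATE 8 1 4", "LEARN 5 5 5"]
flagged = zero_variance_layers(LAYERS, all_valid)
```

Any layer that shows up in `flagged` for a group of valid completions carries no gradient signal once the model has learned the format, which is the condition iter 1 hit.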
+
+ ---
+
+ ## Iter 2: Fix mode collapse (2026-04-26 ~01:00 UTC, $1.50, 400 steps)
+
+ **7 fixes applied** (4 from initial diagnosis + 3 from a re-launched
+ correctness review subagent that found additional bugs):
+
+ 1. Sampling temperature 1.0 → 1.5 (force diverse rollouts)
+ 2. Reward weights [0.3, 0.3, 1.0, 1.0] → [0.05, 0.05, 1.5, 3.0] (suppress
+ saturated layers, amplify variable ones)
+ 3. `action_legal` returns 0 for valid (was +0.5), making it a pure penalty layer
+ 4. Explicit repetition penalty in `env_reward` (-0.3 if action would make
+ 3+ in a row)
+ 5. **CRITICAL-2** (subagent): `_grade_episode` `late_quality` normalization
+ was using [-1, +1] but per-step rewards are clamped to [-3, +3]. Fixed.
+ 6. **MAJOR-3** (subagent): `hint_fraction=0.15` created train-eval
+ distribution shift (eval had no hints). Set to 0.0.
+ 7. **MAJOR-1** (subagent): seed fallback `i % 50` could create deterministic
+ reward clusters. Hardened to `(i * 17) ^ 0xBEEF`.
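Fixes 5 and 7 are easy to sanity-check numerically. The sketch below assumes a plain linear rescale for `late_quality` (the real `_grade_episode` may clamp or average differently) and uses hypothetical helper names; only the ranges and the two seed expressions come from the list above.

```python
def to_unit(value: float, lo: float, hi: float) -> float:
    """Linearly map value from [lo, hi] onto [0, 1] (no clamping)."""
    return (value - lo) / (hi - lo)


# A per-step reward of +2.0 is legal under the [-3, +3] clamp.
old_scaled = to_unit(2.0, -1.0, 1.0)  # wrong range: lands outside [0, 1]
new_scaled = to_unit(2.0, -3.0, 3.0)  # matching range: stays inside [0, 1]


def fallback_seed_old(i: int) -> int:
    return i % 50


def fallback_seed_new(i: int) -> int:
    return (i * 17) ^ 0xBEEF


# Over 200 sample indices the old fallback reuses each seed four times,
# while the hardened one never repeats.
old_unique = len({fallback_seed_old(i) for i in range(200)})
new_unique = len({fallback_seed_new(i) for i in range(200)})
```

Seed reuse matters because repeated seeds mean repeated hidden profiles, so groups of training samples can end up sharing the same reward landscape.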
+
+ Plus FAST_MODE bumped: 200 → 400 steps.
+
+ **Hypothesis**: With saturated layers suppressed and an explicit
+ anti-repetition penalty, the agent should escape single-action collapse and
+ produce varied behavior. Belief accuracy should continue rising past iter 1's
+ +0.43.
+
+ **What we got**:
+ - final_score: **0.224 in-dist, 0.219 OOD**, *literally identical to iter 1*.
+ - Action distribution: 54.8% MEDITATE, 45.2% EXERCISE, with **no other actions used**.
+ - Final beliefs cluster around (0.4-0.6, 0.5, 0.3-0.4), slightly better than
+ pure neutral.
+ - belief_accuracy rolling mean climbed steadily: 0.15 → 0.36. ✅
+ - `reward_std` collapsed to 0 at step 200 then **recovered** to 0.06+ after
+ the repetition penalty kicked in. Partial escape from collapse.
+
+ **Root cause: 2-cycle reward hacking**
+
+ The single-action collapse was prevented (good!) but the agent found a new
+ hack: alternating MEDITATE and EXERCISE. The repetition penalty caught
+ "3+ same in a row" but missed the M-E-M-E-... 2-cycle.
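The loophole is visible in a tiny reconstruction of the penalty. This is a sketch of the rule as described (-0.3 when an action would make 3+ in a row), not the actual `env_reward` code; the function names and the episode helper are hypothetical.

```python
def repetition_penalty(history, action, run_length=3, penalty=-0.3):
    """Return the penalty if `action` would complete a run of identical actions."""
    tail = history[-(run_length - 1):]
    if len(tail) == run_length - 1 and all(a == action for a in tail):
        return penalty
    return 0.0


def episode_penalty(actions):
    """Total repetition penalty accumulated over one episode."""
    total, history = 0.0, []
    for action in actions:
        total += repetition_penalty(history, action)
        history.append(action)
    return total


single_action = episode_penalty(["EXERCISE"] * 6)            # penalized 4 times
two_cycle = episode_penalty(["MEDITATE", "EXERCISE"] * 3)    # never penalized
```

The alternating episode never has two identical actions in a row, so a run-length rule cannot see it no matter how long the episode lasts.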
+
+ Deeper issue exposed: **proxy/goal misalignment**. The agent achieved
+ high `env_reward` (+1.25 mean by step 400) but low `final_score` (0.22).
+
+ Sample episode final state: `V=1.0, C=1.0, P=0.0, S=1.0, Cn=0.22`.
+
+ The agent maxed Vitality / Cognition / Serenity (which the per-step
+ `profile_weighted_reward` rewards via Dirichlet-sampled weights heavy on
+ those meters), but ignored Progress (0.0!) and Connection (decayed to 0.22).
+ The grader weights Progress 0.25 + Connection 0.15, so the agent ignored 40%
+ of the score.
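As a rough worked example of how much that costs: if we assume each grader term is simply `weight × final meter value` (an assumption; the real grader may score progress and connection differently), the sample episode above earns almost none of the 0.40 available.

```python
# Final meter values from the sample episode above.
final_state = {"V": 1.0, "C": 1.0, "P": 0.0, "S": 1.0, "Cn": 0.22}

# Grader weights quoted above for Progress and Connection.
grader_weights = {"P": 0.25, "Cn": 0.15}

# Assumed linear scoring: contribution = weight * final meter value.
earned = sum(w * final_state[meter] for meter, w in grader_weights.items())
available = sum(grader_weights.values())
left_on_table = available - earned

print(f"earned {earned:.3f} of {available:.2f}; missed {left_on_table:.3f}")
```

Under that assumption the two ignored meters alone account for roughly 0.37 of missing score, which is the same order as the gap between the trained agent and the random baseline.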
+
+ The fundamental issue: profile-weighted per-step reward and the grader
+ optimize different things. The agent did exactly what we trained it to do,
+ just not what we wanted it to do.
+
+ **Why we missed it**:
+ - The repetition penalty was scoped too narrowly (3-in-a-row) without
+ considering N-cycles. A simple "any low-entropy window" check would have
+ covered it.
+ - The proxy/goal misalignment was hidden in plain sight: per-step reward
+ shape (profile-weighted) ≠ grader shape (progress + connection +
+ adaptation). I assumed they'd correlate enough.
+ - We didn't have a runtime trace exercise (4 completions × specific prompt →
+ group reward → advantage) before submitting iter 2.
+
+ **Lessons banked**:
+ - Anti-repetition checks must include window-entropy tests, not just
+ immediate repetition.
+ - The training reward MUST be aligned with the eval grader, or the agent
+ optimizes the wrong objective.
+ - "Belief output" is useless if it doesn't influence action selection.
+ Belief was emitted as a string AFTER the action, so there was no causal
+ pathway from belief to action.
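One way to express the window-entropy test is the sketch below. The window size of 6 and the 1.2-bit threshold are illustrative choices, not values from the training code, and both function names are hypothetical.

```python
from collections import Counter
from math import log2


def window_entropy(actions, window=6):
    """Shannon entropy (in bits) of the action distribution in the last window."""
    recent = actions[-window:]
    n = len(recent)
    counts = Counter(recent)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def is_low_entropy(actions, window=6, threshold_bits=1.2):
    """Flag a full window whose action entropy falls below the threshold."""
    return len(actions) >= window and window_entropy(actions, window) < threshold_bits


two_cycle = ["MEDITATE", "EXERCISE"] * 3
varied = ["A", "B", "C", "D", "E", "F"]  # placeholder action names
```

A single repeated action scores 0 bits and a strict 2-cycle scores exactly 1 bit, so both fall under the threshold, while six distinct actions score about 2.58 bits. A 3-cycle scores about 1.58 bits, so catching it would need a higher threshold.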
+
+ ---
+
+ ## Iter 3: Align reward + restructure format (in flight at time of writing, ~$5 budgeted, 800 steps)
+
+ **5 architectural fixes**:
+
+ 1. **Per-step reward grader-alignment** (`_compute_reward`): add
+ profile-INDEPENDENT bias `+0.5 × progress_delta + 0.4 × connection_delta`.
+ The profile-weighted reward still drives belief inference, but the agent
+ now ALWAYS gets penalized for ignoring progress and connection regardless
+ of how the sampled profile weights them.
+
+ 2. **Belief-first output format** (`S M W ACTION_NAME`): in causal LM
+ generation, tokens generated EARLIER condition LATER tokens. With belief
+ tokens first, the action is now causally conditioned on the belief, making
+ the belief functionally useful for action selection. The previous order
+ (`ACTION_NAME S M W`) made belief a post-hoc afterthought.
+
+ 3. **N-cycle penalty** (`env_reward`): if last 6 actions have ≤2 unique
+ values, -0.4. Closes the M-E alternation loophole and any other loop
+ confined to two actions; a cycle over 3+ distinct actions would still pass
+ this check.
+
+ 4. **New-action exploration bonus** (`env_reward`): +0.2 reward for taking
+ an action that hasn't appeared in the current episode (until 6+ unique
+ actions tried). Pushes the agent to PROBE varied actions early, which is
+ the canonical meta-RL exploration signal.
+
+ 5. **Sparse terminal reward** (env `step()` at done=True): add
+ `(final_score - 0.5) × 5` to the last step's reward. Direct supervision
+ on the actual grader, range [-2.5, +2.5], strong enough to dominate any
+ local reward-hack.
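The four reward-shaping terms (fixes 1, 3, 4 and 5) can be sketched as standalone functions. The constants come from the list above; the function names, the convention that `actions` already includes the current action, and the requirement of a full 6-action window are assumptions.

```python
def grader_alignment_bias(progress_delta: float, connection_delta: float) -> float:
    """Fix 1: profile-independent bias added to the per-step reward."""
    return 0.5 * progress_delta + 0.4 * connection_delta


def n_cycle_penalty(actions, window=6, max_unique=2, penalty=-0.4) -> float:
    """Fix 3: penalize a full window that uses at most `max_unique` actions."""
    recent = actions[-window:]
    if len(recent) == window and len(set(recent)) <= max_unique:
        return penalty
    return 0.0


def exploration_bonus(history, action, cap=6, bonus=0.2) -> float:
    """Fix 4: reward a first-time action until `cap` unique actions were tried."""
    seen = set(history)
    if action not in seen and len(seen) < cap:
        return bonus
    return 0.0


def terminal_bonus(final_score: float) -> float:
    """Fix 5: grader-based bonus added to the last step's reward."""
    return (final_score - 0.5) * 5.0


# The iter 2 hack now costs reward on every step once the window fills ...
hack = n_cycle_penalty(["MEDITATE", "EXERCISE"] * 3)
# ... but a 3-cycle still slips past this particular check.
three_cycle = n_cycle_penalty(["A", "B", "C"] * 2)
```

For scale: an iter 2 style episode ending at `final_score = 0.22` would receive a terminal bonus of about -1.4, which is larger in magnitude than the +1.25 mean `env_reward` that episode type was earning per step.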
+
+ Plus training config: 400 → 800 steps, num_generations 4 → 8 (lower variance),
+ LoRA rank 8 → 16 (more capacity).
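For the belief-first format in fix 2 (`S M W ACTION_NAME`), a parser could look like the sketch below. It assumes each belief component is an integer on a 0-10 scale (inferred only from the "5 5 5" neutral default) and that action names are upper-case identifiers; `parse_completion` is a hypothetical helper, not the project's parser.

```python
import re

# Three integers followed by an upper-case action name, nothing else.
BELIEF_FIRST = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+([A-Z_]+)\s*$")


def parse_completion(text: str, valid_actions):
    """Return ((s, m, w), action) for a well-formed completion, else None."""
    match = BELIEF_FIRST.match(text)
    if match is None:
        return None
    belief = tuple(int(g) for g in match.group(1, 2, 3))
    action = match.group(4)
    # Assumed belief scale: integers from 0 to 10 inclusive.
    if action not in valid_actions or any(v > 10 for v in belief):
        return None
    return belief, action


ACTIONS = {"EXERCISE", "LEARN", "MEDITATE"}  # illustrative subset
```

Rejecting the old action-first order outright, rather than accepting both, keeps the format reward from paying out for completions where the belief cannot have influenced the action.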
+
+ **Hypothesis**: With grader-aligned reward + belief-first format + cycle
+ penalty + exploration bonus + terminal supervision, the agent should:
+ - Use ≥5 unique actions per episode (varied behavior)
+ - Maintain belief_accuracy > +0.30 (don't regress)
+ - Beat random in 2/3 conditions on final_score
+ - Show positive (or at least less negative) adaptation compared with the
+ baselines
+
+ **Result**: TBD when iter 3 completes (~30-40 min after submission).
+
+ ---
+
+ ## Spend tracker
+
+ | Iter | Cost | Steps | Outcome |
+ |---|---|---|---|
+ | 1 | ~$0.50 | 200 | Mode collapse to single action |
+ | 2 | ~$1.50 | 400 | Mode collapse to 2-cycle |
+ | 3 | ~$5 (est) | 800 | TBD |
+ | **Subtotal** | **~$7** | | |
+ | Budget | $30 | | $23 remaining |
+
+ ---
+
+ ## What we'll write up regardless of iter 3 outcome
+
+ The iteration journey itself is hackathon material. Even if iter 3 doesn't
+ hit the "trained > heuristic" bar, we have:
+
+ 1. **Working meta-RL infrastructure**: continuous profile space + belief
+ output + adaptation grader. Novel, defensible.
+ 2. **Clear post-mortem of failure modes**: most teams won't have this
+ honesty in their writeup.
+ 3. **Belief learning evidence**: even from iter 2, belief_accuracy +0.36
+ shows the agent IS learning to model users.
+ 4. **Reward design lessons**: the "constant reward → mode collapse" insight
+ is publishable in itself.
+
+ The blog post should lead with the *thesis* (meta-RL for personalization),
+ include the *journey* (iter 1 collapse → iter 2 partial escape → iter 3
+ fix), and frame whatever final result we get honestly.