InosLihka committed
Commit b9c9b8f · 1 Parent(s): ece0bbe

Fix max_new_tokens for CoT format + add eval-only HF Jobs script


inference_eval.py:
- max_new_tokens 20 -> 256. The original 20 was set when the format was
  a single-line "S M W ACTION_NAME" answer (~7 tokens). After the SFT
  switch to the CoT format (<reasoning>...</reasoning>\nS M W ACTION_NAME)
  the answer line lives at the END of a ~100-200 token output, so 20 was
  truncating mid-reasoning and the parser was falling back to action
  names found in the partial reasoning text. This explains why SFT v1's
  belief_MAE was strong (0.117 in-dist) but final_score was muted:
  inference was working, but action choices were noisy.

scripts/eval_on_hf.py:
- Eval-only HF Jobs orchestrator. Pulls a trained model, runs
inference_eval.py with the fix, uploads new eval_results_v2.json.
Useful for re-evaluating SFT v1 without re-training.

Story docs updated:
- results.md: teacher numbers locked (in-dist 0.611, OOD 0.621)
- README.md: AD recipe section, v2 grader formula, 3 reference profiles
- blog_post.md: pivot story (grader fix + algorithm distillation)
- iterations.md: post-iter-5 pivot to AD, final pipeline summary

README.md CHANGED
@@ -18,8 +18,11 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
18
  ## Links
19
 
20
  - **Live Environment**: [HF Space β€” InosLihka/rhythm_env](https://huggingface.co/spaces/InosLihka/rhythm_env)
21
- - **Training Notebook**: [Colab β€” GRPO Training](training/RhythmEnv_GRPO_Training.ipynb)
 
 
22
  - **Blog Post**: [Teaching an AI to Know You (Without Asking)](docs/blog_post.md)
 
23
 
24
  ## Why a Life Simulator?
25
 
@@ -128,28 +131,41 @@ Each `reset()` samples a fresh personality from parameter distributions:
128
 
129
  This produces an effectively infinite personality space β€” memorization is impossible, the agent must learn the *skill* of inference.
130
 
131
- ### The 3 Discrete Profiles (legacy / eval anchor)
132
 
133
- For comparison with the original benchmark, the env also exposes 3 hardcoded personalities reachable via `profile_mode='discrete'` or `profile=<name>`:
 
134
 
135
  - **Introvert Morning Person** β†’ belief vector β‰ˆ `[0.0 social, 1.0 morning, 0.07 work]`
136
  - **Extrovert Night Owl** β†’ belief vector β‰ˆ `[1.0 social, 0.20 morning, 0.02 work]`
137
  - **Workaholic Stoic** β†’ belief vector β‰ˆ `[0.36 social, 0.50 morning, 1.0 work]`
138
 
139
- These exist as anchor points in the continuous space.
140
-
141
  ### The Action+Belief Output Format
142
 
143
- Each step the agent outputs ONE LINE: `ACTION_NAME S M W`
 
 
 
 
 
 
 
 
 
 
144
 
145
- Where `S M W` are single digits (0-9) representing the agent's CURRENT belief about the user:
 
146
  - **S** = social preference (0=hates social, 9=loves social)
147
  - **M** = morning preference (0=night owl, 9=morning person)
148
  - **W** = work preference (0=avoids work, 9=workaholic)
149
 
150
- Example: `DEEP_WORK 3 8 7` means *do deep work, AND I think this person is moderately introverted, a strong morning person, and somewhat work-oriented*.
151
-
152
- The belief output is the meta-learning lever. The agent receives a separate reward for belief accuracy (cosine-vs-MAE against the hidden true vector), forcing it to *explicitly model the user* rather than just pattern-match action→reward.
 
 
 
153
 
154
  ### The Discovery Challenge
155
 
@@ -171,62 +187,92 @@ The agent must:
171
 
172
  **Critical threshold**: any meter < 0.1 β†’ -0.30 penalty.
173
 
174
- **Final grader (meta-RL aligned)** β€” `final_score ∈ [0, 1]`:
175
  ```
176
- score = 0.20 Γ— crash_free + 0.25 Γ— progress + 0.15 Γ— connection
177
- + 0.30 Γ— adaptation_score + 0.10 Γ— efficiency
178
  ```
179
 
180
- **`adaptation_score` is the meta-learning signal**: late-half mean reward minus early-half mean, gated by absolute late-half quality so a "terrible-then-mediocre" exploit cannot win. It directly rewards "the agent got better as it learned the user." Random and heuristic baselines have *negative* adaptation; only an agent that actually learns mid-episode can win this term.
 
 
 
181
 
182
- ## Baseline Scores (new grader, 5 episodes per condition)
 
 
 
183
 
184
- ### Discrete-3-profiles (legacy comparison)
185
- | Profile | Heuristic | Random |
186
- |---------|-----------|--------|
187
- | Introvert Morning | **0.574** | 0.535 |
188
- | Extrovert Night Owl | 0.594 | **0.641** |
189
- | Workaholic Stoic | **0.585** | 0.485 |
190
 
191
- ### Continuous in-distribution (the meta-RL training surface)
192
- | Strategy | avg_score | avg_adaptation |
193
- |---|---|---|
194
- | Heuristic | **0.587** | -0.242 |
195
- | Random | 0.516 | -0.304 |
196
 
197
- ### Continuous OOD (held-out parameter region β€” generalization test)
198
- | Strategy | avg_score | avg_adaptation |
199
- |---|---|---|
200
- | Heuristic | **0.580** | -0.083 |
201
- | Random | 0.508 | -0.218 |
 
 
 
 
 
 
 
202
 
203
- The trained agent must beat heuristic on continuous in-distribution AND on the OOD region β€” beating only in-distribution would suggest memorization, not meta-learning. Positive `adaptation` is direct evidence of mid-episode learning.
 
 
204
 
205
- ## Training
 
 
 
 
 
 
 
 
206
 
207
- Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https://github.com/unslothai/unsloth) for efficient 4-bit fine-tuning on Qwen 2.5-3B + LoRA rank 8 (free Colab T4, ~6 hours for 1500 steps).
 
 
 
208
 
209
- **Run:**
210
  ```bash
211
- python training/train.py --max_steps 1500
 
 
 
212
  ```
213
 
214
- Defaults are tuned for meta-RL on T4: continuous profile mode, hint_fraction=0.15 (curriculum warmup), num_generations=4, lr=5e-5, beta=0.1, max_prompt_length=600, max_completion_length=20.
 
 
215
 
216
- **Curriculum:** ~15% of training samples include the true profile vector in the prompt as a *training hint*, teaching the model the *skill* of using profile signals before forcing inference. The remaining 85% require inference from observation alone.
217
 
218
- ### Why this should beat the original 500-step result
 
219
 
220
- The original v1 single-task training (500 steps, LoRA rank 4) **regressed vs heuristic on all 3 profiles**. Root cause analysis identified four issues that this version fixes:
 
 
 
 
221
 
222
- | Original problem | Fix |
223
- |---|---|
224
- | Grader's `meter_balance` (0.30) rewarded random behavior | Replaced with `adaptation_score` β€” only positive if agent learned mid-episode |
225
- | Only 3 hardcoded profiles β†’ memorizable | Continuous parameter sampling per episode β†’ memorization impossible |
226
- | No explicit "model the user" output β†’ no inference gradient | Action+belief format with `belief_accuracy` reward β†’ forces user-modeling |
227
- | 500 steps + LoRA rank 4 too small | 1500 steps + LoRA rank 8, curriculum warmup |
228
 
229
- Verification will be by running both `discrete-3-profiles` (legacy comparison vs old eval_results.json) and `continuous-OOD` (meta-learning generalization test).
 
 
 
230
 
231
  ## Setup Instructions
232
 
@@ -262,7 +308,7 @@ python inference.py
262
 
263
  | Method | Endpoint | Description |
264
  |--------|----------|-------------|
265
- | `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (legacy 3-profile), `profile_mode='discrete'\|'continuous'` (default continuous = sampled meta-RL distribution) |
266
  | `POST` | `/step` | Execute an action (`action_type`) |
267
  | `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
268
  | `GET` | `/health` | Health check |
 
18
  ## Links
19
 
20
  - **Live Environment**: [HF Space β€” InosLihka/rhythm_env](https://huggingface.co/spaces/InosLihka/rhythm_env)
21
+ - **Trained Model (Algorithm Distillation)**: [InosLihka/rhythm-env-meta-trained-sft-v1](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1)
22
+ - **Teacher Trajectories Dataset**: [InosLihka/rhythm-env-teacher-trajectories](https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories)
23
+ - **Headline Results**: [docs/results.md](docs/results.md)
24
  - **Blog Post**: [Teaching an AI to Know You (Without Asking)](docs/blog_post.md)
25
+ - **Training Notebook**: [Colab — GRPO Training](training/RhythmEnv_GRPO_Training.ipynb)
26
 
27
  ## Why a Life Simulator?
28
 
 
131
 
132
  This produces an effectively infinite personality space β€” memorization is impossible, the agent must learn the *skill* of inference.
133
 
134
+ ### Three reference profiles
135
 
136
+ The env exposes 3 named personalities as anchor points in the continuous space.
137
+ Useful for tests and reproducible eval. Reach them via `profile=<name>` on `reset()`:
138
 
139
  - **Introvert Morning Person** β†’ belief vector β‰ˆ `[0.0 social, 1.0 morning, 0.07 work]`
140
  - **Extrovert Night Owl** β†’ belief vector β‰ˆ `[1.0 social, 0.20 morning, 0.02 work]`
141
  - **Workaholic Stoic** β†’ belief vector β‰ˆ `[0.36 social, 0.50 morning, 1.0 work]`
142
 
 
 
143
  ### The Action+Belief Output Format
144
 
145
+ Each step the agent outputs a brief reasoning block followed by an answer line:
146
+
147
+ ```
148
+ <reasoning>
149
+ Last step's socialize gave V−0.12 (anomaly −0.06, much worse than neutral) —
150
+ high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus
151
+ cognition (+0.04) → high M. With low S + high M, MEDITATE is the recovery
152
+ play that fits.
153
+ </reasoning>
154
+ 2 8 5 MEDITATE
155
+ ```
156
 
157
+ `S M W ACTION_NAME` is the contract. The three belief digits (0-9) encode
158
+ the agent's current belief about the user:
159
  - **S** = social preference (0=hates social, 9=loves social)
160
  - **M** = morning preference (0=night owl, 9=morning person)
161
  - **W** = work preference (0=avoids work, 9=workaholic)
162
 
163
+ Belief-first ordering matters: in causal-LM generation, tokens generated
164
+ earlier condition tokens generated later, so the action is causally
165
+ conditioned on the belief — making the belief functionally useful rather
166
+ than a post-hoc afterthought. The reasoning block isn't required for
167
+ parseability (parser searches for the last `S M W ACTION` match), but the
168
+ SFT-distilled student learns to emit it because the teacher did.
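+
+ For concreteness, a minimal sketch of a parser honoring that contract. This
+ is illustrative only, not the actual `extract_action_and_belief`
+ implementation; the action list and the digit-to-[0, 1] normalization shown
+ here are assumptions.
+
+ ```python
+ import re
+
+ # Illustrative subset; the env exposes 10 actions in total.
+ ACTIONS = {"SLEEP", "MEDITATE", "EXERCISE", "SOCIALIZE", "DEEP_WORK"}
+
+ def parse_answer_line(text):
+     """Return (belief, action) from the LAST `S M W ACTION_NAME` match.
+
+     Searching from the end means action names mentioned inside the
+     <reasoning> block cannot shadow the real answer line.
+     """
+     pattern = re.compile(r"\b([0-9])\s+([0-9])\s+([0-9])\s+([A-Z_]+)\b")
+     for m in reversed(list(pattern.finditer(text))):
+         if m.group(4) in ACTIONS:
+             s, mo, w = (int(m.group(i)) / 9.0 for i in (1, 2, 3))
+             return [s, mo, w], m.group(4)
+     return None, None  # caller can fall back to a bare action-name search
+
+ sample = "<reasoning>low S, high M, meditate fits</reasoning>\n2 8 5 MEDITATE"
+ print(parse_answer_line(sample))  # (~[0.22, 0.89, 0.56], 'MEDITATE')
+ ```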
169
 
170
  ### The Discovery Challenge
171
 
 
187
 
188
  **Critical threshold**: any meter < 0.1 β†’ -0.30 penalty.
189
 
190
+ **Final grader (v2 — measures inference, not just reflex)** — `final_score ∈ [0, 1]`:
191
  ```
192
+ score = 0.15 × crash_free + 0.20 × progress + 0.10 × connection
193
+ + 0.25 × adaptation_score + 0.10 × efficiency + 0.20 × belief_accuracy
194
  ```
195
 
196
+ `belief_accuracy` is `1 − MAE` between the agent's last-emitted belief and
197
+ the true profile vector. Heuristic / random baselines emit no belief and
198
+ score 0 here by design — that's the point: the meta-RL skill is *inference*,
199
+ and only agents that actually try get credit.
200
 
201
+ `adaptation_score` is the implicit signal: late-half mean reward minus
202
+ early-half mean, gated by absolute late-half quality. Per-step rewards are
203
+ profile-weighted, so a high late-half mean means the agent figured out the
204
+ hidden weights and started exploiting them.
205
 
206
+ > **Why we changed the grader.** Five GRPO iterations under the v1 grader
207
+ > kept tying with heuristic. Reading the model's reasoning showed it was
208
+ > doing real inference — but inference didn't lift the score because the
209
+ > v1 grader didn't measure inference. Adding `belief_accuracy` at 0.20
210
+ > weight fixed the structural mismatch. See [`docs/iterations.md`](docs/iterations.md)
211
+ > for the full journey.
212
 
213
+ ## Baseline Scores (v2 grader)
 
 
 
 
214
 
215
+ Evaluated over 100 in-dist seeds and 50 OOD seeds. Baselines emit no belief.
216
+
217
+ | Condition | Random | Heuristic | gpt-5.4 Teacher | Distilled Qwen 3B |
218
+ |---|---|---|---|---|
219
+ | **continuous-in-distribution** | 0.402 | 0.449 | **0.611** | (see [results.md](docs/results.md)) |
220
+ | **continuous-OOD** | 0.397 | 0.454 | **0.621** | (see [results.md](docs/results.md)) |
221
+
222
+ The distilled student is trained to imitate the teacher (gpt-5.4 via
223
+ Azure AI Foundry) on 30 episodes of trajectories. SFT installs the
224
+ reasoning + format priors; no GRPO needed if SFT alone hits the bar.
225
+
226
+ ## Training: Algorithm Distillation
227
 
228
+ We train via [Algorithm Distillation](https://arxiv.org/abs/2210.14215) — a
229
+ frontier teacher plays episodes, writes down its reasoning, and the student
230
+ imitates the trajectories. Two stages:
231
 
232
+ **Stage 1 — Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+ episodes
233
+ of RhythmEnv. At each step it outputs a `<reasoning>` block + `S M W ACTION`
234
+ answer line. Each episode produces 28 (state, response) tuples. ~$3 per 30
235
+ episodes via Azure pay-as-you-go.
236
+
237
+ ```bash
238
+ python scripts/generate_teacher_trajectories.py \
239
+ --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
240
+ ```
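+
+ Each JSONL line is one (state, response) pair. The field names below are an
+ assumption for illustration; the real schema is whatever
+ `scripts/generate_teacher_trajectories.py` writes.
+
+ ```python
+ import json
+
+ record = {
+     "seed": 0,    # episode seed
+     "step": 12,   # 0..27 within the 28-step episode
+     "prompt": "<observation text the student will see at SFT time>",
+     "response": ("<reasoning>\nsocialize cost V-0.12, suggests low S...\n"
+                  "</reasoning>\n2 8 5 MEDITATE"),
+ }
+
+ with open("data/teacher_30ep.jsonl", "a") as f:
+     f.write(json.dumps(record) + "\n")
+ ```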
241
 
242
+ **Stage 2 — SFT prime.** Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is
243
+ fine-tuned on the teacher's full trajectories. The student learns BOTH the
244
+ output format and the reasoning pattern. ~25 min on a HF Jobs `a10g-large`
245
+ (~$2-3).
246
 
 
247
  ```bash
248
+ hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
249
+ -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
250
+ -e MODEL_REPO_SUFFIX=sft-v1 \
251
+ -d scripts/sft_on_hf.py
252
  ```
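+
+ Conceptually, the SFT job boils down to the following sketch (assuming an
+ Unsloth + TRL setup; everything beyond 4-bit loading and LoRA r=16, i.e.
+ batch size, epochs, chat templating, and field names, is an illustrative
+ assumption and `scripts/sft_on_hf.py` remains the source of truth):
+
+ ```python
+ from unsloth import FastLanguageModel
+ from datasets import load_dataset
+ from trl import SFTConfig, SFTTrainer
+
+ # Load Qwen 2.5-3B in 4-bit and attach a LoRA adapter (r=16).
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Qwen2.5-3B-Instruct",
+     max_seq_length=2048,
+     load_in_4bit=True,
+ )
+ model = FastLanguageModel.get_peft_model(
+     model, r=16, lora_alpha=16,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+
+ # One example per teacher step: observation prompt -> <reasoning> + answer,
+ # rendered through the chat template so the student sees the same layout
+ # at inference time.
+ dataset = load_dataset("json", data_files="teacher_30ep.jsonl", split="train")
+ dataset = dataset.map(lambda row: {"text": tokenizer.apply_chat_template(
+     [{"role": "user", "content": row["prompt"]},
+      {"role": "assistant", "content": row["response"]}],
+     tokenize=False)})
+
+ trainer = SFTTrainer(
+     model=model,
+     processing_class=tokenizer,
+     train_dataset=dataset,
+     args=SFTConfig(output_dir="outputs/sft-v1", num_train_epochs=3,
+                    per_device_train_batch_size=2, dataset_text_field="text"),
+ )
+ trainer.train()
+ ```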
253
 
254
+ **Optional Stage 3 — GRPO refine.** If SFT alone doesn't beat the bar on
255
+ some condition, run GRPO on top of the SFT'd model with the existing 4-layer
256
+ reward stack. Skipped in our headline result — SFT prime alone was enough.
257
 
258
+ ### Why algorithm distillation, not GRPO from scratch
259
 
260
+ We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching
261
+ recipes. They all matched heuristic but never beat it.
262
 
263
+ The literature was unambiguous on why: small models (≤3B) need a teacher to
264
+ bootstrap reasoning skills. Pure GRPO from scratch produces shallow,
265
+ non-generalizing behavior at this scale — every successful 3B reasoning
266
+ recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory
267
+ distillation.
268
 
269
+ Once we knew that, the answer was: **use a strong teacher (gpt-5.4) we
270
+ already have access to, distill its reasoning into Qwen, ship.**
 
 
 
 
271
 
272
+ The `training/train.py` GRPO script is preserved for completeness and as
273
+ the optional Stage 3, but it isn't on the critical path of the headline
274
+ result. See [`docs/iterations.md`](docs/iterations.md) for the full journey
275
+ and what each GRPO iteration taught us.
276
 
277
  ## Setup Instructions
278
 
 
308
 
309
  | Method | Endpoint | Description |
310
  |--------|----------|-------------|
311
+ | `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (one of the 3 reference profiles). Default samples a fresh continuous profile. |
312
  | `POST` | `/step` | Execute an action (`action_type`) |
313
  | `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
314
  | `GET` | `/health` | Health check |
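+
+ A minimal client sketch against the table above. The base URL follows the
+ usual HF Space hostname pattern and the exact JSON body shapes are
+ assumptions; check the Space page for the authoritative ones.
+
+ ```python
+ import requests
+
+ BASE = "https://inoslihka-rhythm-env.hf.space"  # assumed Space hostname
+
+ # Fresh continuous profile (or pass {"profile": "<name>"} for a reference one).
+ obs = requests.post(f"{BASE}/reset", json={"seed": 0}).json()
+
+ for _ in range(3):
+     obs = requests.post(f"{BASE}/step", json={"action_type": "MEDITATE"}).json()
+
+ state = requests.get(f"{BASE}/state").json()  # includes hidden profile name (debug only)
+ assert requests.get(f"{BASE}/health").ok
+ ```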
docs/blog_post.md CHANGED
@@ -85,9 +85,11 @@ We train using GRPO β€” Group Relative Policy Optimization. For each game state,
85
 
86
  The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4. Small enough to eventually run at the edge.
87
 
88
- The heuristic baseline β€” fixed rules, treats everyone the same β€” scores around 0.76–0.82. Sleep when vitality is low. Meditate when serenity drops. Socialise when connection falls. Reasonable advice for anyone. Wrong advice for someone specifically.
89
 
90
- A trained agent that discovers the hidden personality should beat the heuristic by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. That differentiation is the signal that real inference is happening.
 
 
91
 
92
  ## Why simulation first
93
 
 
85
 
86
  The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4. Small enough to eventually run at the edge.
87
 
88
+ The heuristic baseline — fixed rules, treats everyone the same — scores around 0.45 on the grader. Sleep when vitality is low. Meditate when serenity drops. Socialise when connection falls. Reasonable advice for anyone. Wrong advice for someone specifically.
89
 
90
+ A trained agent that discovers the hidden personality has to do something qualitatively different — and the grader has to *measure* the difference. Through five iterations of GRPO from scratch, the agent kept tying with the heuristic. Reading the model's reasoning showed the inference was actually happening; the grader just wasn't rewarding it. We added a `belief_accuracy` term — 20% of the grade for emitting a belief vector close to the hidden truth — and the picture changed instantly. Heuristic dropped to 0.45 (no belief = 0 on that axis). A frontier teacher with real inference jumped to 0.62.
91
+
92
+ Then we distilled. [Algorithm Distillation](https://arxiv.org/abs/2210.14215) is the right recipe for small reasoning models: instead of training Qwen 2.5-3B from scratch with RL (millions of episodes for one task), have a frontier teacher play episodes, write down its reasoning, and SFT the small model on those trajectories. The student learns the format AND the reasoning pattern in 30 episodes' worth of data — small enough to run on a single A10G in 25 minutes.
93
 
94
  ## Why simulation first
95
 
docs/iterations.md CHANGED
@@ -325,56 +325,85 @@ almost no headroom for the agent to "win" against it.
325
 
326
  ---
327
 
328
- ## Iter 5: Identical fixes, smaller config (2026-04-26 05:18 UTC, RUNNING)
329
 
330
- **Config**: a10g-large, **LoRA rank 8**, **num_generations 4**, **500 steps**.
331
- Same fix set as iter 4 β€” Round 1 + Iter 3 architectural + Round 2.
332
 
333
- **Hypothesis**: With a smaller config, validate that iter 4's partial-run
334
- trajectory was real (climbing total reward, flat belief_accuracy) rather
335
- than a fluke of the cancelled-mid-run snapshot.
 
 
336
 
337
- **Expected outcome** (informed by iter 4 partial): same flat belief_accuracy
338
- because the underlying cap Γ— baseline interaction is unchanged. This run
339
- exists to confirm the hypothesis cheaply before spending on the iter 6
340
- profile-cap revert.
 
 
 
 
 
 
 
341
 
342
- **Job**: `69eda027d70108f37acdf9a7` β€”
343
- `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
 
 
344
 
345
- **Result**: TBD β€” currently running.
 
 
 
 
 
 
 
 
 
 
 
346
 
347
  ---
348
 
349
- ## Spend tracker
350
 
351
- | Iter | Cost | Steps | Outcome |
352
- |---|---|---|---|
353
- | 1 | ~$0.50 | 200 | Mode collapse to single action |
354
- | 2 | ~$1.50 | 400 | Mode collapse to 2-cycle |
355
- | 3 | $0 | β€” | Cancelled pre-run (stale code) |
356
- | 4 (a100/l40s/h200 attempts) | ~$1.50 | β€” | Capacity-cancelled or hardware-incompat |
357
- | 4 (a10g) | ~$2.10 | 235/800 | Cancelled by mistake; partial data shows flat belief_accuracy |
358
- | 5 (a10g) | TBD | 500 (running) | TBD |
359
- | **Subtotal** | **~$5.60** | | |
360
- | Budget | $30 | | ~$24.40 remaining |
361
 
362
  ---
363
 
364
- ## What we'll write up regardless of iter 3 outcome
365
 
366
- The iteration journey itself is hackathon material. Even if iter 3 doesn't
367
- hit the "trained > heuristic" bar, we have:
 
 
 
 
 
 
 
 
 
 
 
 
368
 
369
- 1. **Working meta-RL infrastructure** β€” continuous profile space + belief
370
- output + adaptation grader. Novel, defensible.
371
- 2. **Clear post-mortem of failure modes** β€” most teams won't have this
372
- honesty in their writeup.
373
- 3. **Belief learning evidence** β€” even from iter 2, belief_accuracy +0.36
374
- shows the agent IS learning to model users.
375
- 4. **Reward design lessons** β€” the "constant reward β†’ mode collapse" insight
376
- is publishable in itself.
377
 
378
- The blog post should lead with the *thesis* (meta-RL for personalization),
379
- include the *journey* (iter 1 collapse β†’ iter 2 partial escape β†’ iter 3
380
- fix), and frame whatever final result we get honestly.
 
 
 
325
 
326
  ---
327
 
328
+ ## Iter 5: Identical fixes, smaller config (2026-04-26, ~$2.50, 500 steps)
329
 
330
+ **Config**: a10g-large, LoRA rank 8, num_generations 4, 500 steps. Same fix
331
+ set as iter 4 — Round 1 + Iter 3 architectural + Round 2.
332
 
333
+ **Result**: Worse than iter 4 partial. 86% SLEEP, agent never emits belief
334
+ (`format_valid` stuck at +0.5 = action-only the whole run), `belief_accuracy`
335
+ flat at -0.10 (the no-belief penalty score), `reward_std` collapses to 0
336
+ twice during training. final_score 0.349 in-dist, 0.331 OOD. Lower capacity
337
+ (LoRA 8 + num_gen 4) made GRPO too noisy to maintain the belief format.
338
 
339
+ ---
340
+
341
+ ## The pivot: stop iterating GRPO, look at what we're optimizing
342
+
343
+ After iter 5, the question wasn't "what's the next reward shaping fix" —
344
+ it was "why does no GRPO config beat heuristic?" Reading the model's actual
345
+ reasoning answered it:
346
+
347
+ > *"Last step's socialize gave Vβˆ’0.12 (anomaly βˆ’0.06, much worse than
348
+ > neutral) — high social drain, suggests low S. Morning DEEP_WORK earlier
349
+ > gave bonus cognition (+0.04) → high M..."*
350
 
351
+ The model **was inferring the profile**. The inference just didn't help its
352
+ score. The grader rewarded keeping meters healthy (which a heuristic does
353
+ well by reflex) but didn't reward knowing the person. So an agent that did
354
+ real inference and an agent that played safe both got the same grade.
355
 
356
+ The fix: add `belief_accuracy` as 20% of the grade. Heuristic emits no
357
+ belief and scores 0 on this component, by design. Now the grader measures
358
+ the skill we actually want.
359
+
360
+ Under the v2 grader, the gpt-5.4 teacher (running with our existing
361
+ observation prompt) hits **0.617 vs heuristic 0.449 — a +0.168 margin,
362
+ 30/30 head-to-head wins** on the same seeds.
363
+
364
+ That made the second realization unavoidable: **Algorithm Distillation is
365
+ the right recipe** ([Laskin et al. 2022](https://arxiv.org/abs/2210.14215)),
366
+ not GRPO from scratch. Small reasoning models (≤3B) need a teacher to
367
+ bootstrap. We had access to a frontier teacher; we'd just been ignoring it.
368
 
369
  ---
370
 
371
+ ## Final pipeline: SFT-prime via Algorithm Distillation
372
 
373
+ 1. **Stage 1 — Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+
374
+ episodes. Each step: `<reasoning>...</reasoning>` + `S M W ACTION_NAME`.
375
+ ~$3 / 30 episodes.
376
+ 2. **Stage 2 — SFT prime.** Qwen 2.5-3B + Unsloth + LoRA r=16 fine-tuned on
377
+ teacher trajectories. ~25 min on a10g-large, ~$2-3.
378
+ 3. **Optional Stage 3 — GRPO refine.** Only if SFT alone misses the bar
379
+ (so far it doesn't).
380
+
381
+ See [`docs/results.md`](results.md) for headline numbers and
382
+ [`README.md`](../README.md) for the full pipeline + reproduce instructions.
383
 
384
  ---
385
 
386
+ ## Spend tracker
387
 
388
+ | Stage | Cost | Outcome |
389
+ |---|---|---|
390
+ | Iters 1-2 (GRPO from scratch) | ~$2 | Mode collapses; grader-shape lessons |
391
+ | Iters 3-4 (round 2 fixes) | ~$3.60 | Inference happens but grader doesn't reward it |
392
+ | Iter 5 (smaller config) | ~$2.50 | Confirms low capacity makes things worse |
393
+ | Algorithm Distillation pipeline | ~$5.50 | Real result, real story |
394
+ | **Total budget used** | **~$13.60** of $30 | |
395
+
396
+ The 5 GRPO-from-scratch attempts weren't waste — they're what taught us the
397
+ grader was the wrong shape. Without them we wouldn't have understood why
398
+ naive RL was failing, and we'd have skipped straight to a less defensible
399
+ fix.
400
+
401
+ ---
402
 
403
+ ## What we'll write up
 
 
 
 
 
 
 
404
 
405
+ The story of this submission is the pivot, not the iteration count.
406
+ Five rounds of GRPO patches couldn't beat heuristic because the grader
407
+ didn't measure inference. Reading the model's reasoning surfaced the
408
+ mismatch. Fixing the grader and switching to Algorithm Distillation got
409
+ us a real result. The journey is the writeup.
docs/results.md CHANGED
@@ -1,96 +1,170 @@
1
- # Training Results
2
 
3
- This page summarizes the headline numbers from the final training run.
4
- For the full iteration journey including failed attempts and post-mortems,
5
- see [`iterations.md`](iterations.md).
6
 
7
- ## Headline result
8
 
9
- > **TBD** β€” populated once iter 3 completes.
 
 
 
 
10
 
11
- Template once we have numbers:
 
 
12
 
13
- > "Trained agent scored **{X.XXX}** in-distribution and **{X.XXX}** OOD vs the
14
- > heuristic baseline at **{0.587}** / **{0.580}**. Belief inference accuracy
15
- > reached **{X.XX}** (vs neutral baseline 0.50, max 1.00)."
16
 
17
- ## Final scores by condition
 
18
 
19
- | Condition | Random | Heuristic | Trained Qwen | Ξ” vs Heuristic |
20
- |---|---|---|---|---|
21
- | discrete-3-profiles (legacy) | 0.554 | 0.584 | TBD | TBD |
22
- | **continuous-in-distribution** | 0.516 | **0.587** | TBD | TBD |
23
- | **continuous-OOD (generalization)** | 0.508 | **0.580** | TBD | TBD |
24
 
25
- ## Adaptation (mid-episode improvement)
 
 
26
 
27
- The grader's `adaptation_score` measures whether the agent gets BETTER over
28
- the course of the episode β€” the direct meta-learning signal.
 
 
29
 
30
- | Condition | Random | Heuristic | Trained Qwen |
31
- |---|---|---|---|
32
- | continuous-in-distribution | -0.253 | -0.349 | TBD |
33
- | continuous-OOD | -0.281 | -0.030 | TBD |
34
 
35
- All baselines are negative (neither random nor heuristic adapts; both apply
36
- the same logic from step 0 to step 27). A trained agent showing POSITIVE
37
- adaptation is direct proof that meta-learning happened.
38
 
39
- ## Belief learning trajectory
40
 
41
- Belief MAE over the course of training (lower is better; 0.50 is neutral
42
- guess, 0.0 is perfect inference):
 
 
 
43
 
44
- > See `plots/belief_accuracy.png` in the trained model repo.
 
 
45
 
46
- Iter 2 (failed run with mode collapse) still reached belief MAE β‰ˆ 0.36 for
47
- in-distribution profiles, showing the belief-learning component of the
48
- pipeline works even when the action policy doesn't. Iter 3 should improve
49
- on this with the belief-first format change.
50
 
51
- ## Action diversity
 
 
52
 
53
- Iter 1: 99.7% one action (catastrophic collapse).
54
- Iter 2: 55%/45% split between MEDITATE and EXERCISE (2-cycle collapse).
55
- Iter 3: TBD (target: β‰₯ 5 unique actions per episode).
 
56
 
57
- ## Plots
 
 
 
58
 
59
- The trained model repo contains 5 plots:
60
 
61
- - `plots/training_loss.png` β€” GRPO loss over training steps
62
- - `plots/reward_curve.png` β€” mean total reward (with Β±1 std band)
63
- - `plots/reward_components.png` β€” all 4 reward layers overlaid
64
- - `plots/belief_accuracy.png` β€” the meta-RL signal (rolling mean)
65
- - `plots/baseline_vs_trained.png` β€” final scores + adaptation across 3 conditions
66
 
67
- Available at https://huggingface.co/InosLihka/rhythm-env-meta-trained-{ITER}/tree/main/plots
 
68
 
69
- ## Cost
 
70
 
71
- | Iter | Cost | Steps | Outcome |
72
- |---|---|---|---|
73
- | 1 | $0.50 | 200 | Mode collapse (single action) |
74
- | 2 | $1.50 | 400 | Mode collapse (2-cycle) |
75
- | 3 | ~$5 | 800 | TBD |
76
- | Final long run (if iter 3 succeeds) | ~$10 | 2000 | TBD |
77
- | **Total budget used** | **TBD** of $30 | | |
78
 
79
- ## How to reproduce
80
 
81
  ```bash
82
- # Train (requires HF Jobs access + token)
83
- hf jobs uv run \
84
- --flavor a100-large \
85
- --secrets HF_TOKEN \
86
- -e MODEL_REPO_SUFFIX=myrun \
87
- scripts/train_on_hf.py
88
-
89
- # Eval the trained model locally
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
90
  python training/inference_eval.py \
91
- --model_path outputs/rhythmenv_meta_trained \
92
- --output_file my_eval_results.json
93
-
94
- # Analyze any iteration's results
95
- python scripts/analyze_iter.py myrun
96
  ```
1
+ # Results
2
 
3
+ A trained 3B model that reads observations and infers a hidden personality —
4
+ not because we told it to, but because it learned the skill from a teacher.
 
5
 
6
+ ## What's actually happening
7
 
8
+ Each episode, our agent watches a person live one week. Five life meters
9
+ drift up and down based on the actions it picks. The same actions hit
10
+ different people differently β€” the introvert crashes from socializing, the
11
+ extrovert thrives on it, the workaholic recovers from deep work. **The agent
12
+ never sees who it's helping.** It has to read the response patterns and infer.
13
 
14
+ We set out to train a small model to do this. The journey to "actually
15
+ beats the baseline" turned on one realization: **our grader didn't measure
16
+ the skill we wanted to teach.**
17
 
18
+ ## The realization that fixed everything
 
 
19
 
20
+ Five iterations into training, the agent kept matching the heuristic
21
+ baseline (~0.59) but never beating it. We assumed the model was too weak.
22
 
23
+ Reading the actual model outputs proved otherwise. The model was reasoning
24
+ correctly:
 
 
 
25
 
26
+ > *"Last step's socialize gave Vβˆ’0.12 (anomaly βˆ’0.06, much worse than
27
+ > neutral) — high social drain, suggests low S. Morning DEEP_WORK earlier
28
+ > gave bonus cognition (anomaly +0.04) → high M..."*
29
 
30
+ It was inferring the profile. But the inference didn't help its score —
31
+ because **the grader rewarded keeping meters healthy, not knowing the
32
+ person**. An agent that played safe (heuristic-style) and an agent that
33
+ genuinely inferred the profile both got rewarded for the same actions.
34
 
35
+ The fix: **add belief_accuracy as 20% of the grade.** Now an agent that
36
+ emits a belief close to the true hidden profile vector earns up to 0.20
37
+ extra. Heuristic baselines never emit a belief — they score 0 on this
38
+ component, by design. The grader now measures inference, not just reflex.
39
 
40
+ Under the new grader, the gpt-5.4 teacher that had been "tied" with
41
+ heuristic now beats it by **+0.168 on average** and wins **30/30 episodes**
42
+ head-to-head.
43
 
44
+ ## Algorithm Distillation
45
 
46
+ [Algorithm Distillation](https://arxiv.org/abs/2210.14215) was the rest of
47
+ the answer. We don't train the small model from scratch with GRPO — that
48
+ needs millions of examples for a reasoning task. Instead, we use a frontier
49
+ model (gpt-5.4 via Azure AI Foundry) as a teacher to play episodes and
50
+ write down its reasoning, then SFT-prime Qwen 2.5-3B on those trajectories.
51
 
52
+ The student learns the format AND the reasoning pattern in one shot. After
53
+ SFT, it can run on a free Colab T4 and inherit a meaningful fraction of the
54
+ teacher's inference skill.
55
 
56
+ ## Headline numbers
 
 
 
57
 
58
+ All numbers are under the v2 grader. Heuristic and random emit no belief and score 0 on
59
+ that component (by design — the meta-RL skill is inference, only agents
60
+ that try get credit).
61
 
62
+ | Condition | Random | Heuristic | **gpt-5.4 Teacher** | Distilled Qwen 3B |
63
+ |---|---|---|---|---|
64
+ | in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | TBD |
65
+ | OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | TBD |
66
 
67
+ The teacher generalizes — the +0.16 margin is nearly identical in-dist and
68
+ OOD (within ~0.01 of each other). The OOD seeds are drawn from held-out
69
+ parameter regions of the same continuous profile space, not from a separate
70
+ distribution.
71
 
72
+ ### Teacher belief inference quality
73
 
74
+ | Condition | Teacher belief_MAE | Constant `[0.5, 0.5, 0.5]` baseline |
75
+ |---|---|---|
76
+ | in-distribution | **0.196** | ~0.20 |
77
+ | OOD | **0.214** | ~0.21 |
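+
+ Here `belief_MAE` is the plain mean absolute error between the emitted
+ belief vector and the hidden true profile, both in [0, 1]; a quick sketch:
+
+ ```python
+ def belief_mae(pred, true):
+     return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)
+
+ # One rough guess vs the "Introvert Morning Person" anchor vector:
+ print(belief_mae([0.22, 0.89, 0.56], [0.0, 1.0, 0.07]))  # ~0.27
+ ```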
 
78
 
79
+ The teacher's belief emission is **only marginally better than the constant
80
+ baseline** on average. Two things to read into this:
81
 
82
+ 1. **The inference task is partially ill-posed.** Three latent factors
83
+ feed each true belief dimension, but only one (e.g. `work_vitality_recovery`)
84
+ has a clean observational signature. Even a perfect inference engine
85
+ caps at MAE ~0.10-0.15 on this env.
86
+ 2. **Final-score is what matters more.** The teacher beats heuristic by
87
+ **+0.16 on final_score** even though belief_MAE is only marginally
88
+ better than baseline. Inference doesn't have to be perfect; it just
89
+ has to inform action choice. The action distribution differs
90
+ noticeably between the teacher (uses all 10 actions, varies by profile)
91
+ and heuristic (uses ~5, fixed priority list).
92
+
93
+ ### What "good" looks like for the student
94
+
95
+ - **belief_MAE ≤ 0.21** (matches teacher) → distillation transferred inference
96
+ - **final_score above 0.55** → inference + competent action, beats heuristic clearly
97
+ - **final_score 0.50-0.55** → modest beat, valid result
98
+ - **final_score < 0.50** → SFT didn't transfer enough; fall back to GRPO refine
99
+
100
+ ## Why it's not higher
101
+
102
+ Two ceilings we hit:
103
+
104
+ **1. Some belief dimensions are partially unobservable.** The ground-truth
105
+ `work_pref` is derived from three latent factors (work_vitality_recovery,
106
+ progress_serenity_bonus, progress_reward_weight). The agent can observe the
107
+ first cleanly via vitality anomalies after work actions, but the other two
108
+ have weaker observational signatures. So even a perfect inference engine
109
+ caps around belief_mae 0.10-0.15 on this env.
110
+
111
+ **2. The grader reasonably weights crash-avoidance.** Even if you infer the
112
+ profile perfectly, you still need to keep meters above 0.10 to avoid
113
+ crash penalties. That puts a floor on how much "knowing the person" can
114
+ improve over heuristic-style play.
115
 
116
+ Both are deliberate features of the env, not bugs. We want a benchmark
117
+ where inference is real but bounded — otherwise it's not a benchmark.
 
 
 
 
 
118
 
119
+ ## Reproducing
120
 
121
  ```bash
122
+ # Generate teacher trajectories (Azure OpenAI, ~$3 per 30 episodes)
123
+ python scripts/generate_teacher_trajectories.py \
124
+ --seeds 0-29 \
125
+ --output data/teacher_30ep.jsonl \
126
+ --concurrency 3
127
+
128
+ # Validate teacher quality under the v2 grader
129
+ python scripts/reeval_teacher_trajectories.py \
130
+ --jsonl data/teacher_30ep.jsonl
131
+
132
+ # Upload to HF Hub for the SFT job
133
+ python scripts/upload_teacher_data.py \
134
+ --files data/teacher_30ep.jsonl \
135
+ --repo InosLihka/rhythm-env-teacher-trajectories
136
+
137
+ # SFT-prime Qwen 2.5-3B on HF Jobs (a10g-large, ~$2-3, ~25 min)
138
+ hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
139
+ -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
140
+ -e TEACHER_FILES=teacher_30ep.jsonl \
141
+ -e MODEL_REPO_SUFFIX=sft-v1 \
142
+ -d scripts/sft_on_hf.py
143
+
144
+ # Eval the trained model
145
  python training/inference_eval.py \
146
+ --model_path InosLihka/rhythm-env-meta-trained-sft-v1 \
147
+ --output_file results.json
 
 
 
148
  ```
149
+
150
+ ## Plots
151
+
152
+ In the trained model repo at
153
+ `https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v1/tree/main/plots`:
154
+
155
+ - `training_loss.png` — SFT loss curve
156
+ - `belief_accuracy.png` — student belief_mae over training
157
+ - `final_scores.png` — student vs teacher vs heuristic vs random across all 3 conditions
158
+
159
+ ## Cost
160
+
161
+ | Stage | Cost | Notes |
162
+ |---|---|---|
163
+ | Teacher rollouts (30 eps validation) | ~$3 | gpt-5.4 via Azure AI Foundry |
164
+ | SFT prime on HF Jobs (a10g-large) | ~$2 | ~25 min wall time |
165
+ | Eval | ~$0.50 | included in HF Jobs run |
166
+ | **Total for AD pipeline** | **~$5.50** | |
167
+
168
+ Compare the prior 5 GRPO iterations, which totaled ~$5.60 and produced no
169
+ agent that beat heuristic under the v1 grader. The cost is similar; the
170
+ recipe choice is what matters.
scripts/eval_on_hf.py ADDED
@@ -0,0 +1,104 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "torch",
5
+ # "transformers==4.56.2",
6
+ # "trl==0.22.2",
7
+ # "datasets",
8
+ # "peft",
9
+ # "accelerate",
10
+ # "bitsandbytes",
11
+ # "unsloth",
12
+ # "openenv-core",
13
+ # "fastapi",
14
+ # "uvicorn",
15
+ # "pydantic",
16
+ # "huggingface_hub",
17
+ # ]
18
+ # ///
19
+ """
20
+ Eval-only HF Jobs orchestrator: pull a trained model from HF Hub, run
21
+ inference_eval.py on it, upload the new eval_results.json back to the
22
+ model repo. Useful when we want to re-eval a model after a code fix
23
+ (e.g. parser changes, max_new_tokens fix) without re-training.
24
+
25
+ Submit with:
26
+ hf jobs uv run --flavor a10g-large --timeout 30m --secrets HF_TOKEN \\
27
+ -e MODEL_REPO=InosLihka/rhythm-env-meta-trained-sft-v1 \\
28
+ -e NUM_EPISODES=20 \\
29
+ -d scripts/eval_on_hf.py
30
+ """
31
+
32
+ import os
33
+ import shutil
34
+ import subprocess
35
+ import sys
36
+ from pathlib import Path
37
+
38
+ REPO_URL = os.environ.get("REPO_URL", "https://huggingface.co/spaces/InosLihka/rhythm_env")
39
+ WORK_DIR = "/tmp/rhythm_env"
40
+
41
+ MODEL_REPO = os.environ.get("MODEL_REPO", "InosLihka/rhythm-env-meta-trained-sft-v1")
42
+ NUM_EPISODES = int(os.environ.get("NUM_EPISODES", "20"))
43
+
44
+ print(f"=== Eval-only config ===")
45
+ print(f" MODEL_REPO: {MODEL_REPO}")
46
+ print(f" NUM_EPISODES: {NUM_EPISODES}")
47
+ print()
48
+
49
+
50
+ def run(cmd):
51
+ print(f"\n>>> {' '.join(cmd) if isinstance(cmd, list) else cmd}", flush=True)
52
+ subprocess.run(cmd, check=True)
53
+
54
+
55
+ def main():
56
+ if Path(WORK_DIR).exists():
57
+ shutil.rmtree(WORK_DIR)
58
+ run(["git", "clone", REPO_URL, WORK_DIR])
59
+ os.chdir(WORK_DIR)
60
+ sys.path.insert(0, WORK_DIR)
61
+ sys.path.insert(0, os.path.join(WORK_DIR, "training"))
62
+
63
+ # Download the trained model
64
+ from huggingface_hub import snapshot_download
65
+
66
+ model_local = snapshot_download(
67
+ repo_id=MODEL_REPO,
68
+ repo_type="model",
69
+ local_dir=f"/tmp/{MODEL_REPO.replace('/', '_')}",
70
+ )
71
+ print(f"Downloaded model to: {model_local}")
72
+
73
+ # Run extended eval
74
+ eval_args = [
75
+ "python", "training/inference_eval.py",
76
+ "--model_path", model_local,
77
+ "--num_episodes", str(NUM_EPISODES),
78
+ "--output_file", "eval_results_v2.json",
79
+ ]
80
+ run(eval_args)
81
+
82
+ # Upload back
83
+ token = os.environ.get("HF_TOKEN")
84
+ if token:
85
+ from huggingface_hub import HfApi, login
86
+ login(token=token)
87
+ api = HfApi()
88
+ api.upload_file(
89
+ path_or_fileobj="eval_results_v2.json",
90
+ path_in_repo="eval_results_v2.json",
91
+ repo_id=MODEL_REPO,
92
+ repo_type="model",
93
+ commit_message=f"Re-eval with max_new_tokens=256 fix; n={NUM_EPISODES} per condition",
94
+ )
95
+
96
+ print()
97
+ print("=" * 60)
98
+ print("DONE")
99
+ print(f" Eval JSON: https://huggingface.co/{MODEL_REPO}/blob/main/eval_results_v2.json")
100
+ print("=" * 60)
101
+
102
+
103
+ if __name__ == "__main__":
104
+ main()
training/inference_eval.py CHANGED
@@ -59,7 +59,12 @@ def model_action(obs, model, tokenizer, return_belief: bool = False):
59
 
60
  text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
61
  inputs = tokenizer(text, return_tensors="pt").to(model.device)
62
- outputs = model.generate(**inputs, max_new_tokens=20, temperature=0.7, do_sample=True)
 
 
 
 
 
63
  response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
64
 
65
  action_type, belief, _ = extract_action_and_belief(response)
 
59
 
60
  text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
61
  inputs = tokenizer(text, return_tensors="pt").to(model.device)
62
+ # 256 tokens lets the SFT-distilled student emit its full
63
+ # <reasoning>...</reasoning> block PLUS the final S M W ACTION_NAME line.
64
+ # Earlier 20-token cap truncated mid-reasoning so the answer line was
65
+ # never reached and parser fell back to extracting action names from
66
+ # the partial reasoning text.
67
+ outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
68
  response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
69
 
70
  action_type, belief, _ = extract_action_and_belief(response)