InosLihka committed
Commit 361aed7 · 1 parent: 39e98db

Clarify documentation: anomaly signal explainer, GRPO scope notes


User-authored documentation improvements that resolve the structural
confusion flagged in review.

README.md
+ New section 'The anomaly signal (the cleanest inference channel)'
with a concrete worked example showing how per-step anomalies encode
the hidden profile. Explains why the env can train meta-RL inference
even on a small model.
+ 'Continuous Personality Space' table now has a 'concrete intuition'
column for each parameter (turns abstract distributions into
readable behavioral descriptions).

docs/training.md
+ 'Scope of this doc' banner: clarifies that this guide describes the
pure-GRPO-from-scratch stack (the historical Iter 5 path), NOT the
canonical Algorithm Distillation recipe that produced the headline
numbers. Explicit pointer to README's AD section.
+ Reward stack section expanded from 3 to 4 layers (now correctly
documents belief_accuracy alongside format_valid, action_legal,
env_reward).

training/RhythmEnv_GRPO_Training.ipynb
+ Top-of-notebook ⚠️ banner: 'Read this first' — explains that the notebook
runs Iter 5 (pure GRPO), preserved as reproducible evidence, NOT the
headline path. Points readers at the AD pipeline scripts for
reproducing the actual submitted numbers.
+ Fixed em-dash encoding in two intro print statements.

.gitignore
+ Add HANDOFF.md to local-only ignores.

Files changed (4)
  1. .gitignore +1 -0
  2. README.md +108 -19
  3. docs/training.md +19 -3
  4. training/RhythmEnv_GRPO_Training.ipynb +7 -19
.gitignore CHANGED
@@ -14,3 +14,4 @@ iter1_results/
  iter2_results/
  iter5_results/
  outputs/
+ HANDOFF.md
README.md CHANGED
@@ -142,9 +142,30 @@ All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receiv
  - All 5 meter values + per-meter deltas from the last action
  - Current day, slot, timestep
  - Active random event (if any)
- - Rolling 7-step history of (action, reward, deltas)
+ - **Rolling 7-step history of (action, reward, deltas, *anomalies*)** — see below
  - Total scalar reward
 
+ ### The anomaly signal (the cleanest inference channel)
+
+ For every past step in the rolling history, the agent sees both the actual
+ per-meter delta *and* a per-meter **anomaly** = `actual_delta − expected_delta_under_neutral_profile`.
+
+ A neutral profile is the average person; the anomaly therefore tells the
+ agent **how much this specific user's response deviated from the average
+ user's response to the same action**. That deviation is what encodes the
+ hidden personality.
+
+ Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`.
+ Under a neutral profile the expected change is `-0.06`, so
+ `vitality_anomaly = -0.12` — the user lost three times as much energy as the
+ average person would have. That is strong evidence the hidden
+ `social_vitality_multiplier` is high (introvert: socializing is costly). The
+ agent should down-weight social actions and reach for solo recovery instead.
+
+ This is why a tiny model can learn meta-RL inference here: the env hands it
+ a clean, well-typed deviation-from-baseline signal at every step. See
+ [`models.py` (StepRecord)](models.py) for the exact field layout.
+
  ### What the Agent Does NOT See
  - **The hidden personality vector** — sampled per episode, controls everything below
  - **Reward weight decomposition** — same meter changes produce different rewards for different people
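To make the worked example above executable, here is a minimal sketch of the anomaly computation. It is illustrative only: the `NEUTRAL_EXPECTED_DELTAS` table and `compute_anomalies` helper are hypothetical names, not the actual `StepRecord` layout in `models.py`, and every number except the SOCIALIZE vitality pair (-0.18 actual vs. -0.06 expected) is a made-up placeholder.

```python
# Hedged sketch: hypothetical names, not the repo's StepRecord API.
# Only the SOCIALIZE vitality numbers come from the README example above.

# Expected per-meter delta for each action under the *neutral* (average) profile.
NEUTRAL_EXPECTED_DELTAS: dict[str, dict[str, float]] = {
    "SOCIALIZE": {"vitality": -0.06, "connection": +0.08},  # connection value is a placeholder
    "DEEP_WORK": {"vitality": -0.08, "progress": +0.10},    # placeholders
}

def compute_anomalies(action: str, actual_deltas: dict[str, float]) -> dict[str, float]:
    """anomaly = actual_delta - expected_delta_under_neutral_profile, per meter."""
    expected = NEUTRAL_EXPECTED_DELTAS.get(action, {})
    return {
        meter: round(actual - expected.get(meter, 0.0), 3)
        for meter, actual in actual_deltas.items()
    }

# Worked example: this user loses far more vitality from socializing than average.
anomalies = compute_anomalies("SOCIALIZE", {"vitality": -0.18, "connection": +0.08})
print(anomalies["vitality"])  # -0.12 -> strong hint of a high social_vitality_multiplier
```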
@@ -154,24 +175,24 @@ All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receiv
 
  Each `reset()` samples a fresh personality from parameter distributions:
 
- | Parameter | Distribution | Effect |
- |---|---|---|
- | `social_vitality_multiplier` | U(0.2, 3.0) | low = extrovert, high = introvert |
- | `morning_cognition_bonus` | U(0.4, 2.0) or none | morning person trait |
- | `evening_night_cognition_bonus` | U(0.6, 1.8) or none | night owl trait |
- | `binge_shame` | Bernoulli(0.5) | conscientiousness |
- | `progress_serenity_bonus` | U(0, 0.10) | workaholic peace-from-work |
- | `work_vitality_recovery` | U(0, 0.06) | work energizes vs drains |
- | `solo_serenity_bonus` | U(0, 0.10) | introvert solo recharge |
- | `social_connection_multiplier` | U(1, 2) | extrovert social fluency |
- | `social_serenity_bonus` | U(0, 0.06) | extrovert peace-from-people |
- | `connection_decay_rate` | U(0.005, 0.02) | how fast relationships fade |
- | `vitality_decay_rate` | U(0, 0.04) | base energy decline |
- | `event_impact_multiplier` | U(0.5, 1.0) | resilience to setbacks |
- | `stress_tolerance` | U(0.15, 0.30) | when stress spiral kicks in |
- | `reward_weights` | Dirichlet biased to non-vit/cog | what THIS person values |
-
- This produces an effectively infinite personality space — memorization is impossible, the agent must learn the *skill* of inference.
+ | Parameter | Distribution | What it controls | Concrete intuition |
+ |---|---|---|---|
+ | `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
+ | `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
+ | `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
+ | `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
+ | `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
+ | `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
+ | `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
+ | `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
+ | `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
+ | `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
+ | `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
+ | `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
+ | `stress_tolerance` | U(0.15, 0.30) | meter level where stress-spiral penalty fires | Low = falls apart easily. |
+ | `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |
+
+ This produces an effectively infinite personality space — memorization is impossible, the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).
 
  ### Three reference profiles
 
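The distributions in the new table read directly as sampling code. Below is a minimal sketch of drawing one profile, assuming numpy; the 50% presence chance for the "or none" traits, the Dirichlet concentration values, and the meter ordering of `reward_weights` are guesses, and the repo's real logic lives in `server/rhythm_environment.py` (`sample_profile`).

```python
# Hedged sketch of sampling one hidden profile from the table's distributions.
# Not the repo's sample_profile; the "or none" probability, Dirichlet alphas,
# and reward_weights ordering are assumptions.
import numpy as np

def sample_profile_sketch(rng: np.random.Generator) -> dict:
    return {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        # "or none": assume a coin flip decides whether the trait exists at all.
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
        "binge_shame": bool(rng.random() < 0.5),              # Bernoulli(0.5)
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
        # "biased to non-vit/cog": lower concentration on the first two meters
        # (order assumed: vitality, cognition, progress, serenity, connection).
        "reward_weights": rng.dirichlet([0.5, 0.5, 2.0, 2.0, 2.0]).tolist(),
    }

print(sample_profile_sketch(np.random.default_rng(0)))
```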
 
@@ -315,6 +336,74 @@ the optional Stage 3, but it isn't on the critical path of the headline
  result. See [`docs/iterations.md`](docs/iterations.md) for the full journey
  and what each GRPO iteration taught us.
 
+ ## Reproducing the headline numbers
+
+ There are two reproduction paths depending on how much time and budget you
+ have. Both produce the numbers in the *Headline result* table above.
+
+ ### Fast path (~10–20 min, $0): re-run eval against the published checkpoint
+
+ This is the path most reviewers should take. The trained model is already
+ on the Hub. Download it once, then run `inference_eval.py` (which expects a
+ local path) against all three eval conditions.
+
+ ```bash
+ # Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
+ pip install -e .
+ export HF_TOKEN=...   # any read-scoped HF token; the model is public.
+
+ # 1. Snapshot the trained checkpoint locally.
+ hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
+   --local-dir ./models/sft-v3
+
+ # 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
+ python training/inference_eval.py \
+   --model_path ./models/sft-v3 \
+   --num_episodes 5 \
+   --output_file eval_results_v2.json
+ ```
+
+ Expected output: `eval_results_v2.json` whose per-condition averages match
+ [the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
+ (distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist /
+ continuous-OOD / discrete-3) within ±0.02 stochastic noise.
+
+ > If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
+ > same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100,
+ > runs `inference_eval.py`, and uploads the resulting JSON back to the
+ > model repo. See the docstring in that script for the submit command.
+
+ ### Full path (~1.5 hr, ~$5–6 in API + GPU credits): retrain from scratch
+
+ ```bash
+ # 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
+ cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
+ python scripts/generate_teacher_trajectories.py \
+   --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
+
+ # 2. Push teacher data to a HF dataset repo of your choice.
+ python scripts/upload_teacher_data.py \
+   --files data/teacher_30ep.jsonl \
+   --repo <your_user>/rhythm-env-teacher-trajectories
+
+ # 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2–3).
+ hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
+   -e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
+   -e MODEL_REPO_SUFFIX=sft-v3-repro \
+   -d scripts/sft_on_hf.py
+
+ # 4. Eval the new checkpoint via the Fast Path above:
+ hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
+   --local-dir ./models/sft-v3-repro
+ python training/inference_eval.py \
+   --model_path ./models/sft-v3-repro \
+   --num_episodes 5 \
+   --output_file eval_results_v2.json
+ ```
+
+ Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py`
+ command shown earlier in the *Training: Algorithm Distillation* section.
+
  ## Setup Instructions
 
  ### Local Development
docs/training.md CHANGED
@@ -1,5 +1,14 @@
  # Training Guide — RhythmEnv GRPO
 
+ > **Scope of this doc:** the **pure-GRPO-from-scratch** training stack
+ > exercised by [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb)
+ > (Iter 5 in [`iterations.md`](iterations.md)). The submission's **headline
+ > numbers** come from a different recipe — **Algorithm Distillation**
+ > (teacher trajectories → SFT) — described in the [README](../README.md)
+ > under *Training: Algorithm Distillation*. The GRPO machinery here is
+ > still used in the final pipeline, but only as the *optional Stage 3
+ > refine on top of the SFT checkpoint*, not as the from-scratch trainer.
+
  ## What we're training
 
  A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.
@@ -20,18 +29,25 @@ The goal is not to teach the model the rules of the environment — a capable LL
 
  ---
 
- ## Three-layer reward stack
+ ## Four-layer reward stack
 
- Each training step scores four candidate completions per prompt across three reward functions:
+ Each training step scores four candidate completions per prompt across four reward functions:
 
  | Layer | Function | Signal | Pass | Fail |
  |---|---|---|---|---|
- | 1 | `format_valid` | Is the output a parseable action name? | +1.0 | −2.0 |
+ | 1 | `format_valid` | Is the output parseable as `ACTION + 3 belief digits`? | +1.0 | −2.0 |
  | 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | −1.0 |
  | 3 | `env_reward` | Real reward from stepping the environment | varies | −3.0 |
+ | 4 | `belief_accuracy` | Cosine-MAE between emitted belief vector and the hidden profile (the meta-RL signal) | +0.5 | −0.5 |
 
  `env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action — the reward cannot be fabricated.
 
+ `belief_accuracy` is what the v2 grader was added to measure. The first
+ five GRPO iterations under the v1 (three-layer) grader kept tying with the
+ heuristic baseline because nothing was rewarding the agent for *learning who
+ it was helping*. Adding `belief_accuracy` makes inference a first-class
+ training signal. See [`iterations.md`](iterations.md) for the post-mortem.
+
  ---
 
  ## Key config choices
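For readers who want to see the shape of the new four-layer stack before opening the training code, below is a minimal sketch of one way the layers could combine into a single scalar. The function names, the regex for the `ACTION + 3 belief digits` format, and the cosine-minus-MAE reading of "Cosine-MAE" are assumptions, not the repo's actual grader.

```python
# Hedged sketch of the four-layer reward stack; names and formulas are assumptions.
import math
import re

VALID_ACTIONS = {"DEEP_WORK", "SOCIALIZE", "ME_TIME", "BINGE_WATCH", "FAMILY_TIME"}  # subset, for illustration

def format_valid(completion: str) -> float:
    """Layer 1: is the output parseable as `ACTION + 3 belief digits`?"""
    return 1.0 if re.fullmatch(r"[A-Z_]+ \d \d \d", completion.strip()) else -2.0

def action_legal(completion: str) -> float:
    """Layer 2: is the action one of the valid ActionType values?"""
    head = completion.strip().split()[0] if completion.strip() else ""
    return 0.5 if head in VALID_ACTIONS else -1.0

def belief_accuracy(pred: list[float], true: list[float]) -> float:
    """Layer 4: a [-0.5, +0.5] score from cosine similarity minus mean absolute error.
    One plausible reading of 'Cosine-MAE'; the repo's exact formula may differ."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in true))
    cosine = dot / norm if norm else 0.0
    mae = sum(abs(p - t) for p, t in zip(pred, true)) / len(true)
    return max(-0.5, min(0.5, 0.5 * (cosine - mae)))

def total_reward(completion: str, env_reward: float,
                 pred_belief: list[float], true_belief: list[float]) -> float:
    # Layer 3 (env_reward) comes from seed-based episode replay and is passed in.
    return (format_valid(completion) + action_legal(completion)
            + env_reward + belief_accuracy(pred_belief, true_belief))
```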
training/RhythmEnv_GRPO_Training.ipynb CHANGED
@@ -3,19 +3,7 @@
  {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
- "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n",
- "\n",
- "Train an LLM agent to *learn how to learn a person*. The agent receives 4-layer reward:\n",
- "- format_valid (parseable ACTION + 3 belief digits)\n",
- "- action_legal (one of 10 valid actions)\n",
- "- env_reward (actual env reward via seed-replay)\n",
- "- belief_accuracy (cosine-MAE vs hidden true profile vector)\n",
- "\n",
- "Continuous profile space + curriculum warmup + adaptation-aware grader \u2192 trains a meta-policy that generalizes to unseen personalities.\n",
- "\n",
- "Default config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
- ]
+ "source": "# RhythmEnv GRPO Training — Meta-RL Life Simulator\n\n> **⚠️ Read this banner first — what this notebook is and isn't.**\n>\n> This notebook runs **pure GRPO from scratch** on Qwen 2.5-3B with the\n> 4-layer reward stack. It corresponds to **Iter 5** in\n> [`docs/iterations.md`](../docs/iterations.md) and is preserved here as\n> reproducible evidence of that iteration. **It is NOT the path that\n> produced the submission's headline numbers.**\n>\n> Five GRPO-from-scratch iterations all matched the heuristic baseline but\n> never beat it — small models (≤3B) need a teacher to bootstrap reasoning\n> at this scale.\n>\n> The headline result comes from **Algorithm Distillation** instead:\n>\n> 1. `scripts/generate_teacher_trajectories.py` — gpt-5.4 plays episodes and writes reasoning + beliefs.\n> 2. `scripts/sft_on_hf.py` — distill teacher trajectories into Qwen 3B (the SFT v3 checkpoint).\n> 3. `scripts/train_on_hf.py` — *optional* GRPO refine **on top of** the SFT checkpoint. This is where the GRPO machinery this notebook exercises (via `training/train.py`) is actually used in the final pipeline.\n>\n> See [`README.md`](../README.md) and [`BLOG.md`](../BLOG.md) for the\n> end-to-end story. To reproduce the headline numbers, follow Stages 1–3\n> of the README's *Training: Algorithm Distillation* section — not this\n> notebook.\n\nWhat this notebook actually does (pure GRPO from scratch with the 4-layer reward):\n\n- format_valid (parseable ACTION + 3 belief digits)\n- action_legal (one of 10 valid actions)\n- env_reward (actual env reward via seed-replay)\n- belief_accuracy (cosine-MAE vs hidden true profile vector)\n\nContinuous profile space + curriculum warmup + adaptation-aware grader.\n\nDefault config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
  },
  {
  "cell_type": "markdown",
@@ -86,12 +74,12 @@
  "\n",
  "env = RhythmEnvironment()\n",
  "obs = env.reset(seed=0)\n",
- "print(f\"Initial state \u00e2\u20ac\u201d Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
+ "print(f\"Initial state — Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
  " f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
  "\n",
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
- "print(f\"\\nAfter DEEP_WORK \u00e2\u20ac\u201d Reward: {obs.reward:.3f}\")\n",
+ "print(f\"\\nAfter DEEP_WORK — Reward: {obs.reward:.3f}\")\n",
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
  "print(\"\\nEnvironment is working!\")"
  ]
@@ -367,7 +355,7 @@
  " ax.plot(steps, losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
  " ax.set_xlabel(\"Training Step\")\n",
  " ax.set_ylabel(\"Loss\")\n",
- " ax.set_title(\"GRPO Training Loss \u2014 RhythmEnv Meta-RL\")\n",
+ " ax.set_title(\"GRPO Training Loss — RhythmEnv Meta-RL\")\n",
  " ax.grid(True, alpha=0.3)\n",
  " plt.tight_layout()\n",
  " plt.savefig(\"plots/training_loss.png\", dpi=150)\n",
@@ -382,10 +370,10 @@
  " ax.plot(rsteps, rvals, color=\"#16a34a\", linewidth=1.5, label=f\"Mean Reward ({rkey})\")\n",
  " if svals and len(svals) == len(rvals):\n",
  " r, s = np.array(rvals), np.array(svals)\n",
- " ax.fill_between(rsteps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"\u00b11 std\")\n",
+ " ax.fill_between(rsteps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"±1 std\")\n",
  " ax.set_xlabel(\"Training Step\")\n",
  " ax.set_ylabel(\"Mean Total Reward\")\n",
- " ax.set_title(\"GRPO Mean Reward over Training \u2014 RhythmEnv Meta-RL\")\n",
+ " ax.set_title(\"GRPO Mean Reward over Training — RhythmEnv Meta-RL\")\n",
  " ax.legend()\n",
  " ax.grid(True, alpha=0.3)\n",
  " plt.tight_layout()\n",
@@ -440,7 +428,7 @@
  " ax.plot(smooth_x, smooth, color=\"#7e22ce\", linewidth=2.5, label=f\"Rolling mean ({win}-step)\")\n",
  " ax.axhline(0.0, color=\"k\", linewidth=0.5, linestyle=\"--\", alpha=0.5, label=\"neutral belief baseline (0.0)\")\n",
  " ax.set_xlabel(\"Training Step\")\n",
- " ax.set_ylabel(\"Mean belief_accuracy reward (\u22120.5 to +0.5)\")\n",
+ " ax.set_ylabel(\"Mean belief_accuracy reward (−0.5 to +0.5)\")\n",
  " ax.set_title(\"Belief-Accuracy Reward over Training (proof agent learned to model user)\")\n",
  " ax.legend(loc=\"best\")\n",
  " ax.grid(True, alpha=0.3)\n",