Clarify documentation: anomaly signal explainer, GRPO scope notes
User-authored documentation improvements that resolve the structural
confusion flagged in review.
README.md
+ New section 'The anomaly signal (the cleanest inference channel)'
with concrete worked example showing how per-step anomalies encode
the hidden profile. Explains why the env can train meta-RL inference
even on a small model.
+ 'Continuous Personality Space' table now has a 'concrete intuition'
column for each parameter (turns abstract distributions into
readable behavioral descriptions).
docs/training.md
+ 'Scope of this doc' banner: clarifies this guide describes the
pure-GRPO-from-scratch stack (the historical Iter 5 path), NOT the
canonical Algorithm Distillation recipe that produced the headline
numbers. Explicit pointer to README's AD section.
+ Reward stack section expanded from 3 to 4 layers (now correctly
documents belief_accuracy alongside format_valid, action_legal,
env_reward).
training/RhythmEnv_GRPO_Training.ipynb
+ Top-of-notebook ⚠️ 'Read this first' banner explaining that the notebook
runs Iter 5 (pure GRPO), preserved as reproducible evidence, NOT the
headline path. Points readers at the AD pipeline scripts for
reproducing the actual submitted numbers.
+ Fixed em-dash encoding in two intro print statements.
.gitignore
+ Add HANDOFF.md to local-only ignores.
- .gitignore +1 -0
- README.md +108 -19
- docs/training.md +19 -3
- training/RhythmEnv_GRPO_Training.ipynb +7 -19
.gitignore
@@ -14,3 +14,4 @@ iter1_results/
 iter2_results/
 iter5_results/
 outputs/
+HANDOFF.md
README.md
@@ -142,9 +142,30 @@
 - All 5 meter values + per-meter deltas from the last action
 - Current day, slot, timestep
 - Active random event (if any)
-- Rolling 7-step history of (action, reward, deltas)
+- **Rolling 7-step history of (action, reward, deltas, *anomalies*)** (see below)
 - Total scalar reward
+
+### The anomaly signal (the cleanest inference channel)
+
+For every past step in the rolling history, the agent sees both the actual
+per-meter delta *and* a per-meter **anomaly** = `actual_delta - expected_delta_under_neutral_profile`.
+
+A neutral profile is the average person; the anomaly therefore tells the
+agent **how much this specific user's response deviated from the average
+user's response to the same action**. That deviation is what encodes the
+hidden personality.
+
+Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`.
+Under a neutral profile the expected change is `-0.06`, so
+`vitality_anomaly = -0.12`: the user lost three times as much energy as the
+average person would have. That is strong evidence the hidden
+`social_vitality_multiplier` is high (introvert: socializing is costly). The
+agent should down-weight social actions and reach for solo recovery instead.
+
+This is why a tiny model can learn meta-RL inference here: the env hands it
+a clean, well-typed deviation-from-baseline signal at every step. See
+[`models.py` (StepRecord)](models.py) for the exact field layout.
+
 ### What the Agent Does NOT See
 - **The hidden personality vector** - sampled per episode, controls everything below
 - **Reward weight decomposition** - same meter changes produce different rewards for different people
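To make the anomaly computation above concrete, here is a minimal Python sketch. The `NEUTRAL_EXPECTED` table and function names are illustrative assumptions, not the actual `models.py` / environment API.

```python
# Illustrative sketch only: shows how a per-meter anomaly could be derived.
# The real field layout lives in models.py (StepRecord); names here are assumed.
from typing import Dict

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

# Hypothetical expected per-meter deltas for an "average" (neutral) profile.
NEUTRAL_EXPECTED: Dict[str, Dict[str, float]] = {
    "SOCIALIZE": {"vitality": -0.06, "connection": +0.08},
    "DEEP_WORK": {"vitality": -0.05, "progress": +0.10},
}

def anomalies(action: str, observed_deltas: Dict[str, float]) -> Dict[str, float]:
    """anomaly = actual_delta - expected_delta_under_neutral_profile, per meter."""
    expected = NEUTRAL_EXPECTED.get(action, {})
    return {m: observed_deltas.get(m, 0.0) - expected.get(m, 0.0) for m in METERS}

# Worked example from the section above: SOCIALIZE drains far more vitality than average.
print(anomalies("SOCIALIZE", {"vitality": -0.18, "connection": +0.05}))
# {'vitality': -0.12, ...}  -> strong evidence of a high social_vitality_multiplier
```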
@@ -154,24 +175,24 @@
 Each `reset()` samples a fresh personality from parameter distributions:

+| Parameter | Distribution | What it controls | Concrete intuition |
+|---|---|---|---|
+| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
+| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
+| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
+| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
+| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
+| `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
+| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
+| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
+| `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
+| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
+| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
+| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
+| `stress_tolerance` | U(0.15, 0.30) | meter level where the stress-spiral penalty fires | Low = falls apart easily. |
+| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |
+
-This produces an effectively infinite personality space - memorization is impossible, the agent must learn the *skill* of inference.
+This produces an effectively infinite personality space - memorization is impossible, the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).

 ### Three reference profiles
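A rough sketch of what sampling from the table above can look like. Parameter names follow the table, but the sampling details (the coin flip for the "or none" bonuses, the Dirichlet concentrations) are assumptions; the canonical logic is `sample_profile` in `server/rhythm_environment.py`.

```python
# Illustrative only: samples a personality dict matching the distributions in the
# table above. The canonical logic is sample_profile() in server/rhythm_environment.py.
import random

def sample_profile_sketch(rng: random.Random) -> dict:
    profile = {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        # "or none" bonuses: assume a coin flip decides whether the bonus exists at all.
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": rng.random() < 0.5,          # Bernoulli(0.5)
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
    }
    # Dirichlet reward weights biased away from vitality/cognition: heavier
    # concentration on progress/serenity/connection (the exact bias is an assumption).
    alphas = {"vitality": 0.5, "cognition": 0.5, "progress": 2.0, "serenity": 2.0, "connection": 2.0}
    draws = {k: rng.gammavariate(a, 1.0) for k, a in alphas.items()}
    total = sum(draws.values())
    profile["reward_weights"] = {k: v / total for k, v in draws.items()}
    return profile

print(sample_profile_sketch(random.Random(0)))
```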
@@ -315,6 +336,74 @@
 result. See [`docs/iterations.md`](docs/iterations.md) for the full journey
 and what each GRPO iteration taught us.

+## Reproducing the headline numbers
+
+There are two reproduction paths depending on how much time and budget you
+have. Both produce the numbers in the *Headline result* table above.
+
+### Fast path (~10-20 min, $0): re-run eval against the published checkpoint
+
+This is the path most reviewers should take. The trained model is already
+on the Hub. Download it once, then run `inference_eval.py` (which expects a
+local path) against all three eval conditions.
+
+```bash
+# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
+pip install -e .
+export HF_TOKEN=...   # any read-scoped HF token; the model is public.
+
+# 1. Snapshot the trained checkpoint locally.
+hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
+  --local-dir ./models/sft-v3
+
+# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
+python training/inference_eval.py \
+  --model_path ./models/sft-v3 \
+  --num_episodes 5 \
+  --output_file eval_results_v2.json
+```
+
+Expected output: `eval_results_v2.json` whose per-condition averages match
+[the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
+(distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist /
+continuous-OOD / discrete-3) within ±0.02 stochastic noise.
+
+> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
+> same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100,
+> runs `inference_eval.py`, and uploads the resulting JSON back to the
+> model repo. See the docstring in that script for the submit command.
+
+### Full path (~1.5 hr, ~$5-6 in API + GPU credits): retrain from scratch
+
+```bash
+# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
+cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
+python scripts/generate_teacher_trajectories.py \
+  --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
+
+# 2. Push teacher data to a HF dataset repo of your choice.
+python scripts/upload_teacher_data.py \
+  --files data/teacher_30ep.jsonl \
+  --repo <your_user>/rhythm-env-teacher-trajectories
+
+# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2-3).
+hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
+  -e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
+  -e MODEL_REPO_SUFFIX=sft-v3-repro \
+  -d scripts/sft_on_hf.py
+
+# 4. Eval the new checkpoint via the Fast Path above:
+hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
+  --local-dir ./models/sft-v3-repro
+python training/inference_eval.py \
+  --model_path ./models/sft-v3-repro \
+  --num_episodes 5 \
+  --output_file eval_results_v2.json
+```
+
+Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py`
+command shown earlier in the *Training: Algorithm Distillation* section.
+
 ## Setup Instructions

 ### Local Development
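For checking a fast-path rerun against the published numbers, a small comparison script along these lines can help. The JSON key names here are assumptions, since the exact layout of `eval_results_v2.json` is defined by `inference_eval.py`; adapt as needed.

```python
# Hypothetical sanity check: compare a rerun against the published per-condition
# averages within +/-0.02. Key names ("per_condition_mean", condition labels) are
# assumed; adapt to the actual layout written by training/inference_eval.py.
import json

PUBLISHED = {"continuous-in-dist": 0.574, "continuous-OOD": 0.536, "discrete-3": 0.507}
TOLERANCE = 0.02

with open("eval_results_v2.json") as f:
    results = json.load(f)

for condition, expected in PUBLISHED.items():
    got = results["per_condition_mean"][condition]   # assumed key path
    status = "OK" if abs(got - expected) <= TOLERANCE else "MISMATCH"
    print(f"{condition}: got {got:.3f}, expected {expected:.3f} +/- {TOLERANCE} -> {status}")
```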
docs/training.md
@@ -1,5 +1,14 @@
 # Training Guide - RhythmEnv GRPO

+> **Scope of this doc:** the **pure-GRPO-from-scratch** training stack
+> exercised by [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb)
+> (Iter 5 in [`iterations.md`](iterations.md)). The submission's **headline
+> numbers** come from a different recipe, **Algorithm Distillation**
+> (teacher trajectories → SFT), described in the [README](../README.md)
+> under *Training: Algorithm Distillation*. The GRPO machinery here is
+> still used in the final pipeline, but only as the *optional Stage 3
+> refine on top of the SFT checkpoint*, not as the from-scratch trainer.
+
 ## What we're training

 A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.

@@ -20,18 +29,25 @@
 ---

+## Four-layer reward stack
+
+Each training step scores four candidate completions per prompt across four reward functions:

 | Layer | Function | Signal | Pass | Fail |
 |---|---|---|---|---|
+| 1 | `format_valid` | Is the output parseable as `ACTION + 3 belief digits`? | +1.0 | -2.0 |
 | 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | -1.0 |
 | 3 | `env_reward` | Real reward from stepping the environment | varies | -3.0 |
+| 4 | `belief_accuracy` | Cosine-MAE between emitted belief vector and the hidden profile (the meta-RL signal) | +0.5 | -0.5 |

 `env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.

+`belief_accuracy` is what the v2 grader was added to measure. The first
+five GRPO iterations under the v1 (three-layer) grader kept tying with the
+heuristic because nothing was rewarding the agent for *learning who it was
+helping*. Adding `belief_accuracy` makes inference a first-class
+training signal. See [`iterations.md`](iterations.md) for the post-mortem.
+
 ---

 ## Key config choices
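To illustrate how the four layers combine for one candidate completion, here is a hedged sketch. The completion regex, helper names, and the belief-scoring formula are assumptions (the real implementations live in the training code); only the pass/fail magnitudes mirror the table above.

```python
# Illustrative sketch of the 4-layer reward stack described above. Helper names,
# the completion format, and the belief scoring are assumptions; only the
# pass/fail magnitudes mirror the table. The real logic lives in the training code.
import re

VALID_ACTIONS = {"DEEP_WORK", "SOCIALIZE", "ME_TIME", "FAMILY_TIME", "BINGE_WATCH"}  # subset, for illustration
COMPLETION_RE = re.compile(r"^([A-Z_]+)\s+(\d)(\d)(\d)$")  # assumed "ACTION + 3 belief digits" shape

def score_completion(text, env_step_reward, true_profile_digits):
    rewards = {}

    # Layer 1: format_valid -- parseable ACTION + 3 belief digits?
    m = COMPLETION_RE.match(text.strip())
    rewards["format_valid"] = 1.0 if m else -2.0
    if not m:
        return rewards

    action, *belief = m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))

    # Layer 2: action_legal -- one of the valid ActionType values?
    legal = action in VALID_ACTIONS
    rewards["action_legal"] = 0.5 if legal else -1.0

    # Layer 3: env_reward -- real reward from replaying the seeded episode and
    # stepping the environment with this candidate action (injected here as a callable).
    rewards["env_reward"] = env_step_reward(action) if legal else -3.0

    # Layer 4: belief_accuracy -- closeness of emitted belief digits to the hidden
    # profile, squashed into [-0.5, +0.5] (a stand-in for the actual cosine-MAE score).
    mae = sum(abs(b - t) for b, t in zip(belief, true_profile_digits)) / 3
    rewards["belief_accuracy"] = 0.5 - min(mae / 9, 1.0)  # 0 error -> +0.5, max error -> -0.5

    return rewards

print(score_completion("SOCIALIZE 372", lambda a: 0.8, true_profile_digits=[4, 6, 2]))
```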
training/RhythmEnv_GRPO_Training.ipynb
@@ -3,19 +3,7 @@
 {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
-  "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n",
-  "\n",
-  "Train an LLM agent to *learn how to learn a person*. The agent receives 4-layer reward:\n",
-  "- format_valid (parseable ACTION + 3 belief digits)\n",
-  "- action_legal (one of 10 valid actions)\n",
-  "- env_reward (actual env reward via seed-replay)\n",
-  "- belief_accuracy (cosine-MAE vs hidden true profile vector)\n",
-  "\n",
-  "Continuous profile space + curriculum warmup + adaptation-aware grader \u2192 trains a meta-policy that generalizes to unseen personalities.\n",
-  "\n",
-  "Default config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
- ]
+ "source": "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n\n> **\u26a0\ufe0f Read this banner first \u2014 what this notebook is and isn't.**\n>\n> This notebook runs **pure GRPO from scratch** on Qwen 2.5-3B with the\n> 4-layer reward stack. It corresponds to **Iter 5** in\n> [`docs/iterations.md`](../docs/iterations.md) and is preserved here as\n> reproducible evidence of that iteration. **It is NOT the path that\n> produced the submission's headline numbers.**\n>\n> Five GRPO-from-scratch iterations all matched the heuristic baseline but\n> never beat it \u2014 small models (\u22643B) need a teacher to bootstrap reasoning\n> at this scale.\n>\n> The headline result comes from **Algorithm Distillation** instead:\n>\n> 1. `scripts/generate_teacher_trajectories.py` \u2014 gpt-5.4 plays episodes and writes reasoning + beliefs.\n> 2. `scripts/sft_on_hf.py` \u2014 distill teacher trajectories into Qwen 3B (the SFT v3 checkpoint).\n> 3. `scripts/train_on_hf.py` \u2014 *optional* GRPO refine **on top of** the SFT checkpoint. This is where the GRPO machinery this notebook exercises (via `training/train.py`) is actually used in the final pipeline.\n>\n> See [`README.md`](../README.md) and [`BLOG.md`](../BLOG.md) for the\n> end-to-end story. To reproduce the headline numbers, follow Stages 1-3\n> of the README's *Training: Algorithm Distillation* section, not this\n> notebook.\n\nWhat this notebook actually does (pure GRPO from scratch with the 4-layer reward):\n\n- format_valid (parseable ACTION + 3 belief digits)\n- action_legal (one of 10 valid actions)\n- env_reward (actual env reward via seed-replay)\n- belief_accuracy (cosine-MAE vs hidden true profile vector)\n\nContinuous profile space + curriculum warmup + adaptation-aware grader.\n\nDefault config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
 },
 {
  "cell_type": "markdown",

@@ -86,12 +74,12 @@
  "\n",
  "env = RhythmEnvironment()\n",
  "obs = env.reset(seed=0)\n",
+ "print(f\"Initial state \u2014 Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
  "      f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
  "\n",
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
+ "print(f\"\\nAfter DEEP_WORK \u2014 Reward: {obs.reward:.3f}\")\n",
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
  "print(\"\\nEnvironment is working!\")"
 ]

@@ -367,7 +355,7 @@
  " ax.plot(steps, losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
  " ax.set_xlabel(\"Training Step\")\n",
  " ax.set_ylabel(\"Loss\")\n",
+ " ax.set_title(\"GRPO Training Loss \u2014 RhythmEnv Meta-RL\")\n",
  " ax.grid(True, alpha=0.3)\n",
  " plt.tight_layout()\n",
  " plt.savefig(\"plots/training_loss.png\", dpi=150)\n",

@@ -382,10 +370,10 @@
  " ax.plot(rsteps, rvals, color=\"#16a34a\", linewidth=1.5, label=f\"Mean Reward ({rkey})\")\n",
  " if svals and len(svals) == len(rvals):\n",
  "     r, s = np.array(rvals), np.array(svals)\n",
+ "     ax.fill_between(rsteps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"\u00b11 std\")\n",
  " ax.set_xlabel(\"Training Step\")\n",
  " ax.set_ylabel(\"Mean Total Reward\")\n",
+ " ax.set_title(\"GRPO Mean Reward over Training \u2014 RhythmEnv Meta-RL\")\n",
  " ax.legend()\n",
  " ax.grid(True, alpha=0.3)\n",
  " plt.tight_layout()\n",

@@ -440,7 +428,7 @@
  " ax.plot(smooth_x, smooth, color=\"#7e22ce\", linewidth=2.5, label=f\"Rolling mean ({win}-step)\")\n",
  " ax.axhline(0.0, color=\"k\", linewidth=0.5, linestyle=\"--\", alpha=0.5, label=\"neutral belief baseline (0.0)\")\n",
  " ax.set_xlabel(\"Training Step\")\n",
+ " ax.set_ylabel(\"Mean belief_accuracy reward (-0.5 to +0.5)\")\n",
  " ax.set_title(\"Belief-Accuracy Reward over Training (proof agent learned to model user)\")\n",
  " ax.legend(loc=\"best\")\n",
  " ax.grid(True, alpha=0.3)\n",