Clarify documentation: anomaly signal explainer, GRPO scope notes
User-authored documentation improvements that resolve the structural
confusion flagged in review.
README.md
+ New section 'The anomaly signal (the cleanest inference channel)'
with concrete worked example showing how per-step anomalies encode
the hidden profile. Explains why the env can train meta-RL inference
even on a small model.
+ 'Continuous Personality Space' table now has a 'concrete intuition'
column for each parameter (turns abstract distributions into
readable behavioral descriptions).
docs/training.md
+ 'Scope of this doc' banner: clarifies this guide describes the
pure-GRPO-from-scratch stack (the historical Iter 5 path), NOT the
canonical Algorithm Distillation recipe that produced the headline
numbers. Explicit pointer to README's AD section.
+ Reward stack section expanded from 3 to 4 layers (now correctly
documents belief_accuracy alongside format_valid, action_legal,
env_reward).
training/RhythmEnv_GRPO_Training.ipynb
+ Top-of-notebook ⚠️ 'Read this first' banner explaining that the notebook
runs Iter 5 (pure GRPO), preserved as reproducible evidence, NOT the
headline path. Points readers at the AD pipeline scripts for
reproducing the actual submitted numbers.
+ Fixed em-dash encoding in two intro print statements.
.gitignore
+ Add HANDOFF.md to local-only ignores.
- .gitignore +1 -0
- README.md +108 -19
- docs/training.md +19 -3
- training/RhythmEnv_GRPO_Training.ipynb +7 -19
.gitignore
@@ -14,3 +14,4 @@ iter1_results/
 iter2_results/
 iter5_results/
 outputs/
+HANDOFF.md
README.md
@@ -142,9 +142,30 @@
 - All 5 meter values + per-meter deltas from the last action
 - Current day, slot, timestep
 - Active random event (if any)
-- Rolling 7-step history of (action, reward, deltas)
+- **Rolling 7-step history of (action, reward, deltas, *anomalies*)** (see below)
 - Total scalar reward
+
+### The anomaly signal (the cleanest inference channel)
+
+For every past step in the rolling history, the agent sees both the actual
+per-meter delta *and* a per-meter **anomaly** = `actual_delta - expected_delta_under_neutral_profile`.
+
+A neutral profile is the average person; the anomaly therefore tells the
+agent **how much this specific user's response deviated from the average
+user's response to the same action**. That deviation is what encodes the
+hidden personality.
+
+Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`.
+Under a neutral profile the expected change is `-0.06`, so
+`vitality_anomaly = -0.12`: the user lost three times as much energy as the
+average person would have. That is strong evidence the hidden
+`social_vitality_multiplier` is high (introvert: socializing is costly). The
+agent should down-weight social actions and reach for solo recovery instead.
+
+This is why a tiny model can learn meta-RL inference here: the env hands it
+a clean, well-typed deviation-from-baseline signal at every step. See
+[`models.py` (StepRecord)](models.py) for the exact field layout.
+
 ### What the Agent Does NOT See
 - **The hidden personality vector** - sampled per episode, controls everything below
 - **Reward weight decomposition** - same meter changes produce different rewards for different people
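To make the anomaly computation above concrete, here is a minimal Python sketch. The `NEUTRAL_EXPECTED` table and function names are illustrative assumptions, not the actual `models.py` / environment API.

```python
# Illustrative sketch only: shows how a per-meter anomaly could be derived.
# The real field layout lives in models.py (StepRecord); names here are assumed.
from typing import Dict

METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

# Hypothetical expected per-meter deltas for an "average" (neutral) profile.
NEUTRAL_EXPECTED: Dict[str, Dict[str, float]] = {
    "SOCIALIZE": {"vitality": -0.06, "connection": +0.08},
    "DEEP_WORK": {"vitality": -0.05, "progress": +0.10},
}

def anomalies(action: str, observed_deltas: Dict[str, float]) -> Dict[str, float]:
    """anomaly = actual_delta - expected_delta_under_neutral_profile, per meter."""
    expected = NEUTRAL_EXPECTED.get(action, {})
    return {m: observed_deltas.get(m, 0.0) - expected.get(m, 0.0) for m in METERS}

# Worked example from the section above: SOCIALIZE drains far more vitality than average.
print(anomalies("SOCIALIZE", {"vitality": -0.18, "connection": +0.05}))
# {'vitality': -0.12, ...}  -> strong evidence of a high social_vitality_multiplier
```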
@@ -154,24 +175,24 @@
 Each `reset()` samples a fresh personality from parameter distributions:

+| Parameter | Distribution | What it controls | Concrete intuition |
+|---|---|---|---|
+| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
+| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
+| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
+| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
+| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
+| `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
+| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
+| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
+| `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
+| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
+| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
+| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
+| `stress_tolerance` | U(0.15, 0.30) | meter level where the stress-spiral penalty fires | Low = falls apart easily. |
+| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |
+
-This produces an effectively infinite personality space - memorization is impossible, the agent must learn the *skill* of inference.
+This produces an effectively infinite personality space - memorization is impossible, the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).

 ### Three reference profiles
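A rough sketch of what sampling from the table above can look like. Parameter names follow the table, but the sampling details (the coin flip for the "or none" bonuses, the Dirichlet concentrations) are assumptions; the canonical logic is `sample_profile` in `server/rhythm_environment.py`.

```python
# Illustrative only: samples a personality dict matching the distributions in the
# table above. The canonical logic is sample_profile() in server/rhythm_environment.py.
import random

def sample_profile_sketch(rng: random.Random) -> dict:
    profile = {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        # "or none" bonuses: assume a coin flip decides whether the bonus exists at all.
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": rng.random() < 0.5,          # Bernoulli(0.5)
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
    }
    # Dirichlet reward weights biased away from vitality/cognition: heavier
    # concentration on progress/serenity/connection (the exact bias is an assumption).
    alphas = {"vitality": 0.5, "cognition": 0.5, "progress": 2.0, "serenity": 2.0, "connection": 2.0}
    draws = {k: rng.gammavariate(a, 1.0) for k, a in alphas.items()}
    total = sum(draws.values())
    profile["reward_weights"] = {k: v / total for k, v in draws.items()}
    return profile

print(sample_profile_sketch(random.Random(0)))
```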
@@ -315,6 +336,74 @@
 result. See [`docs/iterations.md`](docs/iterations.md) for the full journey
 and what each GRPO iteration taught us.

+## Reproducing the headline numbers
+
+There are two reproduction paths depending on how much time and budget you
+have. Both produce the numbers in the *Headline result* table above.
+
+### Fast path (~10-20 min, $0): re-run eval against the published checkpoint
+
+This is the path most reviewers should take. The trained model is already
+on the Hub. Download it once, then run `inference_eval.py` (which expects a
+local path) against all three eval conditions.
+
+```bash
+# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
+pip install -e .
+export HF_TOKEN=...   # any read-scoped HF token; the model is public.
+
+# 1. Snapshot the trained checkpoint locally.
+hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
+  --local-dir ./models/sft-v3
+
+# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
+python training/inference_eval.py \
+  --model_path ./models/sft-v3 \
+  --num_episodes 5 \
+  --output_file eval_results_v2.json
+```
+
+Expected output: `eval_results_v2.json` whose per-condition averages match
+[the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
+(distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist /
+continuous-OOD / discrete-3) within ±0.02 stochastic noise.
+
+> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
+> same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100,
+> runs `inference_eval.py`, and uploads the resulting JSON back to the
+> model repo. See the docstring in that script for the submit command.
+
+### Full path (~1.5 hr, ~$5-6 in API + GPU credits): retrain from scratch
+
+```bash
+# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
+cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
+python scripts/generate_teacher_trajectories.py \
+  --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
+
+# 2. Push teacher data to a HF dataset repo of your choice.
+python scripts/upload_teacher_data.py \
+  --files data/teacher_30ep.jsonl \
+  --repo <your_user>/rhythm-env-teacher-trajectories
+
+# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2-3).
+hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
+  -e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
+  -e MODEL_REPO_SUFFIX=sft-v3-repro \
+  -d scripts/sft_on_hf.py
+
+# 4. Eval the new checkpoint via the Fast Path above:
+hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
+  --local-dir ./models/sft-v3-repro
+python training/inference_eval.py \
+  --model_path ./models/sft-v3-repro \
+  --num_episodes 5 \
+  --output_file eval_results_v2.json
+```
+
+Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py`
+command shown earlier in the *Training: Algorithm Distillation* section.
+
 ## Setup Instructions

 ### Local Development
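For checking a fast-path rerun against the published numbers, a small comparison script along these lines can help. The JSON key names here are assumptions, since the exact layout of `eval_results_v2.json` is defined by `inference_eval.py`; adapt as needed.

```python
# Hypothetical sanity check: compare a rerun against the published per-condition
# averages within +/-0.02. Key names ("per_condition_mean", condition labels) are
# assumed; adapt to the actual layout written by training/inference_eval.py.
import json

PUBLISHED = {"continuous-in-dist": 0.574, "continuous-OOD": 0.536, "discrete-3": 0.507}
TOLERANCE = 0.02

with open("eval_results_v2.json") as f:
    results = json.load(f)

for condition, expected in PUBLISHED.items():
    got = results["per_condition_mean"][condition]   # assumed key path
    status = "OK" if abs(got - expected) <= TOLERANCE else "MISMATCH"
    print(f"{condition}: got {got:.3f}, expected {expected:.3f} +/- {TOLERANCE} -> {status}")
```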
docs/training.md
@@ -1,5 +1,14 @@
 # Training Guide - RhythmEnv GRPO

+> **Scope of this doc:** the **pure-GRPO-from-scratch** training stack
+> exercised by [`training/RhythmEnv_GRPO_Training.ipynb`](../training/RhythmEnv_GRPO_Training.ipynb)
+> (Iter 5 in [`iterations.md`](iterations.md)). The submission's **headline
+> numbers** come from a different recipe, **Algorithm Distillation**
+> (teacher trajectories → SFT), described in the [README](../README.md)
+> under *Training: Algorithm Distillation*. The GRPO machinery here is
+> still used in the final pipeline, but only as the *optional Stage 3
+> refine on top of the SFT checkpoint*, not as the from-scratch trainer.
+
 ## What we're training

 A Qwen 2.5-3B model (4-bit quantized + LoRA) to play one-week episodes in RhythmEnv. The agent sees 5 life meters, time of day, and a reward signal. It must infer the hidden personality profile from those signals and adapt its action selection accordingly.

@@ -20,18 +29,25 @@
 ---

+## Four-layer reward stack
+
+Each training step scores four candidate completions per prompt across four reward functions:

 | Layer | Function | Signal | Pass | Fail |
 |---|---|---|---|---|
+| 1 | `format_valid` | Is the output parseable as `ACTION + 3 belief digits`? | +1.0 | -2.0 |
 | 2 | `action_legal` | Is it one of the 10 valid `ActionType` values? | +0.5 | -1.0 |
 | 3 | `env_reward` | Real reward from stepping the environment | varies | -3.0 |
+| 4 | `belief_accuracy` | Cosine-MAE between emitted belief vector and the hidden profile (the meta-RL signal) | +0.5 | -0.5 |

 `env_reward` uses seed-based episode replay: the dataset stores `seed`, `step_index`, and `action_history` alongside each prompt. The reward function reconstructs the exact episode state and steps the environment with the candidate action, so the reward cannot be fabricated.

+`belief_accuracy` is what the v2 grader was added to measure. The first
+five GRPO iterations under the v1 (three-layer) grader kept tying with the
+heuristic because nothing was rewarding the agent for *learning who it was
+helping*. Adding `belief_accuracy` makes inference a first-class
+training signal. See [`iterations.md`](iterations.md) for the post-mortem.
+
 ---

 ## Key config choices
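To illustrate how the four layers combine for one candidate completion, here is a hedged sketch. The completion regex, helper names, and the belief-scoring formula are assumptions (the real implementations live in the training code); only the pass/fail magnitudes mirror the table above.

```python
# Illustrative sketch of the 4-layer reward stack described above. Helper names,
# the completion format, and the belief scoring are assumptions; only the
# pass/fail magnitudes mirror the table. The real logic lives in the training code.
import re

VALID_ACTIONS = {"DEEP_WORK", "SOCIALIZE", "ME_TIME", "FAMILY_TIME", "BINGE_WATCH"}  # subset, for illustration
COMPLETION_RE = re.compile(r"^([A-Z_]+)\s+(\d)(\d)(\d)$")  # assumed "ACTION + 3 belief digits" shape

def score_completion(text, env_step_reward, true_profile_digits):
    rewards = {}

    # Layer 1: format_valid -- parseable ACTION + 3 belief digits?
    m = COMPLETION_RE.match(text.strip())
    rewards["format_valid"] = 1.0 if m else -2.0
    if not m:
        return rewards

    action, *belief = m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))

    # Layer 2: action_legal -- one of the valid ActionType values?
    legal = action in VALID_ACTIONS
    rewards["action_legal"] = 0.5 if legal else -1.0

    # Layer 3: env_reward -- real reward from replaying the seeded episode and
    # stepping the environment with this candidate action (injected here as a callable).
    rewards["env_reward"] = env_step_reward(action) if legal else -3.0

    # Layer 4: belief_accuracy -- closeness of emitted belief digits to the hidden
    # profile, squashed into [-0.5, +0.5] (a stand-in for the actual cosine-MAE score).
    mae = sum(abs(b - t) for b, t in zip(belief, true_profile_digits)) / 3
    rewards["belief_accuracy"] = 0.5 - min(mae / 9, 1.0)  # 0 error -> +0.5, max error -> -0.5

    return rewards

print(score_completion("SOCIALIZE 372", lambda a: 0.8, true_profile_digits=[4, 6, 2]))
```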
training/RhythmEnv_GRPO_Training.ipynb
@@ -3,19 +3,7 @@
 {
  "cell_type": "markdown",
  "metadata": {},
- "source": [
-  "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n",
-  "\n",
-  "Train an LLM agent to *learn how to learn a person*. The agent receives 4-layer reward:\n",
-  "- format_valid (parseable ACTION + 3 belief digits)\n",
-  "- action_legal (one of 10 valid actions)\n",
-  "- env_reward (actual env reward via seed-replay)\n",
-  "- belief_accuracy (cosine-MAE vs hidden true profile vector)\n",
-  "\n",
-  "Continuous profile space + curriculum warmup + adaptation-aware grader \u2192 trains a meta-policy that generalizes to unseen personalities.\n",
-  "\n",
-  "Default config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
- ]
+ "source": "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n\n> **\u26a0\ufe0f Read this banner first \u2014 what this notebook is and isn't.**\n>\n> This notebook runs **pure GRPO from scratch** on Qwen 2.5-3B with the\n> 4-layer reward stack. It corresponds to **Iter 5** in\n> [`docs/iterations.md`](../docs/iterations.md) and is preserved here as\n> reproducible evidence of that iteration. **It is NOT the path that\n> produced the submission's headline numbers.**\n>\n> Five GRPO-from-scratch iterations all matched the heuristic baseline but\n> never beat it \u2014 small models (\u22643B) need a teacher to bootstrap reasoning\n> at this scale.\n>\n> The headline result comes from **Algorithm Distillation** instead:\n>\n> 1. `scripts/generate_teacher_trajectories.py` \u2014 gpt-5.4 plays episodes and writes reasoning + beliefs.\n> 2. `scripts/sft_on_hf.py` \u2014 distill teacher trajectories into Qwen 3B (the SFT v3 checkpoint).\n> 3. `scripts/train_on_hf.py` \u2014 *optional* GRPO refine **on top of** the SFT checkpoint. This is where the GRPO machinery this notebook exercises (via `training/train.py`) is actually used in the final pipeline.\n>\n> See [`README.md`](../README.md) and [`BLOG.md`](../BLOG.md) for the\n> end-to-end story. To reproduce the headline numbers, follow Stages 1-3\n> of the README's *Training: Algorithm Distillation* section, not this\n> notebook.\n\nWhat this notebook actually does (pure GRPO from scratch with the 4-layer reward):\n\n- format_valid (parseable ACTION + 3 belief digits)\n- action_legal (one of 10 valid actions)\n- env_reward (actual env reward via seed-replay)\n- belief_accuracy (cosine-MAE vs hidden true profile vector)\n\nContinuous profile space + curriculum warmup + adaptation-aware grader.\n\nDefault config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
 },
 {
  "cell_type": "markdown",

@@ -86,12 +74,12 @@
  "\n",
  "env = RhythmEnvironment()\n",
  "obs = env.reset(seed=0)\n",
+ "print(f\"Initial state \u2014 Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
  "      f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
  "\n",
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
+ "print(f\"\\nAfter DEEP_WORK \u2014 Reward: {obs.reward:.3f}\")\n",
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
  "print(\"\\nEnvironment is working!\")"
 ]

@@ -367,7 +355,7 @@
  " ax.plot(steps, losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
  " ax.set_xlabel(\"Training Step\")\n",
  " ax.set_ylabel(\"Loss\")\n",
+ " ax.set_title(\"GRPO Training Loss \u2014 RhythmEnv Meta-RL\")\n",
  " ax.grid(True, alpha=0.3)\n",
  " plt.tight_layout()\n",
  " plt.savefig(\"plots/training_loss.png\", dpi=150)\n",

@@ -382,10 +370,10 @@
  " ax.plot(rsteps, rvals, color=\"#16a34a\", linewidth=1.5, label=f\"Mean Reward ({rkey})\")\n",
  " if svals and len(svals) == len(rvals):\n",
  "     r, s = np.array(rvals), np.array(svals)\n",
+ "     ax.fill_between(rsteps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"\u00b11 std\")\n",
  " ax.set_xlabel(\"Training Step\")\n",
  " ax.set_ylabel(\"Mean Total Reward\")\n",
+ " ax.set_title(\"GRPO Mean Reward over Training \u2014 RhythmEnv Meta-RL\")\n",
  " ax.legend()\n",
  " ax.grid(True, alpha=0.3)\n",
  " plt.tight_layout()\n",

@@ -440,7 +428,7 @@
  " ax.plot(smooth_x, smooth, color=\"#7e22ce\", linewidth=2.5, label=f\"Rolling mean ({win}-step)\")\n",
  " ax.axhline(0.0, color=\"k\", linewidth=0.5, linestyle=\"--\", alpha=0.5, label=\"neutral belief baseline (0.0)\")\n",
  " ax.set_xlabel(\"Training Step\")\n",
+ " ax.set_ylabel(\"Mean belief_accuracy reward (-0.5 to +0.5)\")\n",
  " ax.set_title(\"Belief-Accuracy Reward over Training (proof agent learned to model user)\")\n",
  " ax.legend(loc=\"best\")\n",
  " ax.grid(True, alpha=0.3)\n",