InosLihka Claude Opus 4.7 (1M context) committed on
Commit ecbe0d8 · 1 Parent(s): 0a15ab5

env: meta-RL refactor (continuous profiles, action+belief, adaptation grader)


Transforms rhythm_env from a 3-profile single-task RL benchmark into a
meta-RL environment where the agent must infer the user's hidden personality
from observations alone. Original 500-step training regressed vs heuristic
on all 3 profiles; this refactor addresses the four root causes:

1. server/rhythm_environment.py
- Continuous profile sampling: sample_profile(seed) draws 14 personality
parameters from bounded distributions per episode (memorization
impossible, agent must learn the *skill* of inference)
- Discrete 3-profile mode preserved via profile_mode='discrete' or
profile=<name> for legacy comparison
- profile_to_belief_vector(): reduces a profile to 3-dim ground truth
[social, morning, work] for belief-accuracy reward
- get_belief_target() / get_profile_hint(): privileged accessors for
training-side reward computation and curriculum hints
- Grader rewrite: dropped meter_balance (was rewarding random behavior),
added adaptation_score (0.30 weight) computed as late-half mean reward
minus early-half, gated by absolute late-half quality. This is the
direct meta-learning signal: only positive if the agent got better
mid-episode
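
   A minimal sketch of that adaptation term (the names and the gating
   constant are illustrative, not the shipped grader code):

   ```python
   def adaptation_score(step_rewards: list[float], quality_floor: float = 0.0) -> float:
       """Late-half mean reward minus early-half mean, gated on absolute
       late-half quality. Sketch of the grader term described above; the
       quality_floor value is an assumption."""
       if len(step_rewards) < 2:
           return 0.0
       half = len(step_rewards) // 2
       early = sum(step_rewards[:half]) / half
       late = sum(step_rewards[half:]) / (len(step_rewards) - half)
       if late <= quality_floor:  # gate: improving from terrible to mediocre earns nothing
           return 0.0
       return late - early  # positive only if the agent got better mid-episode
   ```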

2. training/reward_functions.py
- 4-layer reward stack adds belief_accuracy as the meta-RL gradient
- extract_action_and_belief() parses the 'ACTION_NAME S M W' format, where
  S/M/W are 0-9 digits encoding the agent's belief about the user (see the
  parser sketch after this list)
- format_valid graduated: action+belief +1.0, action only +0.5, no action -2.0
- env_reward and belief_accuracy use seed+history+profile_mode replay
for deterministic, non-fabricable rewards
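
   A sketch of how such a parser could look (the regex and return
   convention are assumptions; the real implementation lives in
   training/reward_functions.py):

   ```python
   import re

   # 'ACTION_NAME S M W': an upper-case action token followed by three 0-9 digits.
   ACTION_BELIEF_RE = re.compile(r"^([A-Z_]+)\s+(\d)\s+(\d)\s+(\d)\s*$")

   def extract_action_and_belief(completion: str):
       """Return (action, belief) with the S/M/W digits scaled to [0, 1],
       or (None, None) if the completion is not parseable."""
       match = ACTION_BELIEF_RE.match(completion.strip())
       if match is None:
           return None, None
       action = match.group(1)
       belief = [int(match.group(i)) / 9.0 for i in (2, 3, 4)]
       return action, belief
   ```

   Under the graduated scheme above, 'DEEP_WORK 3 8 7' parses fully (+1.0),
   a bare action name earns +0.5, and anything else -2.0.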

3. training/dataset.py
- profile_mode parameter (default 'continuous')
- hint_fraction parameter for curriculum: a fraction of training samples
  include the true profile vector in the prompt as a warmup (see the
  curriculum sketch after this list)
- System prompt updated to teach action+belief format with examples
- Observation prompt now includes recent step history
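
   A sketch of the curriculum mixing (the prompt wording and RNG handling
   are assumptions):

   ```python
   import random
   from typing import Optional

   def maybe_add_hint(prompt: str, belief_target: tuple,
                      hint_fraction: float = 0.15,
                      rng: Optional[random.Random] = None) -> str:
       """With probability hint_fraction, prepend the true 3-dim profile
       vector as a warmup hint; otherwise the model must infer it from
       observations alone."""
       rng = rng or random.Random()
       if rng.random() < hint_fraction:
           s, m, w = belief_target
           return f"HINT: user profile S={s:.2f} M={m:.2f} W={w:.2f}\n{prompt}"
       return prompt
   ```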

4. training/train.py + RhythmEnv_GRPO_Training.ipynb
- max_steps default 1500 (was 500), LoRA rank 8 (was 4)
- max_prompt_length 600, max_completion_length 20 (room for belief)
- hint_fraction=0.15 default (full config sketched after this list)
- Notebook: comparison plot uses eval_results.json schema, plots both
final_score and adaptation across 3 conditions (discrete/in-dist/OOD)
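
   The key trainer settings, sketched as a TRL GRPOConfig (these argument
   names exist in recent TRL releases; the values come from the text above):

   ```python
   from trl import GRPOConfig

   config = GRPOConfig(
       output_dir="outputs",
       max_steps=1500,             # was 500 in v1
       learning_rate=5e-5,
       beta=0.1,                   # KL penalty
       max_grad_norm=0.5,
       num_generations=4,
       max_prompt_length=600,
       max_completion_length=20,   # room for 'ACTION_NAME S M W'
   )
   ```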

5. training/inference_eval.py
- 3-condition eval suite: discrete-3-profiles (legacy comparison),
continuous-in-distribution, continuous-OOD (generalization test)
- Tracks adaptation and belief_mae per episode (belief_mae sketched after this list)
- OOD test is the meta-learning proof: trained agent must beat heuristic
on profiles drawn from a held-out parameter region
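
   belief_mae is presumably a plain mean absolute error between the agent's
   final belief and the hidden ground truth; a minimal sketch:

   ```python
   def belief_mae(predicted: list[float], true: list[float]) -> float:
       """Mean absolute error between the belief digits (scaled to [0, 1])
       and the hidden [social, morning, work] ground-truth vector."""
       return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)
   ```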

6. inference.py
- System prompt updated for action+belief format
- parse_llm_action delegates to training parser (consistent with training)
- Observation prompt includes step history

7. New diagnostics + baselines
- scripts/diagnostic_replay.py: verifies env_reward replay is deterministic
  (passes 280/280 steps with 0 mismatches; see the sketch after this list)
- scripts/pipeline_dryrun.py: validates 4-layer reward ordering on
synthetic completions (perfect=+2.60 > good=+2.21 > wrong_belief=+2.20
> action_only=+1.41 > garbage=-6.20)
- eval_baselines_meta.json: new grader baselines across 3 conditions
- eval_results_v1.json: preserved old grader baselines for comparison
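
   The determinism check presumably boils down to replaying the same
   (seed, action history) twice and diffing the per-step rewards; a
   hypothetical sketch (env_factory and the step return shape are
   assumptions):

   ```python
   def replay_is_deterministic(env_factory, seed: int, actions: list) -> bool:
       """Replay one action sequence twice from the same seed and compare
       the per-step reward streams; any mismatch would make the replay-based
       env_reward (and hence the training signal) untrustworthy."""
       runs = []
       for _ in range(2):
           env = env_factory(seed=seed, profile_mode="continuous")
           runs.append([env.step(a)[1] for a in actions])  # assumes (obs, reward, ...) tuples
       return runs[0] == runs[1]
   ```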

8. README.md rewritten with meta-RL framing, parameter distribution table,
action+belief format docs, new baseline numbers (heuristic in-dist
0.587, OOD 0.580 — the bars to beat). All 31 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -9,24 +9,25 @@ tags:
  - openenv
  ---
 
- # RhythmEnv — Life Simulator RL Environment
+ # RhythmEnv — Meta-RL Life Simulator
 
- An OpenEnv environment where AI agents learn to balance 5 life meters across a 7-day week while discovering hidden personality profiles through reward signals.
+ An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space — the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.
+
+ This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life; it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.
 
  ## Links
 
  - **Live Environment**: [HF Space — InosLihka/rhythm_env](https://huggingface.co/spaces/InosLihka/rhythm_env)
  - **Training Notebook**: [Colab — GRPO Training](training/RhythmEnv_GRPO_Training.ipynb)
  - **Blog Post**: [Teaching an AI to Know You (Without Asking)](docs/blog_post.md)
- <!-- - **Training Results**: See [Results](#training-results) section below -->
 
  ## Why a Life Simulator?
 
- Personal AI assistants give generic advice. They don't know *you*. RhythmEnv is an environment where an agent learns your specific patterns through experience, not configuration.
+ Personal AI assistants give generic advice. They don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference: it personalizes in ~5 interactions instead of 50+.
 
- Every person has a hidden "DNA" that determines how activities affect their well-being: introverts drain faster from socializing, workaholics find peace through progress, night owls peak in the evening. The agent must discover these hidden dynamics by observing how the same actions produce different rewards across episodes.
+ Every sampled person has a hidden "DNA": a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.
 
- This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent must plan across a full week while adapting to a hidden personality it must infer.
+ This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent plans across a full week while inferring a hidden personality from observation alone.
 
  ## Quick Start
 
@@ -90,91 +91,142 @@ All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receiv
  - **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
  - **Deterministic** given seed: same seed → same episode trajectory
 
- ## The Hidden Variable System (Core Innovation)
+ ## The Meta-Learning Setup (Core Innovation)
 
- ### What the Agent Sees
- - All 5 meter values
+ ### What the Agent Sees Each Step
+ - All 5 meter values + per-meter deltas from the last action
  - Current day, slot, timestep
- - Active event (if any)
- - Total scalar reward per step
+ - Active random event (if any)
+ - Rolling 7-step history of (action, reward, deltas)
+ - Total scalar reward
 
  ### What the Agent Does NOT See
- - **Personality profile name** — assigned per episode, controls everything below
- - **Reward weight decomposition** — same meter changes produce different rewards for different profiles
- - **Action modifiers** — social drain, cognitive bonuses, shame spirals vary by profile
+ - **The hidden personality vector** — sampled per episode, controls everything below
+ - **Reward weight decomposition** — same meter changes produce different rewards for different people
+ - **Action modifiers** — social drain, cognitive bonuses, shame spirals vary continuously
 
- ### 3 Hidden Personality Profiles
+ ### Continuous Personality Space
 
- | Profile | Hidden Trait Values | "Good Day" (Reward Weights) |
- |---------|----------------|-------------------|
- | **Introvert Morning Person** | Social drains 3× Vitality, Morning cognition ×2.0, Binge shame spiral, Solo activities +0.10 Serenity | **Serenity 60%**, Progress 20% |
- | **Extrovert Night Owl** | Social drains only 0.2× Vitality, Evening cognition ×1.8, Morning penalty ×0.4, Social gives 2× Connection + Serenity bonus | **Connection 75%** |
- | **Workaholic Stoic** | Progress gives +0.10 Serenity, Idle drains -0.10 Serenity, Work gives +0.06 Vitality recovery, 1.5× Vitality decay | **Progress 70%** |
+ Each `reset()` samples a fresh personality from parameter distributions:
+
+ | Parameter | Distribution | Effect |
+ |---|---|---|
+ | `social_vitality_multiplier` | U(0.2, 3.0) | low = extrovert, high = introvert |
+ | `morning_cognition_bonus` | U(0.4, 2.0) or none | morning person trait |
+ | `evening_night_cognition_bonus` | U(0.6, 1.8) or none | night owl trait |
+ | `binge_shame` | Bernoulli(0.5) | conscientiousness |
+ | `progress_serenity_bonus` | U(0, 0.10) | workaholic peace-from-work |
+ | `work_vitality_recovery` | U(0, 0.06) | work energizes vs drains |
+ | `solo_serenity_bonus` | U(0, 0.10) | introvert solo recharge |
+ | `social_connection_multiplier` | U(1, 2) | extrovert social fluency |
+ | `social_serenity_bonus` | U(0, 0.06) | extrovert peace-from-people |
+ | `connection_decay_rate` | U(0.005, 0.02) | how fast relationships fade |
+ | `vitality_decay_rate` | U(0, 0.04) | base energy decline |
+ | `event_impact_multiplier` | U(0.5, 1.0) | resilience to setbacks |
+ | `stress_tolerance` | U(0.15, 0.30) | when the stress spiral kicks in |
+ | `reward_weights` | Dirichlet biased to non-vit/cog | what THIS person values |
+
+ This produces an effectively infinite personality space — memorization is impossible; the agent must learn the *skill* of inference.
+
+ ### The 3 Discrete Profiles (legacy / eval anchor)
+
+ For comparison with the original benchmark, the env also exposes 3 hardcoded personalities reachable via `profile_mode='discrete'` or `profile=<name>`:
+
+ - **Introvert Morning Person** → belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
+ - **Extrovert Night Owl** → belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
+ - **Workaholic Stoic** → belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`
+
+ These exist as anchor points in the continuous space.
+
+ ### The Action+Belief Output Format
+
+ Each step the agent outputs ONE LINE: `ACTION_NAME S M W`
+
+ where `S M W` are single digits (0-9) representing the agent's CURRENT belief about the user:
+ - **S** = social preference (0 = hates social, 9 = loves social)
+ - **M** = morning preference (0 = night owl, 9 = morning person)
+ - **W** = work preference (0 = avoids work, 9 = workaholic)
+
+ Example: `DEEP_WORK 3 8 7` means *do deep work, AND I think this person is moderately introverted, a strong morning person, and somewhat work-oriented*.
+
+ The belief output is the meta-learning lever. The agent receives a separate reward for belief accuracy (cosine-vs-MAE against the hidden true vector), forcing it to *explicitly model the user* rather than just pattern-match action→reward.
 
  ### The Discovery Challenge
 
  The agent must:
- 1. **Probe** — try different actions to see unexpected meter changes
- 2. **Infer** — correlate reward patterns to discover the hidden personality
- 3. **Adapt** — change its entire strategy based on what it learned
-
- A trained agent for the Introvert learns to: front-load deep work in the morning, avoid socializing, never binge watch. The same agent for the Extrovert learns to: socialize freely, work in the evening, skip mornings.
+ 1. **Probe** — try different actions in early steps to see how the person responds
+ 2. **Infer** — update its belief vector each step based on observed rewards
+ 3. **Adapt** — late in the episode, exploit the belief by choosing actions matching the inferred personality
 
- ## Reward Architecture
+ ## Reward Architecture (4-layer training stack)
 
- **Per-step reward** = `sum(meter_delta × hidden_weight) × 15`
+ | Layer | Function | Range | Purpose |
+ |---|---|---|---|
+ | 1 | `format_valid` | -2 to +1 | parseable as ACTION + 3 belief digits |
+ | 2 | `action_legal` | -1 to +0.5 | action is one of 10 valid types |
+ | 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
+ | 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs true profile vector |
 
- The reward weights are hidden and differ per profile. The agent sees only the total scalar reward.
+ **Per-step env reward** = `sum(meter_delta × hidden_weight) × 15`, where the weights are sampled per profile.
 
- **Critical threshold**: If any meter drops below 0.1, a -0.30 penalty is applied.
+ **Critical threshold**: any meter < 0.1 → -0.30 penalty.
 
- **Final grader** (0-1 score at episode end):
+ **Final grader (meta-RL aligned)**, `final_score ∈ [0, 1]`:
  ```
- score = 0.30 × meter_balance + 0.25 × crash_free + 0.20 × progress + 0.15 × connection + 0.10 × efficiency
+ score = 0.20 × crash_free + 0.25 × progress + 0.15 × connection
+       + 0.30 × adaptation_score + 0.10 × efficiency
  ```
 
- ## Baseline Scores
+ **`adaptation_score` is the meta-learning signal**: late-half mean reward minus early-half mean, gated by absolute late-half quality so a "terrible-then-mediocre" exploit cannot win. It directly rewards "the agent got better as it learned the user." Random and heuristic baselines have *negative* adaptation; only an agent that actually learns mid-episode can win this term.
 
+ ## Baseline Scores (new grader, 5 episodes per condition)
+
+ ### Discrete-3-profiles (legacy comparison)
  | Profile | Heuristic | Random |
  |---------|-----------|--------|
- | Introvert Morning | **0.765** | 0.693 |
- | Extrovert Night Owl | **0.819** | 0.863 |
- | Workaholic Stoic | **0.761** | 0.602 |
-
- The heuristic uses fixed rules (sleep at night, work in morning, socialize when connection is low). It doesn't adapt to profiles — a trained agent that discovers the hidden personality should significantly outperform it.
+ | Introvert Morning | **0.574** | 0.535 |
+ | Extrovert Night Owl | 0.594 | **0.641** |
+ | Workaholic Stoic | **0.585** | 0.485 |
 
- ## Training Results
+ ### Continuous in-distribution (the meta-RL training surface)
+ | Strategy | avg_score | avg_adaptation |
+ |---|---|---|
+ | Heuristic | **0.587** | -0.242 |
+ | Random | 0.516 | -0.304 |
 
- Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https://github.com/unslothai/unsloth) for efficient 4-bit fine-tuning on Qwen 2.5-3B (4-bit + LoRA rank 4, free Colab T4, ~3 hours for 500 steps).
+ ### Continuous OOD (held-out parameter region generalization test)
+ | Strategy | avg_score | avg_adaptation |
+ |---|---|---|
+ | Heuristic | **0.580** | -0.083 |
+ | Random | 0.508 | -0.218 |
 
- **Three-layer reward stack:**
- 1. `format_valid` — did the model output a parseable action name? (+1.0 / -2.0)
- 2. `action_legal` — is it one of the 10 valid actions? (+0.5 / -1.0)
- 3. `env_reward` — actual reward from seed-based episode replay (cannot be fabricated)
+ The trained agent must beat the heuristic on continuous in-distribution AND on the OOD region — beating only in-distribution would suggest memorization, not meta-learning. Positive `adaptation` is direct evidence of mid-episode learning.
 
- **Key config:** `beta=0.1` (KL penalty), `lr=5e-5`, `max_grad_norm=0.5`, `max_completion_length=16`
+ ## Training
 
- ### Training Curves
+ Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https://github.com/unslothai/unsloth) for efficient 4-bit fine-tuning of Qwen 2.5-3B + LoRA rank 8 (free Colab T4, ~6 hours for 1500 steps).
 
- ![Training Loss](plots/training_loss.png)
- *GRPO training loss over 500 steps.*
+ **Run:**
+ ```bash
+ python training/train.py --max_steps 1500
+ ```
 
- ![Reward Curve](plots/reward_curve.png)
- *Mean reward with ±1 std band. Upward trend indicates the agent learning to balance meters.*
+ Defaults are tuned for meta-RL on a T4: continuous profile mode, hint_fraction=0.15 (curriculum warmup), num_generations=4, lr=5e-5, beta=0.1, max_prompt_length=600, max_completion_length=20.
 
- ### Baseline vs Trained Comparison
+ **Curriculum:** ~15% of training samples include the true profile vector in the prompt as a *training hint*, teaching the model the *skill* of using profile signals before forcing inference. The remaining 85% require inference from observation alone.
 
- ![Baseline vs Trained](plots/baseline_vs_trained.png)
+ ### Why this should beat the original 500-step result
 
- | Profile | Random | Heuristic | Trained (500 steps) | vs Heuristic |
- |---------|--------|-----------|---------------------|--------------|
- | Introvert Morning | 0.693 | 0.765 | **0.617** | -0.148 |
- | Extrovert Night Owl | 0.863 | 0.819 | **0.725** | -0.094 |
- | Workaholic Stoic | 0.602 | 0.761 | **0.539** | -0.222 |
+ The original v1 single-task training (500 steps, LoRA rank 4) **regressed vs the heuristic on all 3 profiles**. Root-cause analysis identified four issues that this version fixes:
 
- **What the agent learned**: Serenity and Connection management improved substantially — the introvert's serenity maxed at 1.0, the extrovert's connection reached 1.0. The model correctly learned the *sign* of personality differences (introvert avoids socialising; extrovert embraces it).
+ | Original problem | Fix |
+ |---|---|
+ | Grader's `meter_balance` (0.30) rewarded random behavior | Replaced with `adaptation_score` — only positive if the agent learned mid-episode |
+ | Only 3 hardcoded profiles → memorizable | Continuous parameter sampling per episode → memorization impossible |
+ | No explicit "model the user" output → no inference gradient | Action+belief format with `belief_accuracy` reward → forces user-modeling |
+ | 500 steps + LoRA rank 4 too small | 1500 steps + LoRA rank 8, curriculum warmup |
 
- **What still needs training**: The workaholic profile's key hidden modifier — that `DEEP_WORK` *recovers* vitality instead of draining it (+0.06 instead of -0.05) — was not discovered in 500 steps. Without this insight, the trained agent under-uses deep work for the workaholic. A longer run (1000+ steps) with higher DEEP_WORK sampling density in the workaholic training episodes is the clear next step.
+ Verification: run both `discrete-3-profiles` (legacy comparison vs the old eval_results.json) and `continuous-OOD` (the meta-learning generalization test).
 
  ## Setup Instructions
 
@@ -210,9 +262,9 @@ python inference.py
 
  | Method | Endpoint | Description |
  |--------|----------|-------------|
- | `POST` | `/reset` | Start a new episode (`seed`, `profile` optional) |
+ | `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (legacy 3-profile), `profile_mode='discrete'\|'continuous'` (default continuous = sampled meta-RL distribution) |
  | `POST` | `/step` | Execute an action (`action_type`) |
- | `GET` | `/state` | Get current state (includes hidden profile for debugging) |
+ | `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
  | `GET` | `/health` | Health check |
  | `GET` | `/metadata` | Environment metadata |
  | `GET` | `/schema` | Action/observation JSON schemas |
eval_baselines_meta.json ADDED
@@ -0,0 +1,3642 @@
+ [
+   {
+     "condition": "discrete-3-profiles (legacy)",
+     "profile_name": "introvert_morning",
+     "profile_mode": "discrete",
+     "strategy": "heuristic",
+     "seed": 0,
+     "final_score": 0.5745,
+     "total_reward": 5.84,
+     "adaptation": -0.28,
+     "vitality": 0.76,
+     "cognition": 0.41,
+     "progress": 1.0,
+     "serenity": 1.0,
+     "connection": 0.43,
+     "actions": ["deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "socialize", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "socialize", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep"],
+     "true_belief": [0.0, 1.0, 0.069],
+     "final_belief": null,
+     "belief_mae": null
+   },
+   …

(View truncated; the remaining records follow the same schema, one per condition × strategy × seed.)
1546
+ "learn",
1547
+ "learn",
1548
+ "admin_work",
1549
+ "family_time",
1550
+ "admin_work",
1551
+ "deep_work",
1552
+ "socialize"
1553
+ ],
1554
+ "true_belief": [
1555
+ 0.357,
1556
+ 0.5,
1557
+ 1.0
1558
+ ],
1559
+ "final_belief": null,
1560
+ "belief_mae": null
1561
+ },
1562
+ {
1563
+ "condition": "continuous-in-distribution",
1564
+ "profile_name": "sampled_100",
1565
+ "profile_mode": "continuous",
1566
+ "strategy": "heuristic",
1567
+ "seed": 100,
1568
+ "final_score": 0.5861,
1569
+ "total_reward": 10.74,
1570
+ "adaptation": -0.456,
1571
+ "vitality": 0.47,
1572
+ "cognition": 0.26,
1573
+ "progress": 1.0,
1574
+ "serenity": 1.0,
1575
+ "connection": 0.45,
1576
+ "actions": [
1577
+ "deep_work",
1578
+ "learn",
1579
+ "meditate",
1580
+ "sleep",
1581
+ "deep_work",
1582
+ "admin_work",
1583
+ "meditate",
1584
+ "sleep",
1585
+ "deep_work",
1586
+ "admin_work",
1587
+ "socialize",
1588
+ "sleep",
1589
+ "deep_work",
1590
+ "admin_work",
1591
+ "socialize",
1592
+ "sleep",
1593
+ "exercise",
1594
+ "admin_work",
1595
+ "meditate",
1596
+ "sleep",
1597
+ "deep_work",
1598
+ "admin_work",
1599
+ "socialize",
1600
+ "sleep",
1601
+ "deep_work",
1602
+ "admin_work",
1603
+ "meditate",
1604
+ "sleep"
1605
+ ],
1606
+ "true_belief": [
1607
+ 0.613,
1608
+ 0.5,
1609
+ 0.503
1610
+ ],
1611
+ "final_belief": null,
1612
+ "belief_mae": null
1613
+ },
1614
+ {
1615
+ "condition": "continuous-in-distribution",
1616
+ "profile_name": "sampled_101",
1617
+ "profile_mode": "continuous",
1618
+ "strategy": "heuristic",
1619
+ "seed": 101,
1620
+ "final_score": 0.5802,
1621
+ "total_reward": 3.85,
1622
+ "adaptation": -0.041,
1623
+ "vitality": 1.0,
1624
+ "cognition": 0.51,
1625
+ "progress": 1.0,
1626
+ "serenity": 1.0,
1627
+ "connection": 0.5,
1628
+ "actions": [
1629
+ "deep_work",
1630
+ "learn",
1631
+ "meditate",
1632
+ "sleep",
1633
+ "deep_work",
1634
+ "learn",
1635
+ "meditate",
1636
+ "sleep",
1637
+ "deep_work",
1638
+ "admin_work",
1639
+ "meditate",
1640
+ "sleep",
1641
+ "deep_work",
1642
+ "admin_work",
1643
+ "socialize",
1644
+ "sleep",
1645
+ "deep_work",
1646
+ "admin_work",
1647
+ "meditate",
1648
+ "sleep",
1649
+ "deep_work",
1650
+ "admin_work",
1651
+ "meditate",
1652
+ "sleep",
1653
+ "deep_work",
1654
+ "admin_work",
1655
+ "meditate",
1656
+ "sleep"
1657
+ ],
1658
+ "true_belief": [
1659
+ 0.545,
1660
+ 0.277,
1661
+ 0.461
1662
+ ],
1663
+ "final_belief": null,
1664
+ "belief_mae": null
1665
+ },
1666
+ {
1667
+ "condition": "continuous-in-distribution",
1668
+ "profile_name": "sampled_102",
1669
+ "profile_mode": "continuous",
1670
+ "strategy": "heuristic",
1671
+ "seed": 102,
1672
+ "final_score": 0.5912,
1673
+ "total_reward": 10.16,
1674
+ "adaptation": -0.432,
1675
+ "vitality": 0.44,
1676
+ "cognition": 0.24,
1677
+ "progress": 1.0,
1678
+ "serenity": 0.98,
1679
+ "connection": 0.49,
1680
+ "actions": [
1681
+ "deep_work",
1682
+ "learn",
1683
+ "meditate",
1684
+ "sleep",
1685
+ "deep_work",
1686
+ "admin_work",
1687
+ "meditate",
1688
+ "sleep",
1689
+ "deep_work",
1690
+ "admin_work",
1691
+ "socialize",
1692
+ "sleep",
1693
+ "deep_work",
1694
+ "admin_work",
1695
+ "meditate",
1696
+ "sleep",
1697
+ "deep_work",
1698
+ "admin_work",
1699
+ "meditate",
1700
+ "sleep",
1701
+ "deep_work",
1702
+ "admin_work",
1703
+ "meditate",
1704
+ "sleep",
1705
+ "deep_work",
1706
+ "admin_work",
1707
+ "socialize",
1708
+ "sleep"
1709
+ ],
1710
+ "true_belief": [
1711
+ 0.252,
1712
+ 0.5,
1713
+ 0.558
1714
+ ],
1715
+ "final_belief": null,
1716
+ "belief_mae": null
1717
+ },
1718
+ {
1719
+ "condition": "continuous-in-distribution",
1720
+ "profile_name": "sampled_103",
1721
+ "profile_mode": "continuous",
1722
+ "strategy": "heuristic",
1723
+ "seed": 103,
1724
+ "final_score": 0.59,
1725
+ "total_reward": 9.67,
1726
+ "adaptation": -0.465,
1727
+ "vitality": 0.32,
1728
+ "cognition": 0.83,
1729
+ "progress": 1.0,
1730
+ "serenity": 1.0,
1731
+ "connection": 0.48,
1732
+ "actions": [
1733
+ "deep_work",
1734
+ "learn",
1735
+ "meditate",
1736
+ "sleep",
1737
+ "deep_work",
1738
+ "admin_work",
1739
+ "meditate",
1740
+ "sleep",
1741
+ "deep_work",
1742
+ "admin_work",
1743
+ "socialize",
1744
+ "sleep",
1745
+ "deep_work",
1746
+ "admin_work",
1747
+ "meditate",
1748
+ "sleep",
1749
+ "exercise",
1750
+ "admin_work",
1751
+ "meditate",
1752
+ "sleep",
1753
+ "deep_work",
1754
+ "admin_work",
1755
+ "meditate",
1756
+ "sleep",
1757
+ "deep_work",
1758
+ "admin_work",
1759
+ "socialize",
1760
+ "sleep"
1761
+ ],
1762
+ "true_belief": [
1763
+ 0.629,
1764
+ 0.951,
1765
+ 0.487
1766
+ ],
1767
+ "final_belief": null,
1768
+ "belief_mae": null
1769
+ },
1770
+ {
1771
+ "condition": "continuous-in-distribution",
1772
+ "profile_name": "sampled_104",
1773
+ "profile_mode": "continuous",
1774
+ "strategy": "heuristic",
1775
+ "seed": 104,
1776
+ "final_score": 0.5856,
1777
+ "total_reward": 9.49,
1778
+ "adaptation": -0.184,
1779
+ "vitality": 0.72,
1780
+ "cognition": 0.63,
1781
+ "progress": 1.0,
1782
+ "serenity": 1.0,
1783
+ "connection": 0.46,
1784
+ "actions": [
1785
+ "deep_work",
1786
+ "learn",
1787
+ "meditate",
1788
+ "sleep",
1789
+ "deep_work",
1790
+ "admin_work",
1791
+ "meditate",
1792
+ "sleep",
1793
+ "deep_work",
1794
+ "family_time",
1795
+ "meditate",
1796
+ "sleep",
1797
+ "deep_work",
1798
+ "admin_work",
1799
+ "socialize",
1800
+ "sleep",
1801
+ "deep_work",
1802
+ "admin_work",
1803
+ "socialize",
1804
+ "sleep",
1805
+ "deep_work",
1806
+ "admin_work",
1807
+ "meditate",
1808
+ "sleep",
1809
+ "deep_work",
1810
+ "admin_work",
1811
+ "socialize",
1812
+ "sleep"
1813
+ ],
1814
+ "true_belief": [
1815
+ 0.302,
1816
+ 0.5,
1817
+ 0.693
1818
+ ],
1819
+ "final_belief": null,
1820
+ "belief_mae": null
1821
+ },
1822
+ {
1823
+ "condition": "continuous-in-distribution",
1824
+ "profile_name": "sampled_105",
1825
+ "profile_mode": "continuous",
1826
+ "strategy": "heuristic",
1827
+ "seed": 105,
1828
+ "final_score": 0.5866,
1829
+ "total_reward": 5.04,
1830
+ "adaptation": -0.353,
1831
+ "vitality": 0.32,
1832
+ "cognition": 0.5,
1833
+ "progress": 1.0,
1834
+ "serenity": 0.96,
1835
+ "connection": 0.53,
1836
+ "actions": [
1837
+ "deep_work",
1838
+ "learn",
1839
+ "meditate",
1840
+ "sleep",
1841
+ "deep_work",
1842
+ "admin_work",
1843
+ "meditate",
1844
+ "sleep",
1845
+ "deep_work",
1846
+ "admin_work",
1847
+ "meditate",
1848
+ "sleep",
1849
+ "deep_work",
1850
+ "admin_work",
1851
+ "socialize",
1852
+ "sleep",
1853
+ "exercise",
1854
+ "admin_work",
1855
+ "meditate",
1856
+ "sleep",
1857
+ "deep_work",
1858
+ "admin_work",
1859
+ "meditate",
1860
+ "sleep",
1861
+ "exercise",
1862
+ "admin_work",
1863
+ "socialize",
1864
+ "sleep"
1865
+ ],
1866
+ "true_belief": [
1867
+ 0.458,
1868
+ 0.5,
1869
+ 0.315
1870
+ ],
1871
+ "final_belief": null,
1872
+ "belief_mae": null
1873
+ },
1874
+ {
1875
+ "condition": "continuous-in-distribution",
1876
+ "profile_name": "sampled_106",
1877
+ "profile_mode": "continuous",
1878
+ "strategy": "heuristic",
1879
+ "seed": 106,
1880
+ "final_score": 0.5881,
1881
+ "total_reward": 6.6,
1882
+ "adaptation": -0.196,
1883
+ "vitality": 0.46,
1884
+ "cognition": 0.22,
1885
+ "progress": 1.0,
1886
+ "serenity": 0.98,
1887
+ "connection": 0.51,
1888
+ "actions": [
1889
+ "deep_work",
1890
+ "learn",
1891
+ "meditate",
1892
+ "sleep",
1893
+ "deep_work",
1894
+ "admin_work",
1895
+ "meditate",
1896
+ "sleep",
1897
+ "deep_work",
1898
+ "admin_work",
1899
+ "meditate",
1900
+ "sleep",
1901
+ "deep_work",
1902
+ "admin_work",
1903
+ "meditate",
1904
+ "sleep",
1905
+ "deep_work",
1906
+ "admin_work",
1907
+ "meditate",
1908
+ "sleep",
1909
+ "deep_work",
1910
+ "admin_work",
1911
+ "meditate",
1912
+ "sleep",
1913
+ "deep_work",
1914
+ "admin_work",
1915
+ "socialize",
1916
+ "sleep"
1917
+ ],
1918
+ "true_belief": [
1919
+ 0.337,
1920
+ 0.5,
1921
+ 0.467
1922
+ ],
1923
+ "final_belief": null,
1924
+ "belief_mae": null
1925
+ },
1926
+ {
1927
+ "condition": "continuous-in-distribution",
1928
+ "profile_name": "sampled_107",
1929
+ "profile_mode": "continuous",
1930
+ "strategy": "heuristic",
1931
+ "seed": 107,
1932
+ "final_score": 0.5826,
1933
+ "total_reward": 8.01,
1934
+ "adaptation": -0.072,
1935
+ "vitality": 0.31,
1936
+ "cognition": 0.61,
1937
+ "progress": 1.0,
1938
+ "serenity": 0.76,
1939
+ "connection": 0.46,
1940
+ "actions": [
1941
+ "deep_work",
1942
+ "learn",
1943
+ "meditate",
1944
+ "sleep",
1945
+ "deep_work",
1946
+ "admin_work",
1947
+ "meditate",
1948
+ "sleep",
1949
+ "deep_work",
1950
+ "admin_work",
1951
+ "meditate",
1952
+ "sleep",
1953
+ "deep_work",
1954
+ "admin_work",
1955
+ "meditate",
1956
+ "sleep",
1957
+ "deep_work",
1958
+ "admin_work",
1959
+ "socialize",
1960
+ "sleep",
1961
+ "deep_work",
1962
+ "admin_work",
1963
+ "socialize",
1964
+ "sleep",
1965
+ "deep_work",
1966
+ "admin_work",
1967
+ "socialize",
1968
+ "sleep"
1969
+ ],
1970
+ "true_belief": [
1971
+ 0.315,
1972
+ 0.376,
1973
+ 0.501
1974
+ ],
1975
+ "final_belief": null,
1976
+ "belief_mae": null
1977
+ },
1978
+ {
1979
+ "condition": "continuous-in-distribution",
1980
+ "profile_name": "sampled_108",
1981
+ "profile_mode": "continuous",
1982
+ "strategy": "heuristic",
1983
+ "seed": 108,
1984
+ "final_score": 0.5881,
1985
+ "total_reward": 7.99,
1986
+ "adaptation": -0.055,
1987
+ "vitality": 0.28,
1988
+ "cognition": 0.37,
1989
+ "progress": 1.0,
1990
+ "serenity": 1.0,
1991
+ "connection": 0.49,
1992
+ "actions": [
1993
+ "deep_work",
1994
+ "learn",
1995
+ "meditate",
1996
+ "sleep",
1997
+ "deep_work",
1998
+ "admin_work",
1999
+ "socialize",
2000
+ "sleep",
2001
+ "deep_work",
2002
+ "admin_work",
2003
+ "meditate",
2004
+ "sleep",
2005
+ "deep_work",
2006
+ "admin_work",
2007
+ "meditate",
2008
+ "sleep",
2009
+ "exercise",
2010
+ "admin_work",
2011
+ "socialize",
2012
+ "sleep",
2013
+ "exercise",
2014
+ "admin_work",
2015
+ "meditate",
2016
+ "sleep",
2017
+ "deep_work",
2018
+ "admin_work",
2019
+ "socialize",
2020
+ "sleep"
2021
+ ],
2022
+ "true_belief": [
2023
+ 0.741,
2024
+ 0.5,
2025
+ 0.143
2026
+ ],
2027
+ "final_belief": null,
2028
+ "belief_mae": null
2029
+ },
2030
+ {
2031
+ "condition": "continuous-in-distribution",
2032
+ "profile_name": "sampled_109",
2033
+ "profile_mode": "continuous",
2034
+ "strategy": "heuristic",
2035
+ "seed": 109,
2036
+ "final_score": 0.5936,
2037
+ "total_reward": 8.46,
2038
+ "adaptation": -0.162,
2039
+ "vitality": 0.37,
2040
+ "cognition": 0.36,
2041
+ "progress": 1.0,
2042
+ "serenity": 0.98,
2043
+ "connection": 0.52,
2044
+ "actions": [
2045
+ "deep_work",
2046
+ "learn",
2047
+ "meditate",
2048
+ "sleep",
2049
+ "deep_work",
2050
+ "admin_work",
2051
+ "meditate",
2052
+ "sleep",
2053
+ "deep_work",
2054
+ "admin_work",
2055
+ "meditate",
2056
+ "sleep",
2057
+ "deep_work",
2058
+ "admin_work",
2059
+ "socialize",
2060
+ "sleep",
2061
+ "deep_work",
2062
+ "admin_work",
2063
+ "meditate",
2064
+ "sleep",
2065
+ "exercise",
2066
+ "admin_work",
2067
+ "meditate",
2068
+ "sleep",
2069
+ "exercise",
2070
+ "admin_work",
2071
+ "socialize",
2072
+ "sleep"
2073
+ ],
2074
+ "true_belief": [
2075
+ 0.46,
2076
+ 0.718,
2077
+ 0.273
2078
+ ],
2079
+ "final_belief": null,
2080
+ "belief_mae": null
2081
+ },
2082
+ {
2083
+ "condition": "continuous-in-distribution",
2084
+ "profile_name": "sampled_100",
2085
+ "profile_mode": "continuous",
2086
+ "strategy": "random",
2087
+ "seed": 100,
2088
+ "final_score": 0.5315,
2089
+ "total_reward": 0.26,
2090
+ "adaptation": -0.375,
2091
+ "vitality": 0.0,
2092
+ "cognition": 0.0,
2093
+ "progress": 0.81,
2094
+ "serenity": 1.0,
2095
+ "connection": 0.84,
2096
+ "actions": [
2097
+ "exercise",
2098
+ "meditate",
2099
+ "socialize",
2100
+ "learn",
2101
+ "binge_watch",
2102
+ "admin_work",
2103
+ "family_time",
2104
+ "family_time",
2105
+ "socialize",
2106
+ "family_time",
2107
+ "deep_work",
2108
+ "admin_work",
2109
+ "me_time",
2110
+ "deep_work",
2111
+ "family_time",
2112
+ "meditate",
2113
+ "meditate",
2114
+ "admin_work",
2115
+ "learn",
2116
+ "deep_work",
2117
+ "admin_work",
2118
+ "learn",
2119
+ "learn",
2120
+ "deep_work",
2121
+ "family_time",
2122
+ "me_time",
2123
+ "me_time",
2124
+ "learn"
2125
+ ],
2126
+ "true_belief": [
2127
+ 0.613,
2128
+ 0.5,
2129
+ 0.503
2130
+ ],
2131
+ "final_belief": null,
2132
+ "belief_mae": null
2133
+ },
2134
+ {
2135
+ "condition": "continuous-in-distribution",
2136
+ "profile_name": "sampled_101",
2137
+ "profile_mode": "continuous",
2138
+ "strategy": "random",
2139
+ "seed": 101,
2140
+ "final_score": 0.5347,
2141
+ "total_reward": 6.42,
2142
+ "adaptation": -0.236,
2143
+ "vitality": 0.75,
2144
+ "cognition": 0.48,
2145
+ "progress": 0.51,
2146
+ "serenity": 1.0,
2147
+ "connection": 0.99,
2148
+ "actions": [
2149
+ "sleep",
2150
+ "binge_watch",
2151
+ "me_time",
2152
+ "deep_work",
2153
+ "admin_work",
2154
+ "socialize",
2155
+ "family_time",
2156
+ "socialize",
2157
+ "deep_work",
2158
+ "deep_work",
2159
+ "meditate",
2160
+ "socialize",
2161
+ "binge_watch",
2162
+ "me_time",
2163
+ "meditate",
2164
+ "sleep",
2165
+ "sleep",
2166
+ "socialize",
2167
+ "meditate",
2168
+ "me_time",
2169
+ "exercise",
2170
+ "family_time",
2171
+ "sleep",
2172
+ "sleep",
2173
+ "socialize",
2174
+ "deep_work",
2175
+ "family_time",
2176
+ "sleep"
2177
+ ],
2178
+ "true_belief": [
2179
+ 0.545,
2180
+ 0.277,
2181
+ 0.461
2182
+ ],
2183
+ "final_belief": null,
2184
+ "belief_mae": null
2185
+ },
2186
+ {
2187
+ "condition": "continuous-in-distribution",
2188
+ "profile_name": "sampled_102",
2189
+ "profile_mode": "continuous",
2190
+ "strategy": "random",
2191
+ "seed": 102,
2192
+ "final_score": 0.4745,
2193
+ "total_reward": -0.52,
2194
+ "adaptation": 0.19,
2195
+ "vitality": 0.2,
2196
+ "cognition": 0.13,
2197
+ "progress": 0.45,
2198
+ "serenity": 0.98,
2199
+ "connection": 0.74,
2200
+ "actions": [
2201
+ "family_time",
2202
+ "learn",
2203
+ "meditate",
2204
+ "me_time",
2205
+ "exercise",
2206
+ "socialize",
2207
+ "binge_watch",
2208
+ "socialize",
2209
+ "family_time",
2210
+ "admin_work",
2211
+ "me_time",
2212
+ "binge_watch",
2213
+ "deep_work",
2214
+ "meditate",
2215
+ "socialize",
2216
+ "binge_watch",
2217
+ "socialize",
2218
+ "sleep",
2219
+ "meditate",
2220
+ "sleep",
2221
+ "me_time",
2222
+ "learn",
2223
+ "learn",
2224
+ "learn",
2225
+ "me_time",
2226
+ "deep_work",
2227
+ "socialize",
2228
+ "sleep"
2229
+ ],
2230
+ "true_belief": [
2231
+ 0.252,
2232
+ 0.5,
2233
+ 0.558
2234
+ ],
2235
+ "final_belief": null,
2236
+ "belief_mae": null
2237
+ },
2238
+ {
2239
+ "condition": "continuous-in-distribution",
2240
+ "profile_name": "sampled_103",
2241
+ "profile_mode": "continuous",
2242
+ "strategy": "random",
2243
+ "seed": 103,
2244
+ "final_score": 0.5636,
2245
+ "total_reward": 1.31,
2246
+ "adaptation": -0.591,
2247
+ "vitality": 0.0,
2248
+ "cognition": 0.0,
2249
+ "progress": 1.0,
2250
+ "serenity": 1.0,
2251
+ "connection": 0.69,
2252
+ "actions": [
2253
+ "learn",
2254
+ "me_time",
2255
+ "binge_watch",
2256
+ "meditate",
2257
+ "socialize",
2258
+ "deep_work",
2259
+ "binge_watch",
2260
+ "me_time",
2261
+ "deep_work",
2262
+ "admin_work",
2263
+ "socialize",
2264
+ "deep_work",
2265
+ "me_time",
2266
+ "socialize",
2267
+ "learn",
2268
+ "family_time",
2269
+ "learn",
2270
+ "meditate",
2271
+ "admin_work",
2272
+ "sleep",
2273
+ "learn",
2274
+ "socialize",
2275
+ "admin_work",
2276
+ "me_time",
2277
+ "deep_work",
2278
+ "deep_work",
2279
+ "admin_work",
2280
+ "learn"
2281
+ ],
2282
+ "true_belief": [
2283
+ 0.629,
2284
+ 0.951,
2285
+ 0.487
2286
+ ],
2287
+ "final_belief": null,
2288
+ "belief_mae": null
2289
+ },
2290
+ {
2291
+ "condition": "continuous-in-distribution",
2292
+ "profile_name": "sampled_104",
2293
+ "profile_mode": "continuous",
2294
+ "strategy": "random",
2295
+ "seed": 104,
2296
+ "final_score": 0.5435,
2297
+ "total_reward": 6.85,
2298
+ "adaptation": 0.231,
2299
+ "vitality": 0.05,
2300
+ "cognition": 0.68,
2301
+ "progress": 0.46,
2302
+ "serenity": 1.0,
2303
+ "connection": 0.91,
2304
+ "actions": [
2305
+ "admin_work",
2306
+ "sleep",
2307
+ "family_time",
2308
+ "meditate",
2309
+ "binge_watch",
2310
+ "meditate",
2311
+ "family_time",
2312
+ "socialize",
2313
+ "exercise",
2314
+ "me_time",
2315
+ "sleep",
2316
+ "admin_work",
2317
+ "me_time",
2318
+ "meditate",
2319
+ "socialize",
2320
+ "meditate",
2321
+ "exercise",
2322
+ "meditate",
2323
+ "family_time",
2324
+ "deep_work",
2325
+ "deep_work",
2326
+ "socialize",
2327
+ "sleep",
2328
+ "me_time",
2329
+ "socialize",
2330
+ "family_time",
2331
+ "socialize",
2332
+ "meditate"
2333
+ ],
2334
+ "true_belief": [
2335
+ 0.302,
2336
+ 0.5,
2337
+ 0.693
2338
+ ],
2339
+ "final_belief": null,
2340
+ "belief_mae": null
2341
+ },
2342
+ {
2343
+ "condition": "continuous-in-distribution",
2344
+ "profile_name": "sampled_105",
2345
+ "profile_mode": "continuous",
2346
+ "strategy": "random",
2347
+ "seed": 105,
2348
+ "final_score": 0.468,
2349
+ "total_reward": -6.09,
2350
+ "adaptation": -0.579,
2351
+ "vitality": 0.12,
2352
+ "cognition": 0.01,
2353
+ "progress": 0.53,
2354
+ "serenity": 1.0,
2355
+ "connection": 0.99,
2356
+ "actions": [
2357
+ "learn",
2358
+ "binge_watch",
2359
+ "family_time",
2360
+ "me_time",
2361
+ "socialize",
2362
+ "admin_work",
2363
+ "deep_work",
2364
+ "meditate",
2365
+ "sleep",
2366
+ "socialize",
2367
+ "binge_watch",
2368
+ "learn",
2369
+ "binge_watch",
2370
+ "family_time",
2371
+ "admin_work",
2372
+ "deep_work",
2373
+ "socialize",
2374
+ "admin_work",
2375
+ "deep_work",
2376
+ "binge_watch",
2377
+ "socialize",
2378
+ "socialize",
2379
+ "learn",
2380
+ "meditate",
2381
+ "me_time",
2382
+ "family_time",
2383
+ "socialize",
2384
+ "exercise"
2385
+ ],
2386
+ "true_belief": [
2387
+ 0.458,
2388
+ 0.5,
2389
+ 0.315
2390
+ ],
2391
+ "final_belief": null,
2392
+ "belief_mae": null
2393
+ },
2394
+ {
2395
+ "condition": "continuous-in-distribution",
2396
+ "profile_name": "sampled_106",
2397
+ "profile_mode": "continuous",
2398
+ "strategy": "random",
2399
+ "seed": 106,
2400
+ "final_score": 0.527,
2401
+ "total_reward": -3.21,
2402
+ "adaptation": -0.268,
2403
+ "vitality": 0.24,
2404
+ "cognition": 0.0,
2405
+ "progress": 0.73,
2406
+ "serenity": 1.0,
2407
+ "connection": 0.92,
2408
+ "actions": [
2409
+ "deep_work",
2410
+ "meditate",
2411
+ "family_time",
2412
+ "meditate",
2413
+ "learn",
2414
+ "socialize",
2415
+ "socialize",
2416
+ "socialize",
2417
+ "learn",
2418
+ "learn",
2419
+ "exercise",
2420
+ "deep_work",
2421
+ "meditate",
2422
+ "family_time",
2423
+ "meditate",
2424
+ "admin_work",
2425
+ "family_time",
2426
+ "admin_work",
2427
+ "family_time",
2428
+ "me_time",
2429
+ "admin_work",
2430
+ "socialize",
2431
+ "sleep",
2432
+ "me_time",
2433
+ "deep_work",
2434
+ "me_time",
2435
+ "sleep",
2436
+ "learn"
2437
+ ],
2438
+ "true_belief": [
2439
+ 0.337,
2440
+ 0.5,
2441
+ 0.467
2442
+ ],
2443
+ "final_belief": null,
2444
+ "belief_mae": null
2445
+ },
2446
+ {
2447
+ "condition": "continuous-in-distribution",
2448
+ "profile_name": "sampled_107",
2449
+ "profile_mode": "continuous",
2450
+ "strategy": "random",
2451
+ "seed": 107,
2452
+ "final_score": 0.5088,
2453
+ "total_reward": -3.12,
2454
+ "adaptation": -0.481,
2455
+ "vitality": 0.03,
2456
+ "cognition": 0.0,
2457
+ "progress": 0.98,
2458
+ "serenity": 0.34,
2459
+ "connection": 0.36,
2460
+ "actions": [
2461
+ "admin_work",
2462
+ "admin_work",
2463
+ "learn",
2464
+ "meditate",
2465
+ "deep_work",
2466
+ "learn",
2467
+ "me_time",
2468
+ "sleep",
2469
+ "deep_work",
2470
+ "learn",
2471
+ "admin_work",
2472
+ "socialize",
2473
+ "deep_work",
2474
+ "socialize",
2475
+ "socialize",
2476
+ "exercise",
2477
+ "binge_watch",
2478
+ "binge_watch",
2479
+ "admin_work",
2480
+ "admin_work",
2481
+ "family_time",
2482
+ "sleep",
2483
+ "admin_work",
2484
+ "family_time",
2485
+ "deep_work",
2486
+ "deep_work",
2487
+ "me_time",
2488
+ "binge_watch"
2489
+ ],
2490
+ "true_belief": [
2491
+ 0.315,
2492
+ 0.376,
2493
+ 0.501
2494
+ ],
2495
+ "final_belief": null,
2496
+ "belief_mae": null
2497
+ },
2498
+ {
2499
+ "condition": "continuous-in-distribution",
2500
+ "profile_name": "sampled_108",
2501
+ "profile_mode": "continuous",
2502
+ "strategy": "random",
2503
+ "seed": 108,
2504
+ "final_score": 0.4869,
2505
+ "total_reward": 1.19,
2506
+ "adaptation": -0.472,
2507
+ "vitality": 0.0,
2508
+ "cognition": 0.0,
2509
+ "progress": 0.48,
2510
+ "serenity": 1.0,
2511
+ "connection": 1.0,
2512
+ "actions": [
2513
+ "learn",
2514
+ "sleep",
2515
+ "socialize",
2516
+ "binge_watch",
2517
+ "socialize",
2518
+ "me_time",
2519
+ "meditate",
2520
+ "learn",
2521
+ "meditate",
2522
+ "deep_work",
2523
+ "socialize",
2524
+ "family_time",
2525
+ "admin_work",
2526
+ "meditate",
2527
+ "family_time",
2528
+ "socialize",
2529
+ "binge_watch",
2530
+ "family_time",
2531
+ "binge_watch",
2532
+ "binge_watch",
2533
+ "learn",
2534
+ "socialize",
2535
+ "exercise",
2536
+ "learn",
2537
+ "family_time",
2538
+ "family_time",
2539
+ "deep_work",
2540
+ "socialize"
2541
+ ],
2542
+ "true_belief": [
2543
+ 0.741,
2544
+ 0.5,
2545
+ 0.143
2546
+ ],
2547
+ "final_belief": null,
2548
+ "belief_mae": null
2549
+ },
2550
+ {
2551
+ "condition": "continuous-in-distribution",
2552
+ "profile_name": "sampled_109",
2553
+ "profile_mode": "continuous",
2554
+ "strategy": "random",
2555
+ "seed": 109,
2556
+ "final_score": 0.5165,
2557
+ "total_reward": 2.09,
2558
+ "adaptation": -0.463,
2559
+ "vitality": 0.4,
2560
+ "cognition": 0.03,
2561
+ "progress": 0.76,
2562
+ "serenity": 0.83,
2563
+ "connection": 0.65,
2564
+ "actions": [
2565
+ "binge_watch",
2566
+ "exercise",
2567
+ "deep_work",
2568
+ "learn",
2569
+ "sleep",
2570
+ "sleep",
2571
+ "learn",
2572
+ "exercise",
2573
+ "deep_work",
2574
+ "deep_work",
2575
+ "me_time",
2576
+ "meditate",
2577
+ "socialize",
2578
+ "meditate",
2579
+ "socialize",
2580
+ "socialize",
2581
+ "sleep",
2582
+ "learn",
2583
+ "admin_work",
2584
+ "socialize",
2585
+ "socialize",
2586
+ "admin_work",
2587
+ "me_time",
2588
+ "exercise",
2589
+ "exercise",
2590
+ "exercise",
2591
+ "sleep",
2592
+ "binge_watch"
2593
+ ],
2594
+ "true_belief": [
2595
+ 0.46,
2596
+ 0.718,
2597
+ 0.273
2598
+ ],
2599
+ "final_belief": null,
2600
+ "belief_mae": null
2601
+ },
2602
+ {
2603
+ "condition": "continuous-OOD (generalization)",
2604
+ "profile_name": "sampled_10000",
2605
+ "profile_mode": "continuous",
2606
+ "strategy": "heuristic",
2607
+ "seed": 10000,
2608
+ "final_score": 0.5738,
2609
+ "total_reward": 7.69,
2610
+ "adaptation": -0.184,
2611
+ "vitality": 0.39,
2612
+ "cognition": 0.58,
2613
+ "progress": 1.0,
2614
+ "serenity": 1.0,
2615
+ "connection": 0.4,
2616
+ "actions": [
2617
+ "deep_work",
2618
+ "learn",
2619
+ "meditate",
2620
+ "sleep",
2621
+ "deep_work",
2622
+ "admin_work",
2623
+ "socialize",
2624
+ "sleep",
2625
+ "deep_work",
2626
+ "admin_work",
2627
+ "meditate",
2628
+ "sleep",
2629
+ "deep_work",
2630
+ "admin_work",
2631
+ "socialize",
2632
+ "sleep",
2633
+ "exercise",
2634
+ "admin_work",
2635
+ "socialize",
2636
+ "sleep",
2637
+ "deep_work",
2638
+ "admin_work",
2639
+ "socialize",
2640
+ "sleep",
2641
+ "exercise",
2642
+ "admin_work",
2643
+ "socialize",
2644
+ "sleep"
2645
+ ],
2646
+ "true_belief": [
2647
+ 0.304,
2648
+ 0.844,
2649
+ 0.388
2650
+ ],
2651
+ "final_belief": null,
2652
+ "belief_mae": null
2653
+ },
2654
+ {
2655
+ "condition": "continuous-OOD (generalization)",
2656
+ "profile_name": "sampled_10001",
2657
+ "profile_mode": "continuous",
2658
+ "strategy": "heuristic",
2659
+ "seed": 10001,
2660
+ "final_score": 0.5621,
2661
+ "total_reward": 3.59,
2662
+ "adaptation": -0.04,
2663
+ "vitality": 1.0,
2664
+ "cognition": 1.0,
2665
+ "progress": 1.0,
2666
+ "serenity": 0.93,
2667
+ "connection": 0.38,
2668
+ "actions": [
2669
+ "deep_work",
2670
+ "learn",
2671
+ "meditate",
2672
+ "sleep",
2673
+ "deep_work",
2674
+ "learn",
2675
+ "meditate",
2676
+ "sleep",
2677
+ "deep_work",
2678
+ "admin_work",
2679
+ "meditate",
2680
+ "sleep",
2681
+ "deep_work",
2682
+ "admin_work",
2683
+ "meditate",
2684
+ "sleep",
2685
+ "deep_work",
2686
+ "admin_work",
2687
+ "socialize",
2688
+ "sleep",
2689
+ "deep_work",
2690
+ "admin_work",
2691
+ "meditate",
2692
+ "sleep",
2693
+ "deep_work",
2694
+ "admin_work",
2695
+ "meditate",
2696
+ "sleep"
2697
+ ],
2698
+ "true_belief": [
2699
+ 0.272,
2700
+ 0.291,
2701
+ 0.323
2702
+ ],
2703
+ "final_belief": null,
2704
+ "belief_mae": null
2705
+ },
2706
+ {
2707
+ "condition": "continuous-OOD (generalization)",
2708
+ "profile_name": "sampled_10002",
2709
+ "profile_mode": "continuous",
2710
+ "strategy": "heuristic",
2711
+ "seed": 10002,
2712
+ "final_score": 0.6146,
2713
+ "total_reward": 7.08,
2714
+ "adaptation": 0.158,
2715
+ "vitality": 0.79,
2716
+ "cognition": 0.33,
2717
+ "progress": 1.0,
2718
+ "serenity": 1.0,
2719
+ "connection": 0.47,
2720
+ "actions": [
2721
+ "deep_work",
2722
+ "learn",
2723
+ "meditate",
2724
+ "sleep",
2725
+ "deep_work",
2726
+ "admin_work",
2727
+ "meditate",
2728
+ "sleep",
2729
+ "deep_work",
2730
+ "admin_work",
2731
+ "socialize",
2732
+ "sleep",
2733
+ "exercise",
2734
+ "admin_work",
2735
+ "meditate",
2736
+ "sleep",
2737
+ "deep_work",
2738
+ "admin_work",
2739
+ "socialize",
2740
+ "sleep",
2741
+ "exercise",
2742
+ "admin_work",
2743
+ "meditate",
2744
+ "sleep",
2745
+ "exercise",
2746
+ "admin_work",
2747
+ "socialize",
2748
+ "sleep"
2749
+ ],
2750
+ "true_belief": [
2751
+ 0.369,
2752
+ 0.899,
2753
+ 0.069
2754
+ ],
2755
+ "final_belief": null,
2756
+ "belief_mae": null
2757
+ },
2758
+ {
2759
+ "condition": "continuous-OOD (generalization)",
2760
+ "profile_name": "sampled_10003",
2761
+ "profile_mode": "continuous",
2762
+ "strategy": "heuristic",
2763
+ "seed": 10003,
2764
+ "final_score": 0.5761,
2765
+ "total_reward": 11.98,
2766
+ "adaptation": -0.053,
2767
+ "vitality": 1.0,
2768
+ "cognition": 0.43,
2769
+ "progress": 0.98,
2770
+ "serenity": 1.0,
2771
+ "connection": 0.41,
2772
+ "actions": [
2773
+ "deep_work",
2774
+ "learn",
2775
+ "meditate",
2776
+ "sleep",
2777
+ "deep_work",
2778
+ "learn",
2779
+ "meditate",
2780
+ "sleep",
2781
+ "deep_work",
2782
+ "admin_work",
2783
+ "socialize",
2784
+ "sleep",
2785
+ "deep_work",
2786
+ "admin_work",
2787
+ "meditate",
2788
+ "sleep",
2789
+ "deep_work",
2790
+ "admin_work",
2791
+ "meditate",
2792
+ "sleep",
2793
+ "exercise",
2794
+ "admin_work",
2795
+ "meditate",
2796
+ "sleep",
2797
+ "deep_work",
2798
+ "admin_work",
2799
+ "meditate",
2800
+ "sleep"
2801
+ ],
2802
+ "true_belief": [
2803
+ 0.388,
2804
+ 0.532,
2805
+ 0.55
2806
+ ],
2807
+ "final_belief": null,
2808
+ "belief_mae": null
2809
+ },
2810
+ {
2811
+ "condition": "continuous-OOD (generalization)",
2812
+ "profile_name": "sampled_10004",
2813
+ "profile_mode": "continuous",
2814
+ "strategy": "heuristic",
2815
+ "seed": 10004,
2816
+ "final_score": 0.5789,
2817
+ "total_reward": 5.48,
2818
+ "adaptation": -0.104,
2819
+ "vitality": 0.4,
2820
+ "cognition": 0.2,
2821
+ "progress": 1.0,
2822
+ "serenity": 1.0,
2823
+ "connection": 0.46,
2824
+ "actions": [
2825
+ "deep_work",
2826
+ "learn",
2827
+ "meditate",
2828
+ "sleep",
2829
+ "deep_work",
2830
+ "admin_work",
2831
+ "meditate",
2832
+ "sleep",
2833
+ "deep_work",
2834
+ "admin_work",
2835
+ "meditate",
2836
+ "sleep",
2837
+ "deep_work",
2838
+ "admin_work",
2839
+ "socialize",
2840
+ "sleep",
2841
+ "deep_work",
2842
+ "admin_work",
2843
+ "meditate",
2844
+ "sleep",
2845
+ "deep_work",
2846
+ "admin_work",
2847
+ "meditate",
2848
+ "sleep",
2849
+ "deep_work",
2850
+ "admin_work",
2851
+ "socialize",
2852
+ "sleep"
2853
+ ],
2854
+ "true_belief": [
2855
+ 0.586,
2856
+ 0.812,
2857
+ 0.436
2858
+ ],
2859
+ "final_belief": null,
2860
+ "belief_mae": null
2861
+ },
2862
+ {
2863
+ "condition": "continuous-OOD (generalization)",
2864
+ "profile_name": "sampled_10005",
2865
+ "profile_mode": "continuous",
2866
+ "strategy": "heuristic",
2867
+ "seed": 10005,
2868
+ "final_score": 0.5806,
2869
+ "total_reward": 9.41,
2870
+ "adaptation": 0.055,
2871
+ "vitality": 0.58,
2872
+ "cognition": 0.44,
2873
+ "progress": 0.95,
2874
+ "serenity": 0.97,
2875
+ "connection": 0.43,
2876
+ "actions": [
2877
+ "deep_work",
2878
+ "learn",
2879
+ "meditate",
2880
+ "sleep",
2881
+ "deep_work",
2882
+ "admin_work",
2883
+ "meditate",
2884
+ "sleep",
2885
+ "deep_work",
2886
+ "admin_work",
2887
+ "socialize",
2888
+ "sleep",
2889
+ "exercise",
2890
+ "admin_work",
2891
+ "socialize",
2892
+ "sleep",
2893
+ "exercise",
2894
+ "admin_work",
2895
+ "meditate",
2896
+ "sleep",
2897
+ "deep_work",
2898
+ "admin_work",
2899
+ "socialize",
2900
+ "sleep",
2901
+ "exercise",
2902
+ "admin_work",
2903
+ "meditate",
2904
+ "sleep"
2905
+ ],
2906
+ "true_belief": [
2907
+ 0.273,
2908
+ 0.5,
2909
+ 0.524
2910
+ ],
2911
+ "final_belief": null,
2912
+ "belief_mae": null
2913
+ },
2914
+ {
2915
+ "condition": "continuous-OOD (generalization)",
2916
+ "profile_name": "sampled_10006",
2917
+ "profile_mode": "continuous",
2918
+ "strategy": "heuristic",
2919
+ "seed": 10006,
2920
+ "final_score": 0.5823,
2921
+ "total_reward": 7.59,
2922
+ "adaptation": -0.385,
2923
+ "vitality": 0.32,
2924
+ "cognition": 0.13,
2925
+ "progress": 1.0,
2926
+ "serenity": 0.96,
2927
+ "connection": 0.47,
2928
+ "actions": [
2929
+ "deep_work",
2930
+ "learn",
2931
+ "meditate",
2932
+ "sleep",
2933
+ "deep_work",
2934
+ "admin_work",
2935
+ "meditate",
2936
+ "sleep",
2937
+ "deep_work",
2938
+ "admin_work",
2939
+ "meditate",
2940
+ "sleep",
2941
+ "deep_work",
2942
+ "admin_work",
2943
+ "meditate",
2944
+ "sleep",
2945
+ "deep_work",
2946
+ "admin_work",
2947
+ "meditate",
2948
+ "sleep",
2949
+ "deep_work",
2950
+ "admin_work",
2951
+ "meditate",
2952
+ "sleep",
2953
+ "deep_work",
2954
+ "admin_work",
2955
+ "socialize",
2956
+ "sleep"
2957
+ ],
2958
+ "true_belief": [
2959
+ 0.546,
2960
+ 0.798,
2961
+ 0.413
2962
+ ],
2963
+ "final_belief": null,
2964
+ "belief_mae": null
2965
+ },
2966
+ {
2967
+ "condition": "continuous-OOD (generalization)",
2968
+ "profile_name": "sampled_10007",
2969
+ "profile_mode": "continuous",
2970
+ "strategy": "heuristic",
2971
+ "seed": 10007,
2972
+ "final_score": 0.5894,
2973
+ "total_reward": 6.29,
2974
+ "adaptation": -0.086,
2975
+ "vitality": 0.67,
2976
+ "cognition": 0.78,
2977
+ "progress": 1.0,
2978
+ "serenity": 1.0,
2979
+ "connection": 0.52,
2980
+ "actions": [
2981
+ "deep_work",
2982
+ "learn",
2983
+ "meditate",
2984
+ "sleep",
2985
+ "deep_work",
2986
+ "admin_work",
2987
+ "meditate",
2988
+ "sleep",
2989
+ "deep_work",
2990
+ "admin_work",
2991
+ "socialize",
2992
+ "sleep",
2993
+ "deep_work",
2994
+ "admin_work",
2995
+ "meditate",
2996
+ "sleep",
2997
+ "deep_work",
2998
+ "admin_work",
2999
+ "meditate",
3000
+ "sleep",
3001
+ "deep_work",
3002
+ "admin_work",
3003
+ "meditate",
3004
+ "sleep",
3005
+ "deep_work",
3006
+ "admin_work",
3007
+ "socialize",
3008
+ "sleep"
3009
+ ],
3010
+ "true_belief": [
3011
+ 0.567,
3012
+ 0.5,
3013
+ 0.172
3014
+ ],
3015
+ "final_belief": null,
3016
+ "belief_mae": null
3017
+ },
3018
+ {
3019
+ "condition": "continuous-OOD (generalization)",
3020
+ "profile_name": "sampled_10008",
3021
+ "profile_mode": "continuous",
3022
+ "strategy": "heuristic",
3023
+ "seed": 10008,
3024
+ "final_score": 0.5752,
3025
+ "total_reward": 7.39,
3026
+ "adaptation": -0.164,
3027
+ "vitality": 1.0,
3028
+ "cognition": 0.44,
3029
+ "progress": 1.0,
3030
+ "serenity": 1.0,
3031
+ "connection": 0.41,
3032
+ "actions": [
3033
+ "deep_work",
3034
+ "learn",
3035
+ "meditate",
3036
+ "sleep",
3037
+ "deep_work",
3038
+ "admin_work",
3039
+ "socialize",
3040
+ "sleep",
3041
+ "deep_work",
3042
+ "admin_work",
3043
+ "meditate",
3044
+ "sleep",
3045
+ "deep_work",
3046
+ "admin_work",
3047
+ "meditate",
3048
+ "sleep",
3049
+ "deep_work",
3050
+ "admin_work",
3051
+ "socialize",
3052
+ "sleep",
3053
+ "exercise",
3054
+ "admin_work",
3055
+ "meditate",
3056
+ "sleep",
3057
+ "deep_work",
3058
+ "admin_work",
3059
+ "meditate",
3060
+ "sleep"
3061
+ ],
3062
+ "true_belief": [
3063
+ 0.844,
3064
+ 0.307,
3065
+ 0.464
3066
+ ],
3067
+ "final_belief": null,
3068
+ "belief_mae": null
3069
+ },
3070
+ {
3071
+ "condition": "continuous-OOD (generalization)",
3072
+ "profile_name": "sampled_10009",
3073
+ "profile_mode": "continuous",
3074
+ "strategy": "heuristic",
3075
+ "seed": 10009,
3076
+ "final_score": 0.5636,
3077
+ "total_reward": 4.99,
3078
+ "adaptation": -0.023,
3079
+ "vitality": 0.89,
3080
+ "cognition": 0.43,
3081
+ "progress": 0.98,
3082
+ "serenity": 1.0,
3083
+ "connection": 0.4,
3084
+ "actions": [
3085
+ "deep_work",
3086
+ "learn",
3087
+ "meditate",
3088
+ "sleep",
3089
+ "deep_work",
3090
+ "admin_work",
3091
+ "meditate",
3092
+ "sleep",
3093
+ "deep_work",
3094
+ "admin_work",
3095
+ "socialize",
3096
+ "sleep",
3097
+ "deep_work",
3098
+ "admin_work",
3099
+ "meditate",
3100
+ "sleep",
3101
+ "deep_work",
3102
+ "admin_work",
3103
+ "meditate",
3104
+ "sleep",
3105
+ "deep_work",
3106
+ "admin_work",
3107
+ "meditate",
3108
+ "sleep",
3109
+ "exercise",
3110
+ "admin_work",
3111
+ "meditate",
3112
+ "sleep"
3113
+ ],
3114
+ "true_belief": [
3115
+ 0.333,
3116
+ 0.383,
3117
+ 0.498
3118
+ ],
3119
+ "final_belief": null,
3120
+ "belief_mae": null
3121
+ },
3122
+ {
3123
+ "condition": "continuous-OOD (generalization)",
3124
+ "profile_name": "sampled_10000",
3125
+ "profile_mode": "continuous",
3126
+ "strategy": "random",
3127
+ "seed": 10000,
3128
+ "final_score": 0.5259,
3129
+ "total_reward": 3.27,
3130
+ "adaptation": -0.098,
3131
+ "vitality": 0.3,
3132
+ "cognition": 0.29,
3133
+ "progress": 1.0,
3134
+ "serenity": 1.0,
3135
+ "connection": 0.23,
3136
+ "actions": [
3137
+ "binge_watch",
3138
+ "binge_watch",
3139
+ "deep_work",
3140
+ "learn",
3141
+ "meditate",
3142
+ "deep_work",
3143
+ "socialize",
3144
+ "sleep",
3145
+ "meditate",
3146
+ "admin_work",
3147
+ "admin_work",
3148
+ "deep_work",
3149
+ "me_time",
3150
+ "binge_watch",
3151
+ "me_time",
3152
+ "exercise",
3153
+ "sleep",
3154
+ "exercise",
3155
+ "binge_watch",
3156
+ "meditate",
3157
+ "learn",
3158
+ "sleep",
3159
+ "exercise",
3160
+ "socialize",
3161
+ "socialize",
3162
+ "sleep",
3163
+ "family_time",
3164
+ "deep_work"
3165
+ ],
3166
+ "true_belief": [
3167
+ 0.304,
3168
+ 0.844,
3169
+ 0.388
3170
+ ],
3171
+ "final_belief": null,
3172
+ "belief_mae": null
3173
+ },
3174
+ {
3175
+ "condition": "continuous-OOD (generalization)",
3176
+ "profile_name": "sampled_10001",
3177
+ "profile_mode": "continuous",
3178
+ "strategy": "random",
3179
+ "seed": 10001,
3180
+ "final_score": 0.5465,
3181
+ "total_reward": 3.05,
3182
+ "adaptation": -0.102,
3183
+ "vitality": 0.87,
3184
+ "cognition": 0.2,
3185
+ "progress": 0.95,
3186
+ "serenity": 0.98,
3187
+ "connection": 0.39,
3188
+ "actions": [
3189
+ "me_time",
3190
+ "me_time",
3191
+ "exercise",
3192
+ "learn",
3193
+ "socialize",
3194
+ "family_time",
3195
+ "family_time",
3196
+ "me_time",
3197
+ "binge_watch",
3198
+ "binge_watch",
3199
+ "binge_watch",
3200
+ "learn",
3201
+ "sleep",
3202
+ "deep_work",
3203
+ "sleep",
3204
+ "binge_watch",
3205
+ "meditate",
3206
+ "admin_work",
3207
+ "me_time",
3208
+ "admin_work",
3209
+ "sleep",
3210
+ "deep_work",
3211
+ "meditate",
3212
+ "deep_work",
3213
+ "exercise",
3214
+ "admin_work",
3215
+ "binge_watch",
3216
+ "exercise"
3217
+ ],
3218
+ "true_belief": [
3219
+ 0.272,
3220
+ 0.291,
3221
+ 0.323
3222
+ ],
3223
+ "final_belief": null,
3224
+ "belief_mae": null
3225
+ },
3226
+ {
3227
+ "condition": "continuous-OOD (generalization)",
3228
+ "profile_name": "sampled_10002",
3229
+ "profile_mode": "continuous",
3230
+ "strategy": "random",
3231
+ "seed": 10002,
3232
+ "final_score": 0.4923,
3233
+ "total_reward": -2.3,
3234
+ "adaptation": -0.662,
3235
+ "vitality": 0.15,
3236
+ "cognition": 0.02,
3237
+ "progress": 0.87,
3238
+ "serenity": 1.0,
3239
+ "connection": 0.43,
3240
+ "actions": [
3241
+ "socialize",
3242
+ "exercise",
3243
+ "deep_work",
3244
+ "learn",
3245
+ "exercise",
3246
+ "exercise",
3247
+ "meditate",
3248
+ "admin_work",
3249
+ "meditate",
3250
+ "exercise",
3251
+ "admin_work",
3252
+ "exercise",
3253
+ "binge_watch",
3254
+ "family_time",
3255
+ "admin_work",
3256
+ "deep_work",
3257
+ "deep_work",
3258
+ "socialize",
3259
+ "socialize",
3260
+ "meditate",
3261
+ "deep_work",
3262
+ "me_time",
3263
+ "admin_work",
3264
+ "me_time",
3265
+ "deep_work",
3266
+ "learn",
3267
+ "me_time",
3268
+ "exercise"
3269
+ ],
3270
+ "true_belief": [
3271
+ 0.369,
3272
+ 0.899,
3273
+ 0.069
3274
+ ],
3275
+ "final_belief": null,
3276
+ "belief_mae": null
3277
+ },
3278
+ {
3279
+ "condition": "continuous-OOD (generalization)",
3280
+ "profile_name": "sampled_10003",
3281
+ "profile_mode": "continuous",
3282
+ "strategy": "random",
3283
+ "seed": 10003,
3284
+ "final_score": 0.3943,
3285
+ "total_reward": -7.11,
3286
+ "adaptation": -0.261,
3287
+ "vitality": 0.13,
3288
+ "cognition": 0.0,
3289
+ "progress": 0.25,
3290
+ "serenity": 0.86,
3291
+ "connection": 0.95,
3292
+ "actions": [
3293
+ "socialize",
3294
+ "family_time",
3295
+ "socialize",
3296
+ "deep_work",
3297
+ "socialize",
3298
+ "socialize",
3299
+ "me_time",
3300
+ "admin_work",
3301
+ "deep_work",
3302
+ "family_time",
3303
+ "admin_work",
3304
+ "family_time",
3305
+ "me_time",
3306
+ "sleep",
3307
+ "family_time",
3308
+ "family_time",
3309
+ "family_time",
3310
+ "learn",
3311
+ "exercise",
3312
+ "binge_watch",
3313
+ "me_time",
3314
+ "family_time",
3315
+ "binge_watch",
3316
+ "family_time",
3317
+ "learn",
3318
+ "binge_watch",
3319
+ "sleep",
3320
+ "learn"
3321
+ ],
3322
+ "true_belief": [
3323
+ 0.388,
3324
+ 0.532,
3325
+ 0.55
3326
+ ],
3327
+ "final_belief": null,
3328
+ "belief_mae": null
3329
+ },
3330
+ {
3331
+ "condition": "continuous-OOD (generalization)",
3332
+ "profile_name": "sampled_10004",
3333
+ "profile_mode": "continuous",
3334
+ "strategy": "random",
3335
+ "seed": 10004,
3336
+ "final_score": 0.527,
3337
+ "total_reward": 6.44,
3338
+ "adaptation": -0.077,
3339
+ "vitality": 0.84,
3340
+ "cognition": 0.93,
3341
+ "progress": 0.59,
3342
+ "serenity": 1.0,
3343
+ "connection": 0.83,
3344
+ "actions": [
3345
+ "meditate",
3346
+ "sleep",
3347
+ "me_time",
3348
+ "exercise",
3349
+ "me_time",
3350
+ "deep_work",
3351
+ "socialize",
3352
+ "me_time",
3353
+ "me_time",
3354
+ "family_time",
3355
+ "sleep",
3356
+ "admin_work",
3357
+ "exercise",
3358
+ "admin_work",
3359
+ "family_time",
3360
+ "me_time",
3361
+ "sleep",
3362
+ "meditate",
3363
+ "exercise",
3364
+ "admin_work",
3365
+ "me_time",
3366
+ "admin_work",
3367
+ "exercise",
3368
+ "family_time",
3369
+ "sleep",
3370
+ "meditate",
3371
+ "deep_work",
3372
+ "meditate"
3373
+ ],
3374
+ "true_belief": [
3375
+ 0.586,
3376
+ 0.812,
3377
+ 0.436
3378
+ ],
3379
+ "final_belief": null,
3380
+ "belief_mae": null
3381
+ },
3382
+ {
3383
+ "condition": "continuous-OOD (generalization)",
3384
+ "profile_name": "sampled_10005",
3385
+ "profile_mode": "continuous",
3386
+ "strategy": "random",
3387
+ "seed": 10005,
3388
+ "final_score": 0.4318,
3389
+ "total_reward": -4.79,
3390
+ "adaptation": -0.17,
3391
+ "vitality": 0.03,
3392
+ "cognition": 0.02,
3393
+ "progress": 0.65,
3394
+ "serenity": 1.0,
3395
+ "connection": 0.44,
3396
+ "actions": [
3397
+ "sleep",
3398
+ "family_time",
3399
+ "meditate",
3400
+ "family_time",
3401
+ "meditate",
3402
+ "binge_watch",
3403
+ "admin_work",
3404
+ "deep_work",
3405
+ "learn",
3406
+ "socialize",
3407
+ "exercise",
3408
+ "admin_work",
3409
+ "family_time",
3410
+ "deep_work",
3411
+ "me_time",
3412
+ "meditate",
3413
+ "exercise",
3414
+ "me_time",
3415
+ "deep_work",
3416
+ "exercise",
3417
+ "me_time",
3418
+ "deep_work",
3419
+ "admin_work",
3420
+ "binge_watch",
3421
+ "binge_watch",
3422
+ "binge_watch",
3423
+ "deep_work",
3424
+ "meditate"
3425
+ ],
3426
+ "true_belief": [
3427
+ 0.273,
3428
+ 0.5,
3429
+ 0.524
3430
+ ],
3431
+ "final_belief": null,
3432
+ "belief_mae": null
3433
+ },
3434
+ {
3435
+ "condition": "continuous-OOD (generalization)",
3436
+ "profile_name": "sampled_10006",
3437
+ "profile_mode": "continuous",
3438
+ "strategy": "random",
3439
+ "seed": 10006,
3440
+ "final_score": 0.4614,
3441
+ "total_reward": -5.54,
3442
+ "adaptation": -0.802,
3443
+ "vitality": 0.04,
3444
+ "cognition": 0.0,
3445
+ "progress": 0.6,
3446
+ "serenity": 0.26,
3447
+ "connection": 0.72,
3448
+ "actions": [
3449
+ "me_time",
3450
+ "socialize",
3451
+ "me_time",
3452
+ "learn",
3453
+ "learn",
3454
+ "sleep",
3455
+ "deep_work",
3456
+ "family_time",
3457
+ "binge_watch",
3458
+ "me_time",
3459
+ "deep_work",
3460
+ "admin_work",
3461
+ "exercise",
3462
+ "learn",
3463
+ "meditate",
3464
+ "binge_watch",
3465
+ "binge_watch",
3466
+ "socialize",
3467
+ "deep_work",
3468
+ "deep_work",
3469
+ "socialize",
3470
+ "deep_work",
3471
+ "binge_watch",
3472
+ "binge_watch",
3473
+ "family_time",
3474
+ "binge_watch",
3475
+ "exercise",
3476
+ "socialize"
3477
+ ],
3478
+ "true_belief": [
3479
+ 0.546,
3480
+ 0.798,
3481
+ 0.413
3482
+ ],
3483
+ "final_belief": null,
3484
+ "belief_mae": null
3485
+ },
3486
+ {
3487
+ "condition": "continuous-OOD (generalization)",
3488
+ "profile_name": "sampled_10007",
3489
+ "profile_mode": "continuous",
3490
+ "strategy": "random",
3491
+ "seed": 10007,
3492
+ "final_score": 0.6107,
3493
+ "total_reward": 3.38,
3494
+ "adaptation": 0.0,
3495
+ "vitality": 0.01,
3496
+ "cognition": 0.09,
3497
+ "progress": 0.87,
3498
+ "serenity": 1.0,
3499
+ "connection": 1.0,
3500
+ "actions": [
3501
+ "family_time",
3502
+ "me_time",
3503
+ "binge_watch",
3504
+ "me_time",
3505
+ "socialize",
3506
+ "sleep",
3507
+ "family_time",
3508
+ "admin_work",
3509
+ "exercise",
3510
+ "learn",
3511
+ "deep_work",
3512
+ "me_time",
3513
+ "binge_watch",
3514
+ "admin_work",
3515
+ "admin_work",
3516
+ "learn",
3517
+ "me_time",
3518
+ "learn",
3519
+ "exercise",
3520
+ "me_time",
3521
+ "family_time",
3522
+ "family_time",
3523
+ "sleep",
3524
+ "me_time",
3525
+ "deep_work",
3526
+ "family_time",
3527
+ "family_time",
3528
+ "socialize"
3529
+ ],
3530
+ "true_belief": [
3531
+ 0.567,
3532
+ 0.5,
3533
+ 0.172
3534
+ ],
3535
+ "final_belief": null,
3536
+ "belief_mae": null
3537
+ },
3538
+ {
3539
+ "condition": "continuous-OOD (generalization)",
3540
+ "profile_name": "sampled_10008",
3541
+ "profile_mode": "continuous",
3542
+ "strategy": "random",
3543
+ "seed": 10008,
3544
+ "final_score": 0.5869,
3545
+ "total_reward": 6.77,
3546
+ "adaptation": 0.138,
3547
+ "vitality": 0.54,
3548
+ "cognition": 0.16,
3549
+ "progress": 0.7,
3550
+ "serenity": 1.0,
3551
+ "connection": 0.85,
3552
+ "actions": [
3553
+ "exercise",
3554
+ "exercise",
3555
+ "exercise",
3556
+ "learn",
3557
+ "admin_work",
3558
+ "learn",
3559
+ "deep_work",
3560
+ "me_time",
3561
+ "binge_watch",
3562
+ "exercise",
3563
+ "socialize",
3564
+ "meditate",
3565
+ "socialize",
3566
+ "exercise",
3567
+ "meditate",
3568
+ "meditate",
3569
+ "exercise",
3570
+ "socialize",
3571
+ "learn",
3572
+ "socialize",
3573
+ "meditate",
3574
+ "binge_watch",
3575
+ "sleep",
3576
+ "meditate",
3577
+ "deep_work",
3578
+ "family_time",
3579
+ "deep_work",
3580
+ "exercise"
3581
+ ],
3582
+ "true_belief": [
3583
+ 0.844,
3584
+ 0.307,
3585
+ 0.464
3586
+ ],
3587
+ "final_belief": null,
3588
+ "belief_mae": null
3589
+ },
3590
+ {
3591
+ "condition": "continuous-OOD (generalization)",
3592
+ "profile_name": "sampled_10009",
3593
+ "profile_mode": "continuous",
3594
+ "strategy": "random",
3595
+ "seed": 10009,
3596
+ "final_score": 0.5031,
3597
+ "total_reward": 0.74,
3598
+ "adaptation": -0.142,
3599
+ "vitality": 0.06,
3600
+ "cognition": 0.26,
3601
+ "progress": 0.53,
3602
+ "serenity": 1.0,
3603
+ "connection": 0.9,
3604
+ "actions": [
3605
+ "sleep",
3606
+ "socialize",
3607
+ "sleep",
3608
+ "admin_work",
3609
+ "me_time",
3610
+ "learn",
3611
+ "exercise",
3612
+ "deep_work",
3613
+ "learn",
3614
+ "exercise",
3615
+ "socialize",
3616
+ "me_time",
3617
+ "learn",
3618
+ "family_time",
3619
+ "sleep",
3620
+ "sleep",
3621
+ "socialize",
3622
+ "deep_work",
3623
+ "socialize",
3624
+ "exercise",
3625
+ "me_time",
3626
+ "me_time",
3627
+ "exercise",
3628
+ "family_time",
3629
+ "binge_watch",
3630
+ "admin_work",
3631
+ "me_time",
3632
+ "me_time"
3633
+ ],
3634
+ "true_belief": [
3635
+ 0.333,
3636
+ 0.383,
3637
+ 0.498
3638
+ ],
3639
+ "final_belief": null,
3640
+ "belief_mae": null
3641
+ }
3642
+ ]
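
Each record above is one evaluation episode with a shared schema: condition, strategy, seed, final_score, adaptation, per-meter scores, the action trace, and the belief fields. A minimal sketch of rolling these records up per (condition, strategy) to recover aggregate baselines such as the heuristic in-dist 0.587 / OOD 0.580 bars; standard library only, and the relative file path is an assumption:

```python
import json
from collections import defaultdict
from statistics import mean

# One episode per record; group by evaluation condition and baseline strategy.
with open("eval_baselines_meta.json") as f:
    episodes = json.load(f)

groups = defaultdict(list)
for ep in episodes:
    groups[(ep["condition"], ep["strategy"])].append(ep)

for (condition, strategy), eps in sorted(groups.items()):
    score = mean(ep["final_score"] for ep in eps)
    adapt = mean(ep["adaptation"] for ep in eps)
    # belief_mae is null for the scripted baselines, so only average it when present
    maes = [ep["belief_mae"] for ep in eps if ep["belief_mae"] is not None]
    mae = f"{mean(maes):.3f}" if maes else "n/a"
    print(f"{condition:35s} {strategy:9s} score={score:.3f} "
          f"adaptation={adapt:+.3f} belief_mae={mae} n={len(eps)}")
```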
eval_results_v1.json ADDED
@@ -0,0 +1,758 @@
+ [
+   {"profile": "introvert_morning", "strategy": "heuristic", "seed": 0, "final_score": 0.7696, "total_reward": 5.84, "vitality": 0.76, "cognition": 0.41, "progress": 1.0, "serenity": 1.0, "connection": 0.43, "actions": ["deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "heuristic", "seed": 1, "final_score": 0.7526, "total_reward": 6.88, "vitality": 0.64, "cognition": 0.33, "progress": 1.0, "serenity": 0.87, "connection": 0.43, "actions": ["deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "heuristic", "seed": 2, "final_score": 0.7723, "total_reward": 8.97, "vitality": 0.84, "cognition": 0.44, "progress": 1.0, "serenity": 1.0, "connection": 0.39, "actions": ["deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "exercise", "admin_work", "meditate", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "heuristic", "seed": 0, "final_score": 0.8197, "total_reward": 4.21, "vitality": 1.0, "cognition": 1.0, "progress": 1.0, "serenity": 1.0, "connection": 0.43, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "heuristic", "seed": 1, "final_score": 0.8209, "total_reward": 4.59, "vitality": 1.0, "cognition": 1.0, "progress": 0.97, "serenity": 1.0, "connection": 0.46, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "heuristic", "seed": 2, "final_score": 0.8164, "total_reward": 5.85, "vitality": 0.97, "cognition": 0.82, "progress": 0.93, "serenity": 1.0, "connection": 0.53, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "workaholic_stoic", "strategy": "heuristic", "seed": 0, "final_score": 0.7461, "total_reward": 11.98, "vitality": 0.59, "cognition": 0.23, "progress": 1.0, "serenity": 0.95, "connection": 0.41, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep"]},
+   {"profile": "workaholic_stoic", "strategy": "heuristic", "seed": 1, "final_score": 0.7585, "total_reward": 12.67, "vitality": 0.72, "cognition": 0.32, "progress": 1.0, "serenity": 0.95, "connection": 0.38, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep"]},
+   {"profile": "workaholic_stoic", "strategy": "heuristic", "seed": 2, "final_score": 0.7782, "total_reward": 13.16, "vitality": 0.74, "cognition": 0.34, "progress": 1.0, "serenity": 0.95, "connection": 0.44, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "random", "seed": 0, "final_score": 0.7141, "total_reward": 1.82, "vitality": 0.45, "cognition": 0.56, "progress": 0.66, "serenity": 1.0, "connection": 0.71, "actions": ["socialize", "me_time", "binge_watch", "socialize", "exercise", "family_time", "sleep", "admin_work", "meditate", "binge_watch", "admin_work", "deep_work", "sleep", "meditate", "sleep", "family_time", "exercise", "deep_work", "admin_work", "meditate", "socialize", "binge_watch", "exercise", "meditate", "learn", "socialize", "admin_work", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "random", "seed": 1, "final_score": 0.6924, "total_reward": 4.02, "vitality": 0.08, "cognition": 0.5, "progress": 0.76, "serenity": 0.97, "connection": 0.68, "actions": ["exercise", "meditate", "sleep", "meditate", "meditate", "me_time", "meditate", "learn", "meditate", "meditate", "socialize", "socialize", "deep_work", "meditate", "socialize", "deep_work", "meditate", "sleep", "learn", "socialize", "deep_work", "socialize", "learn", "sleep", "family_time", "meditate", "meditate", "admin_work"]},
+   {"profile": "introvert_morning", "strategy": "random", "seed": 2, "final_score": 0.6715, "total_reward": 6.12, "vitality": 0.61, "cognition": 0.22, "progress": 1.0, "serenity": 0.86, "connection": 0.17, "actions": ["me_time", "meditate", "learn", "meditate", "learn", "family_time", "deep_work", "family_time", "me_time", "admin_work", "sleep", "meditate", "sleep", "admin_work", "meditate", "me_time", "sleep", "sleep", "binge_watch", "admin_work", "deep_work", "admin_work", "admin_work", "binge_watch", "learn", "sleep", "me_time", "deep_work"]},
+   {"profile": "extrovert_night_owl", "strategy": "random", "seed": 0, "final_score": 0.9368, "total_reward": 8.5, "vitality": 1.0, "cognition": 0.83, "progress": 1.0, "serenity": 1.0, "connection": 0.98, "actions": ["socialize", "me_time", "binge_watch", "socialize", "exercise", "family_time", "sleep", "admin_work", "meditate", "binge_watch", "admin_work", "deep_work", "sleep", "meditate", "sleep", "family_time", "exercise", "deep_work", "admin_work", "meditate", "socialize", "binge_watch", "exercise", "meditate", "learn", "socialize", "admin_work", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "random", "seed": 1, "final_score": 0.9054, "total_reward": 8.2, "vitality": 0.75, "cognition": 0.69, "progress": 1.0, "serenity": 0.97, "connection": 0.97, "actions": ["exercise", "meditate", "sleep", "meditate", "meditate",
565
+ "me_time",
566
+ "meditate",
567
+ "learn",
568
+ "meditate",
569
+ "meditate",
570
+ "socialize",
571
+ "socialize",
572
+ "deep_work",
573
+ "meditate",
574
+ "socialize",
575
+ "deep_work",
576
+ "meditate",
577
+ "sleep",
578
+ "learn",
579
+ "socialize",
580
+ "deep_work",
581
+ "socialize",
582
+ "learn",
583
+ "sleep",
584
+ "family_time",
585
+ "meditate",
586
+ "meditate",
587
+ "admin_work"
588
+ ]
589
+ },
590
+ {
591
+ "profile": "extrovert_night_owl",
592
+ "strategy": "random",
593
+ "seed": 2,
594
+ "final_score": 0.7462,
595
+ "total_reward": 4.12,
596
+ "vitality": 0.75,
597
+ "cognition": 0.32,
598
+ "progress": 1.0,
599
+ "serenity": 0.95,
600
+ "connection": 0.4,
601
+ "actions": [
602
+ "me_time",
603
+ "meditate",
604
+ "learn",
605
+ "meditate",
606
+ "learn",
607
+ "family_time",
608
+ "deep_work",
609
+ "family_time",
610
+ "me_time",
611
+ "admin_work",
612
+ "sleep",
613
+ "meditate",
614
+ "sleep",
615
+ "admin_work",
616
+ "meditate",
617
+ "me_time",
618
+ "sleep",
619
+ "sleep",
620
+ "binge_watch",
621
+ "admin_work",
622
+ "deep_work",
623
+ "admin_work",
624
+ "admin_work",
625
+ "binge_watch",
626
+ "learn",
627
+ "sleep",
628
+ "me_time",
629
+ "deep_work"
630
+ ]
631
+ },
632
+ {
633
+ "profile": "workaholic_stoic",
634
+ "strategy": "random",
635
+ "seed": 0,
636
+ "final_score": 0.6185,
637
+ "total_reward": 4.07,
638
+ "vitality": 0.4,
639
+ "cognition": 0.52,
640
+ "progress": 0.55,
641
+ "serenity": 0.95,
642
+ "connection": 0.41,
643
+ "actions": [
644
+ "socialize",
645
+ "me_time",
646
+ "binge_watch",
647
+ "socialize",
648
+ "exercise",
649
+ "family_time",
650
+ "sleep",
651
+ "admin_work",
652
+ "meditate",
653
+ "binge_watch",
654
+ "admin_work",
655
+ "deep_work",
656
+ "sleep",
657
+ "meditate",
658
+ "sleep",
659
+ "family_time",
660
+ "exercise",
661
+ "deep_work",
662
+ "admin_work",
663
+ "meditate",
664
+ "socialize",
665
+ "binge_watch",
666
+ "exercise",
667
+ "meditate",
668
+ "learn",
669
+ "socialize",
670
+ "admin_work",
671
+ "sleep"
672
+ ]
673
+ },
674
+ {
675
+ "profile": "workaholic_stoic",
676
+ "strategy": "random",
677
+ "seed": 1,
678
+ "final_score": 0.6094,
679
+ "total_reward": 5.39,
680
+ "vitality": 0.04,
681
+ "cognition": 0.55,
682
+ "progress": 0.6,
683
+ "serenity": 1.0,
684
+ "connection": 0.44,
685
+ "actions": [
686
+ "exercise",
687
+ "meditate",
688
+ "sleep",
689
+ "meditate",
690
+ "meditate",
691
+ "me_time",
692
+ "meditate",
693
+ "learn",
694
+ "meditate",
695
+ "meditate",
696
+ "socialize",
697
+ "socialize",
698
+ "deep_work",
699
+ "meditate",
700
+ "socialize",
701
+ "deep_work",
702
+ "meditate",
703
+ "sleep",
704
+ "learn",
705
+ "socialize",
706
+ "deep_work",
707
+ "socialize",
708
+ "learn",
709
+ "sleep",
710
+ "family_time",
711
+ "meditate",
712
+ "meditate",
713
+ "admin_work"
714
+ ]
715
+ },
716
+ {
717
+ "profile": "workaholic_stoic",
718
+ "strategy": "random",
719
+ "seed": 2,
720
+ "final_score": 0.5782,
721
+ "total_reward": 7.33,
722
+ "vitality": 0.4,
723
+ "cognition": 0.23,
724
+ "progress": 0.88,
725
+ "serenity": 0.99,
726
+ "connection": 0.0,
727
+ "actions": [
728
+ "me_time",
729
+ "meditate",
730
+ "learn",
731
+ "meditate",
732
+ "learn",
733
+ "family_time",
734
+ "deep_work",
735
+ "family_time",
736
+ "me_time",
737
+ "admin_work",
738
+ "sleep",
739
+ "meditate",
740
+ "sleep",
741
+ "admin_work",
742
+ "meditate",
743
+ "me_time",
744
+ "sleep",
745
+ "sleep",
746
+ "binge_watch",
747
+ "admin_work",
748
+ "deep_work",
749
+ "admin_work",
750
+ "admin_work",
751
+ "binge_watch",
752
+ "learn",
753
+ "sleep",
754
+ "me_time",
755
+ "deep_work"
756
+ ]
757
+ }
758
+ ]
inference.py CHANGED
@@ -74,30 +74,32 @@ SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
74
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
75
 
76
  SYSTEM_PROMPT = textwrap.dedent("""\
77
- You are a life management agent. You balance 5 life meters over a 7-day week.
78
- Each day has 4 slots: Morning (0), Afternoon (1), Evening (2), Night (3).
79
- Total: 28 steps.
80
-
81
- Meters (0.0 to 1.0):
82
- - Vitality: physical energy, sleep quality
83
- - Cognition: mental clarity, focus
84
- - Progress: career/skill advancement
85
- - Serenity: inner peace, stress management
86
- - Connection: relationship health
87
-
88
- Available actions (respond with EXACTLY the action name):
89
  DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
90
  FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH
91
 
92
- Key rules:
93
- - Low Vitality reduces effectiveness of all positive actions.
94
- - If ANY meter drops below 0.1, you get a heavy penalty.
95
- - Connection decays passively if you don't socialize.
96
- - Different episodes may have different hidden dynamics — adapt!
97
- - Time of day matters: cognition is generally best in the morning.
98
- - Balance all meters — neglecting any one will hurt your score.
99
 
100
- Respond with ONLY the action name, nothing else.""")
101
 
102
 
103
  # ---------------------------------------------------------------------------
@@ -183,11 +185,22 @@ def choose_action_heuristic(obs) -> RhythmAction:
183
 
184
 
185
  def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
186
- """Use LLM to pick an action, fall back to heuristic on failure."""
187
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
188
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
189
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
190
 
191
  user_prompt = textwrap.dedent(f"""\
192
  Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
193
  Remaining steps: {obs.remaining_steps}
@@ -197,9 +210,9 @@ Meters:
197
  Cognition: {obs.cognition:.2f}
198
  Progress: {obs.progress:.2f}
199
  Serenity: {obs.serenity:.2f}
200
- Connection: {obs.connection:.2f}{event_str}
201
 
202
- Choose your action:""")
203
 
204
  try:
205
  completion = llm_client.chat.completions.create(
@@ -209,7 +222,7 @@ Choose your action:""")
209
  {"role": "user", "content": user_prompt},
210
  ],
211
  temperature=0.3,
212
- max_tokens=30,
213
  stream=False,
214
  )
215
  text = (completion.choices[0].message.content or "").strip()
@@ -219,20 +232,28 @@ Choose your action:""")
219
 
220
 
221
  def parse_llm_action(text: str) -> RhythmAction:
222
- """Parse LLM response text into a RhythmAction."""
223
- text = text.strip().upper().replace(" ", "_")
224
 
225
- # Try direct match
226
  for action_type in ActionType:
227
  if action_type.value.upper() == text:
228
  return RhythmAction(action_type=action_type)
229
-
230
- # Try partial match
231
  for action_type in ActionType:
232
  if action_type.value.upper() in text:
233
  return RhythmAction(action_type=action_type)
234
-
235
- # Default fallback
236
  return RhythmAction(action_type=ActionType.SLEEP)
237
 
238
 
 
74
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
75
 
76
  SYSTEM_PROMPT = textwrap.dedent("""\
77
+ You are a life-management agent helping a person with HIDDEN preferences.
78
+ You see 5 life meters and a rolling history. The same action affects different
79
+ people differently — you must INFER who you're helping from the rewards and
80
+ meter changes you observe.
81
+
82
+ Each step, output ONE LINE in this exact format:
83
+ ACTION_NAME S M W
84
+
85
+ where ACTION_NAME is one of:
 
  DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
87
  FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH
88
 
89
+ and S, M, W are single digits (0-9) representing your current belief:
90
+ S = social preference (0=hates being social, 9=loves being social)
91
+ M = morning preference (0=night owl, 9=morning person)
92
+ W = work preference (0=avoids work, 9=workaholic)
93
+
94
+ Example: DEEP_WORK 3 8 7
 
95
 
96
+ Tips:
97
+ - Update your belief from rewards: SOCIALIZE giving big reward → raise S.
98
+ - Early in the week, PROBE different actions to learn the person.
99
+ - Late in the week, EXPLOIT what you've learned.
100
+ - Watch for crashes: any meter under 0.1 → big penalty.
101
+ - Connection decays passively — actively maintain it.
102
+ Respond with ONLY the format line, no other text.""")
103
 
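Since this format line is the entire interface between the model and the environment, here is a minimal round-trip sketch of the S/M/W digit codec the prompt describes. The helper names are hypothetical (illustration only, not functions from this repo); the `round(b * 9)` mapping matches how `scripts/pipeline_dryrun.py` builds its "perfect" completions below.

```python
# Hypothetical helpers (not part of the repo) showing the S M W digit codec:
# a belief in [0, 1] per trait maps to a single digit 0-9 and back via d / 9.

def belief_to_digits(belief):
    """Encode a 3-dim belief vector in [0, 1] as 'S M W' digits."""
    return " ".join(str(round(b * 9)) for b in belief)

def digits_to_belief(digits):
    """Decode 0-9 digits back into [0, 1]; quantization error is at most 1/18."""
    return [d / 9.0 for d in digits]

print(belief_to_digits([0.3, 0.8, 0.7]))  # -> "3 7 6"
print(digits_to_belief([3, 8, 7]))        # -> [0.33, 0.89, 0.78] (rounded)
```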
104
 
105
  # ---------------------------------------------------------------------------
 
185
 
186
 
187
  def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
188
+ """Use LLM to pick an action (and emit belief), fall back to heuristic on failure."""
189
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
190
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
191
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
192
 
193
+ history_lines = []
194
+ for h in (getattr(obs, "step_history", None) or [])[-5:]:
195
+ history_lines.append(
196
+ f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
197
+ f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
198
+ f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
199
+ )
200
+ history_str = ""
201
+ if history_lines:
202
+ history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
203
+
204
  user_prompt = textwrap.dedent(f"""\
205
  Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
206
  Remaining steps: {obs.remaining_steps}
 
210
  Cognition: {obs.cognition:.2f}
211
  Progress: {obs.progress:.2f}
212
  Serenity: {obs.serenity:.2f}
213
+ Connection: {obs.connection:.2f}{event_str}{history_str}
214
 
215
+ Choose your action (format: ACTION S M W):""")
216
 
217
  try:
218
  completion = llm_client.chat.completions.create(
 
222
  {"role": "user", "content": user_prompt},
223
  ],
224
  temperature=0.3,
225
+ max_tokens=20,
226
  stream=False,
227
  )
228
  text = (completion.choices[0].message.content or "").strip()
 
232
 
233
 
234
  def parse_llm_action(text: str) -> RhythmAction:
235
+ """Parse LLM response (action+belief format) into a RhythmAction.
 
236
 
237
+ Belief digits are ignored at inference time — they only demonstrate the agent's inferred belief.
238
+ """
239
+ # Reuse the training parser for consistency
240
+ sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "training"))
241
+ try:
242
+ from reward_functions import extract_action_and_belief
243
+ action, _belief, _provided = extract_action_and_belief(text)
244
+ if action is not None:
245
+ return RhythmAction(action_type=action)
246
+ except ImportError:
247
+ pass
248
+
249
+ # Fallback: legacy parsing
250
+ text = text.strip().upper().replace(" ", "_")
251
  for action_type in ActionType:
252
  if action_type.value.upper() == text:
253
  return RhythmAction(action_type=action_type)
254
  for action_type in ActionType:
255
  if action_type.value.upper() in text:
256
  return RhythmAction(action_type=action_type)
257
  return RhythmAction(action_type=ActionType.SLEEP)
258
 
259
 
scripts/diagnostic_replay.py ADDED
@@ -0,0 +1,90 @@
1
+ """
2
+ Diagnostic: verify env_reward replay matches live env.
3
+
4
+ For 10 seeds, plays a random episode while recording each (action, reward).
5
+ Then for each step independently, replays the prefix and asserts the reward
6
+ from the replay matches the recorded reward within 1e-6.
7
+
8
+ Run from rhythm_env root:
9
+ python scripts/diagnostic_replay.py
10
+ """
11
+
12
+ import os
13
+ import random
14
+ import sys
15
+
16
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
17
+
18
+ from models import ActionType, RhythmAction
19
+ from server.rhythm_environment import RhythmEnvironment, MAX_STEPS
20
+ from training.reward_functions import env_reward
21
+
22
+
23
+ def play_and_record(seed: int):
24
+ """Play one random episode, return list of (action_history_at_step, action_taken, reward_received)."""
25
+ env = RhythmEnvironment()
26
+ env.reset(seed=seed)
27
+ rng = random.Random(seed + 7777)
28
+ actions_so_far = []
29
+ records = []
30
+ for _ in range(MAX_STEPS):
31
+ action_type = rng.choice(list(ActionType))
32
+ history_snapshot = list(actions_so_far)
33
+ obs = env.step(RhythmAction(action_type=action_type))
34
+ records.append((history_snapshot, action_type, obs.reward))
35
+ actions_so_far.append(action_type.value)
36
+ if obs.done:
37
+ break
38
+ return records
39
+
40
+
41
+ def main():
42
+ print("=" * 70)
43
+ print("env_reward replay diagnostic")
44
+ print("=" * 70)
45
+
46
+ total_steps = 0
47
+ mismatches = 0
48
+ max_diff = 0.0
49
+
50
+ for seed in range(10):
51
+ records = play_and_record(seed)
52
+ for history, action_type, recorded_reward in records:
53
+ # Build a fake completion that yields this exact action.
54
+ completions = [[{"content": action_type.value.upper()}]]
55
+
56
+ scores = env_reward(
57
+ completions,
58
+ seed=[seed],
59
+ step_index=[len(history)],
60
+ action_history=[history],
61
+ )
62
+ replay_reward = scores[0]
63
+ diff = abs(replay_reward - recorded_reward)
64
+ max_diff = max(max_diff, diff)
65
+ if diff > 1e-6:
66
+ mismatches += 1
67
+ if mismatches <= 3:
68
+ print(
69
+ f" MISMATCH seed={seed} step={len(history)} "
70
+ f"action={action_type.value} recorded={recorded_reward:.6f} "
71
+ f"replay={replay_reward:.6f} diff={diff:.6f}"
72
+ )
73
+ total_steps += 1
74
+
75
+ print()
76
+ print(f"Total steps checked: {total_steps}")
77
+ print(f"Mismatches (>1e-6): {mismatches}")
78
+ print(f"Max diff: {max_diff:.6e}")
79
+ print()
80
+ if mismatches == 0:
81
+ print("PASS: env_reward replay is deterministic and matches live env.")
82
+ return 0
83
+ else:
84
+ print("FAIL: env_reward replay diverges from live env.")
85
+ print("Likely cause: non-determinism in env (RNG state, profile selection, etc.)")
86
+ return 1
87
+
88
+
89
+ if __name__ == "__main__":
90
+ sys.exit(main())
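Condensed to a single step, the contract this diagnostic verifies looks like the sketch below (it follows the call signature used in the script above; the printed number depends on the environment and is not a known constant).

```python
# env_reward reconstructs the episode from (seed, action_history), replays the
# prefix, then steps the completion's action; the score it returns must equal
# the reward a live environment would have produced at that step.
from training.reward_functions import env_reward

score = env_reward(
    [[{"content": "DEEP_WORK"}]],            # one completion, GRPO message format
    seed=[0],                                # which episode to rebuild
    step_index=[2],                          # the step this prompt corresponds to
    action_history=[["sleep", "exercise"]],  # actions taken before this step
)
print(score[0])  # must match the live reward for DEEP_WORK at step 2 of seed 0
```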
scripts/pipeline_dryrun.py ADDED
@@ -0,0 +1,121 @@
1
+ """
2
+ Pipeline dry-run: validate the full meta-RL reward stack end-to-end without GPU.
3
+
4
+ Generates a small dataset, synthesizes completions of varying quality
5
+ (random valid, perfect, garbage, action-only, action+belief, etc.), and
6
+ runs all 4 reward functions. Reports score distributions and prompt sizes.
7
+
8
+ This is the local Gate 2 smoke check — proves the dataset, parser, and
9
+ reward stack are internally consistent before kicking off real training.
10
+
11
+ Run from rhythm_env root:
12
+ python scripts/pipeline_dryrun.py
13
+ """
14
+
15
+ import os
16
+ import random
17
+ import sys
18
+
19
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
20
+
21
+ from training.dataset import generate_dataset
22
+ from training.reward_functions import (
23
+ extract_action_and_belief,
24
+ format_valid,
25
+ action_legal,
26
+ env_reward,
27
+ belief_accuracy,
28
+ )
29
+ from models import ActionType
30
+ from server.rhythm_environment import sample_profile, profile_to_belief_vector
31
+
32
+
33
+ def synth_completion(prompt_seed: int, kind: str) -> str:
34
+ """Synthesize a completion of a given quality."""
35
+ rng = random.Random(prompt_seed)
36
+ actions = list(ActionType)
37
+ action_str = rng.choice(actions).value.upper()
38
+ s, m, w = rng.randint(0, 9), rng.randint(0, 9), rng.randint(0, 9)
39
+ if kind == "perfect":
40
+ # Perfect belief means matching the profile
41
+ true = profile_to_belief_vector(sample_profile(prompt_seed))
42
+ s = round(true[0] * 9)
43
+ m = round(true[1] * 9)
44
+ w = round(true[2] * 9)
45
+ return f"{action_str} {s} {m} {w}"
46
+ if kind == "good":
47
+ return f"{action_str} {s} {m} {w}"
48
+ if kind == "action_only":
49
+ return action_str
50
+ if kind == "garbage":
51
+ return "I don't know what to do here"
52
+ if kind == "verbose":
53
+ return f"My choice is {action_str} with belief {s} {m} {w} based on the rewards I see."
54
+ if kind == "wrong_belief":
55
+ # Output opposite of true belief
56
+ true = profile_to_belief_vector(sample_profile(prompt_seed))
57
+ s = round((1 - true[0]) * 9)
58
+ m = round((1 - true[1]) * 9)
59
+ w = round((1 - true[2]) * 9)
60
+ return f"{action_str} {s} {m} {w}"
61
+ return action_str
62
+
63
+
64
+ def main():
65
+ print("=" * 70)
66
+ print("Pipeline Dry-Run")
67
+ print("=" * 70)
68
+
69
+ # Generate a small dataset (continuous profiles, 10% hint)
70
+ samples = generate_dataset(
71
+ num_episodes=20,
72
+ strategy="mixed",
73
+ max_samples=80,
74
+ profile_mode="continuous",
75
+ hint_fraction=0.1,
76
+ )
77
+
78
+ # Prompt size analysis
79
+ sizes = [len(s["prompt"][0]["content"]) + len(s["prompt"][1]["content"]) for s in samples]
80
+ print(f"\nPrompt sizes (chars): min={min(sizes)}, max={max(sizes)}, mean={sum(sizes)/len(sizes):.0f}")
81
+ # Rough token estimate: ~4 chars per token
82
+ print(f"Estimated tokens: min={min(sizes)//4}, max={max(sizes)//4}, mean={sum(sizes)//len(sizes)//4}")
83
+
84
+ print("\n" + "=" * 70)
85
+ print("Reward distributions across completion kinds")
86
+ print("=" * 70)
87
+
88
+ # For each completion kind, generate completions for first N samples
89
+ kinds = ["perfect", "good", "action_only", "garbage", "verbose", "wrong_belief"]
90
+ n = 30
91
+
92
+ sub = samples[:n]
93
+ seeds_col = [s["seed"] for s in sub]
94
+ history_col = [s["action_history"] for s in sub]
95
+ mode_col = [s["profile_mode"] for s in sub]
96
+
97
+ print(f"\n{'kind':<14} | {'fmt':>6} {'leg':>6} {'env':>6} {'bel':>6} | {'TOTAL':>6}")
98
+ print("-" * 60)
99
+ for kind in kinds:
100
+ completions = [[{"content": synth_completion(s["seed"], kind)}] for s in sub]
101
+ f_scores = format_valid(completions)
102
+ l_scores = action_legal(completions)
103
+ e_scores = env_reward(completions, seed=seeds_col, action_history=history_col, profile_mode=mode_col)
104
+ b_scores = belief_accuracy(completions, seed=seeds_col, action_history=history_col, profile_mode=mode_col)
105
+ f_avg = sum(f_scores) / len(f_scores)
106
+ l_avg = sum(l_scores) / len(l_scores)
107
+ e_avg = sum(e_scores) / len(e_scores)
108
+ b_avg = sum(b_scores) / len(b_scores)
109
+ total = f_avg + l_avg + e_avg + b_avg
110
+ print(f"{kind:<14} | {f_avg:+6.2f} {l_avg:+6.2f} {e_avg:+6.2f} {b_avg:+6.2f} | {total:+6.2f}")
111
+
112
+ print()
113
+ print("Expected ordering (best -> worst total):")
114
+ print(" perfect > good > wrong_belief, action_only > verbose > garbage")
115
+ print("If `perfect > wrong_belief`, the belief signal is gradient-providing.")
116
+ print("If `good > action_only`, format_valid pushes toward emitting belief.")
117
+ print("If `garbage` is most negative, format penalty is doing its job.")
118
+
119
+
120
+ if __name__ == "__main__":
121
+ main()
server/rhythm_environment.py CHANGED
@@ -187,6 +187,104 @@ SOCIAL_ACTIONS = {"family_time", "socialize"}
187
  IDLE_ACTIONS = {"me_time", "binge_watch", "sleep"}
188
  WORK_ACTIONS = {"deep_work", "learn", "admin_work"}
189
 
190
 
191
  class RhythmEnvironment(Environment):
192
  """
@@ -226,6 +324,7 @@ class RhythmEnvironment(Environment):
226
  self._total_reward: float = 0.0
227
  self._recent_actions: list = []
228
  self._step_history: list = []
 
229
 
230
  def get_metadata(self) -> EnvironmentMetadata:
231
  return EnvironmentMetadata(
@@ -256,16 +355,20 @@ class RhythmEnvironment(Environment):
256
 
257
  self._rng = random.Random(effective_seed)
258
 
259
- # Profile selection: explicit kwarg overrides; otherwise use scrambled seed.
260
- # Scrambling decouples profile from episode dynamics (events, etc.) so the
261
- # model cannot memorize seed profile patterns during training.
 
262
  profile_name = kwargs.get("profile")
 
263
  if profile_name and profile_name in PROFILE_MAP:
264
  self._profile = deepcopy(PROFILE_MAP[profile_name])
265
- else:
266
  profile_rng = random.Random(effective_seed ^ 0xA3C5F729)
267
  profile_index = profile_rng.randint(0, len(PROFILES) - 1)
268
  self._profile = deepcopy(PROFILES[profile_index])
269
 
270
  # Initialize meters from profile defaults
271
  initial = self._profile["initial_meters"]
@@ -281,6 +384,7 @@ class RhythmEnvironment(Environment):
281
  self._total_reward = 0.0
282
  self._recent_actions = []
283
  self._step_history = []
 
284
 
285
  self._state = RhythmState(
286
  episode_id=episode_id or str(uuid4()),
@@ -376,6 +480,7 @@ class RhythmEnvironment(Environment):
376
  # Clamp reward
377
  reward = max(-3.0, min(3.0, round(reward, 4)))
378
  self._total_reward += reward
 
379
 
380
  # --- 10. Advance timestep ---
381
  self._timestep += 1
@@ -446,6 +551,24 @@ class RhythmEnvironment(Environment):
446
  def state(self) -> RhythmState:
447
  return self._state
448
 
449
  # ------------------------------------------------------------------
450
  # Internal helpers
451
  # ------------------------------------------------------------------
@@ -581,66 +704,60 @@ class RhythmEnvironment(Environment):
581
  """
582
  Compute final episode score in [0, 1].
583
 
584
- Scoring breakdown:
585
- 0.25 — meter balance (high mean, low variance)
586
- 0.20 — crash-free ratio (no critical meter drops)
587
- 0.20 — progress made
588
- 0.15 — connection maintained
589
- 0.05 — efficiency (average reward)
590
- 0.15 — discovery bonus (profile-adapted strategy in second half)
591
- """
592
- meters = {m: getattr(self, f"_{m}") for m in METERS}
593
 
594
- # 1. Meter balance (0.25): high mean, low variance
595
- values = list(meters.values())
596
- mean_meter = sum(values) / len(values)
597
- variance = sum((v - mean_meter) ** 2 for v in values) / len(values)
598
- balance_score = max(0.0, mean_meter - variance)
599
 
600
- # 2. No crashes (0.20): fraction of steps without critical meters
601
  steps = max(self._timestep, 1)
602
  crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))
603
 
604
- # 3. Progress made (0.20)
605
  progress_score = self._progress
606
 
607
- # 4. Connection maintained (0.15)
608
  connection_score = self._connection
609
 
610
- # 5. Efficiency (0.05): normalized average reward
611
  avg_reward = self._total_reward / steps
612
  efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))
613
 
614
- # 6. Discovery bonus (0.15): did the agent adapt its strategy to the
615
- # hidden profile in the second half of the week (steps 14–27)?
616
- # This is the only component that directly rewards personality discovery.
617
- second_half = self._recent_actions[14:]
618
- if len(second_half) > 0:
619
- profile_name = self._profile["name"]
620
- if profile_name == "introvert_morning":
621
- # Introvert should minimise social actions
622
- social_frac = sum(1 for a in second_half if a in SOCIAL_ACTIONS) / len(second_half)
623
- discovery_score = max(0.0, 1.0 - social_frac * 2.5)
624
- elif profile_name == "extrovert_night_owl":
625
- # Extrovert should embrace social actions
626
- social_frac = sum(1 for a in second_half if a in SOCIAL_ACTIONS) / len(second_half)
627
- discovery_score = min(1.0, social_frac * 2.5)
628
- elif profile_name == "workaholic_stoic":
629
- # Workaholic should front-load work actions
630
- work_frac = sum(1 for a in second_half if a in WORK_ACTIONS) / len(second_half)
631
- discovery_score = min(1.0, work_frac * 1.5)
632
- else:
633
- discovery_score = 0.5
634
- else:
635
- discovery_score = 0.5
636
-
637
  score = (
638
- 0.25 * balance_score
639
- + 0.20 * crash_free_ratio
640
- + 0.20 * progress_score
641
  + 0.15 * connection_score
642
- + 0.05 * efficiency_score
643
- + 0.15 * discovery_score
644
  )
645
  return max(0.0, min(1.0, score))
646
 
 
187
  IDLE_ACTIONS = {"me_time", "binge_watch", "sleep"}
188
  WORK_ACTIONS = {"deep_work", "learn", "admin_work"}
189
 
190
+ # ---------------------------------------------------------------------------
191
+ # Continuous profile sampling (meta-RL training distribution)
192
+ # ---------------------------------------------------------------------------
193
+ #
194
+ # Hardcoded profiles are 3 fixed personalities — memorizable, classification-like.
195
+ # Continuous sampling draws profile parameters from distributions per episode,
196
+ # making memorization impossible and forcing the agent to learn the *skill* of
197
+ # profile inference. This is the core meta-learning move.
198
+ #
199
+ # Belief vector dimensions (for Phase 3 cosine-similarity reward):
200
+ # social_pref: 0 = hates social (introvert), 1 = loves social (extrovert)
201
+ # morning_pref: 0 = night owl, 1 = morning person
202
+ # work_pref: 0 = work-averse, 1 = workaholic
203
+
204
+ def sample_profile(seed: int) -> Dict[str, Any]:
205
+ """Sample a continuous profile deterministically from a seed.
206
+
207
+ Reward weights are drawn from a Dirichlet biased toward non-infrastructure
208
+ meters (progress, serenity, connection). Per-action modifiers are drawn from
209
+ bounded uniforms so any sampled profile is playable.
210
+ """
211
+ rng = random.Random(seed ^ 0xA3C5F729)
212
+
213
+ # Reward weights via Dirichlet (α<1 produces sparse weights → "personalities")
214
+ alphas = [0.5, 0.5, 1.5, 1.5, 1.5] # [vit, cog, prog, ser, conn]
215
+ raw = [rng.gammavariate(a, 1.0) for a in alphas]
216
+ total = sum(raw)
217
+ weights = [w / total for w in raw]
218
+ # Clamp and re-normalize to avoid degenerate weights
219
+ weights = [max(0.02, min(0.80, w)) for w in weights]
220
+ total = sum(weights)
221
+ weights = [w / total for w in weights]
222
+
223
+ return {
224
+ "name": f"sampled_{seed}",
225
+ "social_vitality_multiplier": rng.uniform(0.2, 3.0),
226
+ "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
227
+ "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
228
+ "morning_penalty": rng.uniform(0.4, 0.9) if rng.random() < 0.3 else None,
229
+ "binge_shame": rng.random() < 0.5,
230
+ "progress_serenity_bonus": rng.uniform(0.0, 0.10),
231
+ "idle_serenity_decay": rng.uniform(0.0, 0.10),
232
+ "vitality_decay_rate": rng.uniform(0.0, 0.04),
233
+ "stress_tolerance": rng.uniform(0.15, 0.30),
234
+ "event_impact_multiplier": rng.uniform(0.5, 1.0),
235
+ "connection_decay_rate": rng.uniform(0.005, 0.02),
236
+ "solo_serenity_bonus": rng.uniform(0.0, 0.10),
237
+ "social_connection_multiplier": rng.uniform(1.0, 2.0),
238
+ "social_serenity_bonus": rng.uniform(0.0, 0.06),
239
+ "work_vitality_recovery": rng.uniform(0.0, 0.06),
240
+ "initial_meters": {
241
+ "vitality": 0.7, "cognition": 0.7, "progress": 0.0,
242
+ "serenity": 0.7, "connection": 0.5,
243
+ },
244
+ "reward_weights": dict(zip(METERS, weights)),
245
+ }
246
+
247
+
248
+ def profile_to_belief_vector(profile: Dict[str, Any]) -> List[float]:
249
+ """Reduce a profile to a 3-dim trait vector [social, morning, work] in [0, 1].
250
+
251
+ Used as the ground-truth target for the agent's belief output (Phase 3).
252
+ Aggregates the most-diagnostic modifiers per trait.
253
+ """
254
+ # social_pref: low if social drains vitality a lot, high if social bonuses are big
255
+ # social_vitality_multiplier in [0.2, 3.0]: lower = more extroverted
256
+ sm = profile.get("social_vitality_multiplier", 1.0)
257
+ social_drain_norm = 1.0 - max(0.0, min(1.0, (sm - 0.2) / 2.8)) # invert
258
+ scm = profile.get("social_connection_multiplier", 1.0)
259
+ social_conn_norm = max(0.0, min(1.0, (scm - 1.0) / 1.0))
260
+ ssb = profile.get("social_serenity_bonus", 0.0)
261
+ social_ser_norm = max(0.0, min(1.0, ssb / 0.06))
262
+ social_pref = 0.5 * social_drain_norm + 0.3 * social_conn_norm + 0.2 * social_ser_norm
263
+
264
+ # morning_pref: high if morning bonus exists, low if morning penalty exists
265
+ mcb = profile.get("morning_cognition_bonus")
266
+ mp = profile.get("morning_penalty")
267
+ morning_pref = 0.5
268
+ if mcb is not None:
269
+ morning_pref = 0.5 + 0.5 * max(0.0, min(1.0, (mcb - 0.4) / 1.6))
270
+ if mp is not None:
271
+ morning_pref = min(morning_pref, 0.5 - 0.5 * (1.0 - mp))
272
+
273
+ # work_pref: high if work recovers vitality + progress gives serenity + progress weight high
274
+ wvr = profile.get("work_vitality_recovery", 0.0)
275
+ wvr_norm = max(0.0, min(1.0, wvr / 0.06))
276
+ psb = profile.get("progress_serenity_bonus", 0.0)
277
+ psb_norm = max(0.0, min(1.0, psb / 0.10))
278
+ pw = profile.get("reward_weights", {}).get("progress", 0.2)
279
+ pw_norm = max(0.0, min(1.0, (pw - 0.05) / 0.65))
280
+ work_pref = 0.4 * wvr_norm + 0.3 * psb_norm + 0.3 * pw_norm
281
+
282
+ return [
283
+ max(0.0, min(1.0, social_pref)),
284
+ max(0.0, min(1.0, morning_pref)),
285
+ max(0.0, min(1.0, work_pref)),
286
+ ]
287
+
288
 
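A quick usage sketch of the two helpers above; determinism is the property everything downstream (reward replay, belief targets) relies on. The assert values are generic, not actual numbers from a run.

```python
from server.rhythm_environment import sample_profile, profile_to_belief_vector

# Same seed -> same profile: sampling is deterministic, which is what lets
# training-side reward functions reconstruct an episode's hidden profile.
assert sample_profile(42) == sample_profile(42)

# Reduce a sampled profile to the 3-dim ground-truth belief target.
social, morning, work = profile_to_belief_vector(sample_profile(42))
assert all(0.0 <= v <= 1.0 for v in (social, morning, work))
```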
289
  class RhythmEnvironment(Environment):
290
  """
 
324
  self._total_reward: float = 0.0
325
  self._recent_actions: list = []
326
  self._step_history: list = []
327
+ self._step_rewards: list = [] # per-step rewards (for adaptation_score in grader)
328
 
329
  def get_metadata(self) -> EnvironmentMetadata:
330
  return EnvironmentMetadata(
 
355
 
356
  self._rng = random.Random(effective_seed)
357
 
358
+ # Profile selection (three modes):
359
+ # 1. Explicit hardcoded profile name → legacy 3-profile mode (eval comparison)
360
+ # 2. profile_mode='discrete' → pick from 3 hardcoded profiles by scrambled seed
361
+ # 3. Default → sampled continuous profile (meta-RL training distribution)
362
  profile_name = kwargs.get("profile")
363
+ profile_mode = kwargs.get("profile_mode", "continuous")
364
  if profile_name and profile_name in PROFILE_MAP:
365
  self._profile = deepcopy(PROFILE_MAP[profile_name])
366
+ elif profile_mode == "discrete":
367
  profile_rng = random.Random(effective_seed ^ 0xA3C5F729)
368
  profile_index = profile_rng.randint(0, len(PROFILES) - 1)
369
  self._profile = deepcopy(PROFILES[profile_index])
370
+ else:
371
+ self._profile = sample_profile(effective_seed)
372
 
373
  # Initialize meters from profile defaults
374
  initial = self._profile["initial_meters"]
 
384
  self._total_reward = 0.0
385
  self._recent_actions = []
386
  self._step_history = []
387
+ self._step_rewards = []
388
 
389
  self._state = RhythmState(
390
  episode_id=episode_id or str(uuid4()),
 
480
  # Clamp reward
481
  reward = max(-3.0, min(3.0, round(reward, 4)))
482
  self._total_reward += reward
483
+ self._step_rewards.append(reward)
484
 
485
  # --- 10. Advance timestep ---
486
  self._timestep += 1
 
551
  def state(self) -> RhythmState:
552
  return self._state
553
 
554
+ def get_belief_target(self) -> List[float]:
555
+ """Return the 3-dim ground-truth belief vector for the active profile.
556
+
557
+ Used during training to compute belief-accuracy reward (Phase 3).
558
+ Privileged information — not exposed via observation.
559
+ """
560
+ return profile_to_belief_vector(self._profile)
561
+
562
+ def get_profile_hint(self) -> Dict[str, float]:
563
+ """Return a coarse profile hint usable in observation during curriculum.
564
+
565
+ Returns the 3-dim belief vector with descriptive keys. The hint is
566
+ included in observations only when the env is started with
567
+ profile_visible=True — this is the curriculum's "visible" phase.
568
+ """
569
+ b = profile_to_belief_vector(self._profile)
570
+ return {"social_pref": round(b[0], 3), "morning_pref": round(b[1], 3), "work_pref": round(b[2], 3)}
571
+
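To illustrate how the training side might consume these accessors, here is an MAE-style belief-scoring sketch. The exact mapping inside belief_accuracy is not shown in this diff, so the scoring line below is an assumption, not the shipped formula.

```python
from server.rhythm_environment import RhythmEnvironment

env = RhythmEnvironment()
env.reset(seed=7)  # default profile_mode='continuous'

target = env.get_belief_target()              # privileged [social, morning, work]
agent_belief = [d / 9.0 for d in (5, 4, 8)]   # digits parsed from e.g. "LEARN 5 4 8"

# Hypothetical scoring: mean absolute error, rescaled so zero error -> 1.0.
mae = sum(abs(a - t) for a, t in zip(agent_belief, target)) / 3.0
print(f"belief_mae={mae:.3f}, illustrative score={1.0 - 2.0 * mae:.2f}")
```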
572
  # ------------------------------------------------------------------
573
  # Internal helpers
574
  # ------------------------------------------------------------------
 
704
  """
705
  Compute final episode score in [0, 1].
706
 
707
+ Components (meta-learning aligned):
708
+ 0.20 — crash_free_ratio: no critical meter drops
709
+ 0.25 — progress: career/skill growth
710
+ 0.15 — connection: relationship maintained
711
+ 0.30 — adaptation_score: agent got better as it learned the user
712
+ 0.10 — efficiency: bounded normalized average reward
713
 
714
+ adaptation_score is the meta-learning signal: late-half mean per-step
715
+ reward minus early-half mean, gated by absolute late-half quality so
716
+ a "terrible-then-mediocre" exploit cannot win.
717
 
718
+ Per-step reward is already profile-weighted via _compute_reward(), so
719
+ a high late-half mean directly means the agent figured out and
720
+ optimized for THIS profile's preferences.
721
+ """
722
  steps = max(self._timestep, 1)
723
+
724
+ # 1. Crash-free ratio (0.20)
725
  crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))
726
 
727
+ # 2. Progress (0.25)
728
  progress_score = self._progress
729
 
730
+ # 3. Connection (0.15)
731
  connection_score = self._connection
732
 
733
+ # 4. Adaptation score (0.30): the KEY meta-learning signal.
734
+ # Split rewards in halves; positive only if late half is non-negative
735
+ # AND late > early. Normalized to [0, 1].
736
+ half = max(steps // 2, 1)
737
+ early = self._step_rewards[:half]
738
+ late = self._step_rewards[half:]
739
+ if early and late:
740
+ mean_early = sum(early) / len(early)
741
+ mean_late = sum(late) / len(late)
742
+ # late_quality: rewards typically in [-1, 1] per step, normalize
743
+ late_quality = max(0.0, min(1.0, (mean_late + 1.0) / 2.0))
744
+ gain = mean_late - mean_early
745
+ # gain typically in [-1, 1]; clip to [0, 1] (only positive counts)
746
+ gain_norm = max(0.0, min(1.0, gain))
747
+ adaptation_score = gain_norm * late_quality
748
+ else:
749
+ adaptation_score = 0.0
750
+
751
+ # 5. Efficiency (0.10): bounded normalized average reward
752
  avg_reward = self._total_reward / steps
753
  efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))
754
 
755
  score = (
756
+ 0.20 * crash_free_ratio
757
+ + 0.25 * progress_score
 
758
  + 0.15 * connection_score
759
+ + 0.30 * adaptation_score
760
+ + 0.10 * efficiency_score
761
  )
762
  return max(0.0, min(1.0, score))
763
 
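To make the gate concrete, here is a toy re-computation of adaptation_score using the same arithmetic as the grader above (self-contained, no env needed; the inputs are synthetic reward traces, not real episodes).

```python
# Mirrors the grader's adaptation computation on synthetic per-step rewards.
def toy_adaptation(rewards):
    half = max(len(rewards) // 2, 1)
    early, late = rewards[:half], rewards[half:]
    mean_early = sum(early) / len(early)
    mean_late = sum(late) / len(late)
    late_quality = max(0.0, min(1.0, (mean_late + 1.0) / 2.0))  # absolute gate
    gain_norm = max(0.0, min(1.0, mean_late - mean_early))      # only improvement counts
    return gain_norm * late_quality

print(toy_adaptation([0.0] * 14 + [0.6] * 14))    # 0.48: probed early, exploited late
print(toy_adaptation([-2.0] * 14 + [-1.0] * 14))  # 0.0: improved, but late half still bad
print(toy_adaptation([0.6] * 28))                 # 0.0: good but flat, no adaptation credit
```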
tests/test_rhythm_env.py CHANGED
@@ -356,10 +356,19 @@ class TestEdgeCases:
356
 
357
  def test_state_exposes_profile(self, env):
358
  """State should include profile_name for debugging."""
 
359
  env.reset(seed=0)
360
  assert env.state.profile_name != ""
361
  assert env.state.profile_name in [p["name"] for p in PROFILES]
362
 
363
  def test_all_action_types_valid(self, env):
364
  """Every ActionType should be processable without error."""
365
  env.reset(seed=0)
 
356
 
357
  def test_state_exposes_profile(self, env):
358
  """State should include profile_name for debugging."""
359
+ # Default mode: continuous profile (name like 'sampled_0')
360
  env.reset(seed=0)
361
  assert env.state.profile_name != ""
362
+ assert env.state.profile_name.startswith("sampled_")
363
+
364
+ # Discrete mode: name from hardcoded PROFILES list
365
+ env.reset(seed=0, profile_mode="discrete")
366
  assert env.state.profile_name in [p["name"] for p in PROFILES]
367
 
368
+ # Explicit profile: name matches the requested profile
369
+ env.reset(seed=0, profile="workaholic_stoic")
370
+ assert env.state.profile_name == "workaholic_stoic"
371
+
372
  def test_all_action_types_valid(self, env):
373
  """Every ActionType should be processable without error."""
374
  env.reset(seed=0)
training/RhythmEnv_GRPO_Training.ipynb CHANGED
@@ -4,15 +4,17 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# RhythmEnv GRPO Training \u2014 Life Simulator\n",
8
  "\n",
9
- "Train an LLM agent to balance 5 life meters across a 7-day week using **Group Relative Policy Optimization (GRPO)**.\n",
 
 
 
 
10
  "\n",
11
- "The agent must discover hidden personality profiles (introvert, extrovert, workaholic) through reward signals alone.\n",
12
  "\n",
13
- "**Requirements**: Free Colab GPU (T4) is sufficient. Training takes ~30-60 minutes for 500 steps.\n",
14
- "\n",
15
- "**Environment**: https://huggingface.co/spaces/InosLihka/rhythm_env"
16
  ]
17
  },
18
  {
@@ -31,9 +33,9 @@
31
  "%%capture\n",
32
  "!pip install --upgrade uv\n",
33
  "!uv pip install --system torch>=2.8.0 triton>=3.4.0 torchvision bitsandbytes\n",
34
- "!uv pip install --system transformers==4.56.2 trl==0.22.2 datasets wandb\n",
35
  "!pip install unsloth\n",
36
- "!pip install fastapi uvicorn pydantic matplotlib"
37
  ]
38
  },
39
  {
@@ -50,8 +52,9 @@
50
  "outputs": [],
51
  "source": [
52
  "import os\n",
 
53
  "\n",
54
- "# Clone the repo (update URL to your actual repo)\n",
55
  "REPO_URL = \"https://huggingface.co/spaces/InosLihka/rhythm_env\"\n",
56
  "ENV_DIR = \"/content/rhythm_env\"\n",
57
  "\n",
@@ -59,13 +62,10 @@
59
  " !git clone {REPO_URL} {ENV_DIR}\n",
60
  "\n",
61
  "os.chdir(ENV_DIR)\n",
62
- "\n",
63
- "import sys\n",
64
  "sys.path.insert(0, ENV_DIR)\n",
65
  "sys.path.insert(0, os.path.join(ENV_DIR, \"training\"))\n",
66
- "\n",
67
  "print(f\"Working directory: {os.getcwd()}\")\n",
68
- "print(f\"Files: {os.listdir('.')}\")"
69
  ]
70
  },
71
  {
@@ -86,12 +86,12 @@
86
  "\n",
87
  "env = RhythmEnvironment()\n",
88
  "obs = env.reset(seed=0)\n",
89
- "print(f\"Initial state \u2014 Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
90
  " f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
91
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
92
  "\n",
93
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
94
- "print(f\"\\nAfter DEEP_WORK \u2014 Reward: {obs.reward:.3f}\")\n",
95
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
96
  "print(\"\\nEnvironment is working!\")"
97
  ]
@@ -109,26 +109,9 @@
109
  "metadata": {},
110
  "outputs": [],
111
  "source": [
112
- "from training.inference_eval import run_episode, heuristic_action\n",
113
- "import json\n",
114
- "\n",
115
- "PROFILES = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
116
- "NUM_EVAL_EPISODES = 5\n",
117
- "\n",
118
- "baseline_results = {}\n",
119
- "\n",
120
- "for strategy in [\"heuristic\", \"random\"]:\n",
121
- " baseline_results[strategy] = {}\n",
122
- " for profile in PROFILES:\n",
123
- " scores = []\n",
124
- " for ep in range(NUM_EVAL_EPISODES):\n",
125
- " result = run_episode(profile=profile, seed=ep, strategy=strategy)\n",
126
- " scores.append(result[\"final_score\"])\n",
127
- " avg = sum(scores) / len(scores)\n",
128
- " baseline_results[strategy][profile] = avg\n",
129
- " print(f\"{strategy:10s} | {profile:25s} | avg_score={avg:.3f}\")\n",
130
- "\n",
131
- "print(\"\\nBaseline scores saved. These are our 'before training' numbers.\")"
132
  ]
133
  },
134
  {
@@ -143,7 +126,33 @@
143
  "execution_count": null,
144
  "metadata": {},
145
  "outputs": [],
146
- "source": "from dataset import generate_dataset\nfrom datasets import Dataset\n\nNUM_EPISODES = 200\nMAX_SAMPLES = 2000\n\nraw_samples = generate_dataset(\n num_episodes=NUM_EPISODES,\n strategy=\"mixed\",\n max_samples=MAX_SAMPLES,\n)\n\n# Include replay metadata so env_reward can reconstruct the exact state\ndataset = Dataset.from_list([\n {\n \"prompt\": sample[\"prompt\"],\n \"seed\": sample[\"seed\"],\n \"step_index\": sample[\"step_index\"],\n \"action_history\": sample[\"action_history\"],\n }\n for sample in raw_samples\n])\n\nprint(f\"\\nDataset size: {len(dataset)}\")\nprint(f\"\\nSample prompt:\")\nprint(dataset[0][\"prompt\"][1][\"content\"])"
147
  },
148
  {
149
  "cell_type": "markdown",
@@ -160,9 +169,9 @@
160
  "source": [
161
  "from unsloth import FastLanguageModel\n",
162
  "\n",
163
- "MODEL_NAME = \"unsloth/Qwen2.5-3B-Instruct\" # Good balance of capability vs VRAM\n",
164
- "MAX_SEQ_LENGTH = 768\n",
165
- "LORA_RANK = 4\n",
166
  "\n",
167
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
168
  " model_name=MODEL_NAME,\n",
@@ -173,15 +182,15 @@
173
  "model = FastLanguageModel.get_peft_model(\n",
174
  " model,\n",
175
  " r=LORA_RANK,\n",
176
- " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
177
- " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
178
  " lora_alpha=LORA_RANK * 2,\n",
179
  " use_gradient_checkpointing=\"unsloth\",\n",
180
  " random_state=3407,\n",
181
  ")\n",
182
- "\n",
183
- "print(f\"Model loaded: {MODEL_NAME}\")\n",
184
- "print(f\"LoRA rank: {LORA_RANK}, seq_length: {MAX_SEQ_LENGTH}\")"
185
  ]
186
  },
187
  {
@@ -196,7 +205,17 @@
196
  "execution_count": null,
197
  "metadata": {},
198
  "outputs": [],
199
- "source": "from reward_functions import format_valid, action_legal, env_reward\n\n# Three-layer reward stack:\n# 1. format_valid — did the model output a parseable action name? (+1.0 / -2.0)\n# 2. action_legal — is it one of the 10 valid actions? (+0.5 / -1.0)\n# 3. env_reward — actual reward from stepping the environment (varies)\n#\n# GRPOTrainer passes seed/step_index/action_history as dataset columns,\n# so env_reward replays the exact episode state for each training prompt.\nreward_funcs = [format_valid, action_legal, env_reward]\n\n# Quick sanity check\ntest_completions = [\n [{\"content\": \"DEEP_WORK\"}],\n [{\"content\": \"garbage\"}],\n [{\"content\": \"SLEEP\"}],\n]\nprint(\"format_valid:\", format_valid(test_completions))\nprint(\"action_legal:\", action_legal(test_completions))\nprint(\"env_reward (no history, seed=0):\", env_reward(test_completions))"
200
  },
201
  {
202
  "cell_type": "markdown",
@@ -211,12 +230,9 @@
211
  "metadata": {},
212
  "outputs": [],
213
  "source": [
214
- "import wandb\n",
215
- "\n",
216
- "# Optional: login to W&B for tracking (or set report_to=\"none\" below)\n",
217
- "# wandb.login(key=\"YOUR_KEY\")\n",
218
- "\n",
219
- "REPORT_TO = \"none\" # Change to \"wandb\" if you have a key"
220
  ]
221
  },
222
  {
@@ -227,18 +243,18 @@
227
  "source": [
228
  "from trl import GRPOConfig, GRPOTrainer\n",
229
  "\n",
230
- "MAX_STEPS = 500 # Increase to 1000 if time allows\n",
231
  "NUM_GENERATIONS = 4\n",
232
- "LEARNING_RATE = 5e-5 # Reduced from default — lower lr prevents destabilizing early gradient steps\n",
233
- "\n",
234
- "max_prompt_length = 400\n",
235
- "max_completion_length = 16 # Action names are 3-15 chars\n",
236
  "\n",
237
  "training_args = GRPOConfig(\n",
238
  " temperature=1.0,\n",
239
  " learning_rate=LEARNING_RATE,\n",
240
- " beta=0.1, # KL penalty — higher = more conservative, prevents policy drift\n",
241
- " max_grad_norm=0.5, # Gradient clipping prevents large destabilizing updates\n",
242
  " weight_decay=0.001,\n",
243
  " warmup_ratio=0.1,\n",
244
  " lr_scheduler_type=\"linear\",\n",
@@ -247,12 +263,12 @@
247
  " per_device_train_batch_size=1,\n",
248
  " gradient_accumulation_steps=4,\n",
249
  " num_generations=NUM_GENERATIONS,\n",
250
- " max_prompt_length=max_prompt_length,\n",
251
- " max_completion_length=max_completion_length,\n",
252
  " max_steps=MAX_STEPS,\n",
253
- " save_steps=100,\n",
254
- " report_to=REPORT_TO,\n",
255
- " output_dir=\"outputs/rhythmenv_trained\",\n",
256
  ")\n",
257
  "\n",
258
  "trainer = GRPOTrainer(\n",
@@ -262,12 +278,7 @@
262
  " args=training_args,\n",
263
  " train_dataset=dataset,\n",
264
  ")\n",
265
- "\n",
266
- "print(f\"Training config: {MAX_STEPS} steps, {NUM_GENERATIONS} generations, lr={LEARNING_RATE}\")\n",
267
- "print(f\" beta=0.1 (higher KL penalty = more conservative = less policy drift)\")\n",
268
- "print(f\" max_grad_norm=0.5 (gradient clipping for stability)\")\n",
269
- "print(f\" max_completion_length=16 (action names only, no verbose outputs)\")\n",
270
- "print(\"Starting training...\")"
271
  ]
272
  },
273
  {
@@ -292,14 +303,14 @@
292
  "metadata": {},
293
  "outputs": [],
294
  "source": [
295
- "OUTPUT_DIR = \"outputs/rhythmenv_trained\"\n",
296
  "\n",
297
  "model.save_pretrained_merged(\n",
298
  " OUTPUT_DIR,\n",
299
  " tokenizer,\n",
300
  " save_method=\"merged_16bit\",\n",
301
  ")\n",
302
- "print(f\"Model saved to: {OUTPUT_DIR}\")"
303
  ]
304
  },
305
  {
@@ -353,7 +364,7 @@
353
  "ax.plot(steps[:len(losses)], losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
354
  "ax.set_xlabel(\"Training Step\", fontsize=12)\n",
355
  "ax.set_ylabel(\"Loss\", fontsize=12)\n",
356
- "ax.set_title(\"GRPO Training Loss \u2014 RhythmEnv Life Simulator\", fontsize=14)\n",
357
  "ax.grid(True, alpha=0.3)\n",
358
  "plt.tight_layout()\n",
359
  "plt.savefig(\"plots/training_loss.png\", dpi=150)\n",
@@ -369,10 +380,10 @@
369
  " import numpy as np\n",
370
  " r = np.array(rewards)\n",
371
  " s = np.array(reward_stds)\n",
372
- " ax.fill_between(reward_steps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"±1 Std Dev\")\n",
373
  " ax.set_xlabel(\"Training Step\", fontsize=12)\n",
374
  " ax.set_ylabel(\"Mean Reward\", fontsize=12)\n",
375
- " ax.set_title(\"GRPO Mean Reward \u2014 RhythmEnv Life Simulator\", fontsize=14)\n",
376
  " ax.legend()\n",
377
  " ax.grid(True, alpha=0.3)\n",
378
  " plt.tight_layout()\n",
@@ -403,39 +414,10 @@
403
  "metadata": {},
404
  "outputs": [],
405
  "source": [
406
- "from training.inference_eval import run_episode, model_action\n",
407
- "from unsloth import FastLanguageModel\n",
408
  "\n",
409
- "# Load trained model for inference\n",
410
- "trained_model, trained_tokenizer = FastLanguageModel.from_pretrained(\n",
411
- " model_name=\"outputs/rhythmenv_trained\",\n",
412
- " load_in_4bit=True,\n",
413
- " max_seq_length=768,\n",
414
- ")\n",
415
- "FastLanguageModel.for_inference(trained_model)\n",
416
- "\n",
417
- "PROFILES = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
418
- "NUM_EVAL = 5\n",
419
- "\n",
420
- "trained_results = {}\n",
421
- "all_eval_results = []\n",
422
- "\n",
423
- "for profile in PROFILES:\n",
424
- " scores = []\n",
425
- " for ep in range(NUM_EVAL):\n",
426
- " result = run_episode(\n",
427
- " profile=profile, seed=ep, strategy=\"model\",\n",
428
- " model=trained_model, tokenizer=trained_tokenizer,\n",
429
- " )\n",
430
- " scores.append(result[\"final_score\"])\n",
431
- " all_eval_results.append(result)\n",
432
- " avg = sum(scores) / len(scores)\n",
433
- " trained_results[profile] = avg\n",
434
- " print(f\"Trained | {profile:25s} | avg_score={avg:.3f} | scores={[round(s,3) for s in scores]}\")\n",
435
- "\n",
436
- "# Save eval results\n",
437
- "with open(\"plots/eval_results.json\", \"w\") as f:\n",
438
- " json.dump(all_eval_results, f, indent=2)"
439
  ]
440
  },
441
  {
@@ -451,52 +433,71 @@
451
  "metadata": {},
452
  "outputs": [],
453
  "source": [
454
- "import matplotlib.pyplot as plt\n",
455
  "import numpy as np\n",
 
456
  "\n",
457
- "profiles_short = [\"Introvert\\nMorning\", \"Extrovert\\nNight Owl\", \"Workaholic\\nStoic\"]\n",
458
- "profile_keys = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
459
- "\n",
460
- "random_scores = [baseline_results[\"random\"][p] for p in profile_keys]\n",
461
- "heuristic_scores = [baseline_results[\"heuristic\"][p] for p in profile_keys]\n",
462
- "trained_scores = [trained_results[p] for p in profile_keys]\n",
463
- "\n",
464
- "x = np.arange(len(profiles_short))\n",
465
- "width = 0.25\n",
466
- "\n",
467
- "fig, ax = plt.subplots(figsize=(10, 6))\n",
468
- "bars1 = ax.bar(x - width, random_scores, width, label=\"Random Baseline\", color=\"#94a3b8\")\n",
469
- "bars2 = ax.bar(x, heuristic_scores, width, label=\"Heuristic Baseline\", color=\"#60a5fa\")\n",
470
- "bars3 = ax.bar(x + width, trained_scores, width, label=\"GRPO Trained\", color=\"#22c55e\")\n",
471
- "\n",
472
- "ax.set_ylabel(\"Final Episode Score (0-1)\", fontsize=12)\n",
473
- "ax.set_title(\"RhythmEnv: Baseline vs Trained Agent Across Hidden Profiles\", fontsize=14)\n",
474
- "ax.set_xticks(x)\n",
475
- "ax.set_xticklabels(profiles_short, fontsize=11)\n",
476
- "ax.legend(fontsize=11)\n",
477
- "ax.set_ylim(0, 1.0)\n",
478
- "ax.grid(axis=\"y\", alpha=0.3)\n",
479
- "\n",
480
- "# Add value labels on bars\n",
481
- "for bars in [bars1, bars2, bars3]:\n",
482
- " for bar in bars:\n",
483
- " height = bar.get_height()\n",
484
- " ax.annotate(f\"{height:.3f}\",\n",
485
- " xy=(bar.get_x() + bar.get_width() / 2, height),\n",
486
- " xytext=(0, 3), textcoords=\"offset points\",\n",
487
- " ha=\"center\", va=\"bottom\", fontsize=9)\n",
488
  "\n",
489
  "plt.tight_layout()\n",
 
490
  "plt.savefig(\"plots/baseline_vs_trained.png\", dpi=150)\n",
491
  "plt.show()\n",
492
  "print(\"Saved: plots/baseline_vs_trained.png\")\n",
493
  "\n",
494
  "# Print summary table\n",
495
- "print(f\"\\n{'Profile':<25} {'Random':>10} {'Heuristic':>10} {'Trained':>10} {'Improvement':>12}\")\n",
496
- "print(\"-\" * 70)\n",
497
- "for i, p in enumerate(profile_keys):\n",
498
- " improvement = trained_scores[i] - heuristic_scores[i]\n",
499
- " print(f\"{p:<25} {random_scores[i]:>10.3f} {heuristic_scores[i]:>10.3f} {trained_scores[i]:>10.3f} {improvement:>+12.3f}\")"
500
  ]
501
  },
502
  {
@@ -519,15 +520,15 @@
519
  "source": [
520
  "# List all generated artifacts\n",
521
  "print(\"Generated files:\")\n",
522
- "for f in os.listdir(\"plots\"):\n",
523
  " size = os.path.getsize(f\"plots/{f}\")\n",
524
  " print(f\" plots/{f} ({size:,} bytes)\")\n",
525
  "\n",
526
- "print(f\"\\nModel saved at: outputs/rhythmenv_trained/\")\n",
527
  "print(\"\\nNext steps:\")\n",
528
  "print(\"1. Download plots/*.png and commit to your repo\")\n",
529
  "print(\"2. Embed plots in README.md\")\n",
530
- "print(\"3. Create mini-blog or video\")"
531
  ]
532
  },
533
  {
@@ -537,8 +538,8 @@
537
  "outputs": [],
538
  "source": [
539
  "# Optional: zip everything for easy download\n",
540
- "!zip -r /content/rhythmenv_training_artifacts.zip plots/ outputs/rhythmenv_trained/training_config.json\n",
541
- "print(\"Download: /content/rhythmenv_training_artifacts.zip\")"
542
  ]
543
  }
544
  ],
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n",
8
  "\n",
9
+ "Train an LLM agent to *learn how to learn a person*. The agent receives a 4-layer reward:\n",
10
+ "- format_valid (parseable ACTION + 3 belief digits)\n",
11
+ "- action_legal (one of 10 valid actions)\n",
12
+ "- env_reward (actual env reward via seed-replay)\n",
13
+ "- belief_accuracy (cosine-MAE vs hidden true profile vector)\n",
14
  "\n",
15
+ "Continuous profile space + curriculum warmup + adaptation-aware grader \u2192 trains a meta-policy that generalizes to unseen personalities.\n",
16
  "\n",
17
+ "Default config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
18
  ]
19
  },
20
  {
 
33
  "%%capture\n",
34
  "!pip install --upgrade uv\n",
35
  "!uv pip install --system torch>=2.8.0 triton>=3.4.0 torchvision bitsandbytes\n",
36
+ "!uv pip install --system transformers==4.56.2 trl==0.22.2 datasets wandb peft accelerate\n",
37
  "!pip install unsloth\n",
38
+ "!pip install fastapi uvicorn pydantic matplotlib openenv-core\n"
39
  ]
40
  },
41
  {
 
52
  "outputs": [],
53
  "source": [
54
  "import os\n",
55
+ "import sys\n",
56
  "\n",
57
+ "# HF Space contains the meta-RL code (server/, models.py, training/, inference.py).\n",
58
  "REPO_URL = \"https://huggingface.co/spaces/InosLihka/rhythm_env\"\n",
59
  "ENV_DIR = \"/content/rhythm_env\"\n",
60
  "\n",
 
62
  " !git clone {REPO_URL} {ENV_DIR}\n",
63
  "\n",
64
  "os.chdir(ENV_DIR)\n",
65
  "sys.path.insert(0, ENV_DIR)\n",
66
  "sys.path.insert(0, os.path.join(ENV_DIR, \"training\"))\n",
 
67
  "print(f\"Working directory: {os.getcwd()}\")\n",
68
+ "print(f\"Files: {sorted(os.listdir('.'))}\")\n"
69
  ]
70
  },
71
  {
 
86
  "\n",
87
  "env = RhythmEnvironment()\n",
88
  "obs = env.reset(seed=0)\n",
89
+ "print(f\"Initial state \u2014 Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
90
  " f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
91
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
92
  "\n",
93
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
94
+ "print(f\"\\nAfter DEEP_WORK \u2014 Reward: {obs.reward:.3f}\")\n",
95
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
96
  "print(\"\\nEnvironment is working!\")"
97
  ]
 
109
  "metadata": {},
110
  "outputs": [],
111
  "source": [
112
+ "# Baselines (random + heuristic) under the new meta-RL grader\n",
113
+ "# Runs 3 conditions: discrete-3-profiles (legacy), continuous-in-dist, continuous-OOD\n",
114
+ "!python training/inference_eval.py --num_episodes 5 --output_file eval_baselines_meta.json\n"
115
  ]
116
  },
117
  {
 
126
  "execution_count": null,
127
  "metadata": {},
128
  "outputs": [],
129
+ "source": [
130
+ "from dataset import generate_dataset\n",
131
+ "from datasets import Dataset\n",
132
+ "\n",
133
+ "NUM_EPISODES = 300\n",
134
+ "MAX_SAMPLES = 3000\n",
135
+ "HINT_FRACTION = 0.15 # curriculum warmup\n",
136
+ "\n",
137
+ "raw_samples = generate_dataset(\n",
138
+ " num_episodes=NUM_EPISODES,\n",
139
+ " strategy=\"mixed\",\n",
140
+ " max_samples=MAX_SAMPLES,\n",
141
+ " profile_mode=\"continuous\", # meta-RL: sampled per episode\n",
142
+ " hint_fraction=HINT_FRACTION,\n",
143
+ ")\n",
144
+ "\n",
145
+ "dataset = Dataset.from_list([\n",
146
+ " {\n",
147
+ " \"prompt\": s[\"prompt\"],\n",
148
+ " \"seed\": s[\"seed\"],\n",
149
+ " \"step_index\": s[\"step_index\"],\n",
150
+ " \"action_history\": s[\"action_history\"],\n",
151
+ " \"profile_mode\": s[\"profile_mode\"],\n",
152
+ " } for s in raw_samples\n",
153
+ "])\n",
154
+ "print(f\"Dataset size: {len(dataset)}\")\n"
155
+ ]
156
  },
157
  {
158
  "cell_type": "markdown",
 
169
  "source": [
170
  "from unsloth import FastLanguageModel\n",
171
  "\n",
172
+ "MODEL_NAME = \"unsloth/Qwen2.5-3B-Instruct\"\n",
173
+ "MAX_SEQ_LENGTH = 1024 # bumped from 768 for longer prompts\n",
174
+ "LORA_RANK = 8 # bumped from 4 for meta-RL capacity\n",
175
  "\n",
176
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
177
  " model_name=MODEL_NAME,\n",
 
182
  "model = FastLanguageModel.get_peft_model(\n",
183
  " model,\n",
184
  " r=LORA_RANK,\n",
185
+ " target_modules=[\n",
186
+ " \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
187
+ " \"gate_proj\", \"up_proj\", \"down_proj\",\n",
188
+ " ],\n",
189
  " lora_alpha=LORA_RANK * 2,\n",
190
  " use_gradient_checkpointing=\"unsloth\",\n",
191
  " random_state=3407,\n",
192
  ")\n",
193
+ "print(f\"Loaded {MODEL_NAME} with LoRA rank {LORA_RANK}\")\n"
 
 
194
  ]
195
  },
196
  {
 
205
  "execution_count": null,
206
  "metadata": {},
207
  "outputs": [],
208
+ "source": [
209
+ "from reward_functions import format_valid, action_legal, env_reward, belief_accuracy\n",
210
+ "\n",
211
+ "# Four-layer reward stack:\n",
212
+ "# 1. format_valid -- output parseable as ACTION + 3 belief digits\n",
213
+ "# 2. action_legal -- action is one of 10 valid types\n",
214
+ "# 3. env_reward -- actual env reward via seed-replay (cannot be fabricated)\n",
215
+ "# 4. belief_accuracy -- how close belief is to hidden profile (META-RL signal)\n",
216
+ "reward_funcs = [format_valid, action_legal, env_reward, belief_accuracy]\n",
217
+ "print(\"Reward functions: format_valid, action_legal, env_reward, belief_accuracy\")\n"
218
+ ]
219
  },
220
  {
221
  "cell_type": "markdown",
 
230
  "metadata": {},
231
  "outputs": [],
232
  "source": [
233
+ "# Optional: enable W&B for live training-loss tracking.\n",
234
+ "# (Skip this cell if you don't have a W&B account.)\n",
235
+ "# import wandb; wandb.login(key=\"YOUR_KEY\")\n"
 
 
 
236
  ]
237
  },
238
  {
 
243
  "source": [
244
  "from trl import GRPOConfig, GRPOTrainer\n",
245
  "\n",
246
+ "MAX_STEPS = 1500 # bumped from 500 for meta-RL\n",
247
  "NUM_GENERATIONS = 4\n",
248
+ "LEARNING_RATE = 5e-5\n",
249
+ "BETA = 0.1 # raise to 0.2 if training is unstable\n",
250
+ "MAX_PROMPT_LEN = 600 # bumped from 400 for history + hint\n",
251
+ "MAX_COMPLETION_LEN = 20 # bumped from 16 for ACTION_NAME D D D format\n",
252
  "\n",
253
  "training_args = GRPOConfig(\n",
254
  " temperature=1.0,\n",
255
  " learning_rate=LEARNING_RATE,\n",
256
+ " beta=BETA,\n",
257
+ " max_grad_norm=0.5,\n",
258
  " weight_decay=0.001,\n",
259
  " warmup_ratio=0.1,\n",
260
  " lr_scheduler_type=\"linear\",\n",
 
263
  " per_device_train_batch_size=1,\n",
264
  " gradient_accumulation_steps=4,\n",
265
  " num_generations=NUM_GENERATIONS,\n",
266
+ " max_prompt_length=MAX_PROMPT_LEN,\n",
267
+ " max_completion_length=MAX_COMPLETION_LEN,\n",
268
  " max_steps=MAX_STEPS,\n",
269
+ " save_steps=250, # checkpoint every 250 (Colab can disconnect)\n",
270
+ " report_to=\"none\",\n",
271
+ " output_dir=\"outputs/rhythmenv_meta_trained\",\n",
272
  ")\n",
273
  "\n",
274
  "trainer = GRPOTrainer(\n",
 
278
  " args=training_args,\n",
279
  " train_dataset=dataset,\n",
280
  ")\n",
281
+ "print(f\"Training {MAX_STEPS} steps, {NUM_GENERATIONS} generations per prompt\")\n"
 
 
 
 
 
282
  ]
283
  },
284
  {
 
303
  "metadata": {},
304
  "outputs": [],
305
  "source": [
306
+ "OUTPUT_DIR = \"outputs/rhythmenv_meta_trained\"\n",
307
  "\n",
308
  "model.save_pretrained_merged(\n",
309
  " OUTPUT_DIR,\n",
310
  " tokenizer,\n",
311
  " save_method=\"merged_16bit\",\n",
312
  ")\n",
313
+ "print(f\"Model saved to: {OUTPUT_DIR}\")\n"
314
  ]
315
  },
316
  {
 
364
  "ax.plot(steps[:len(losses)], losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
365
  "ax.set_xlabel(\"Training Step\", fontsize=12)\n",
366
  "ax.set_ylabel(\"Loss\", fontsize=12)\n",
367
+ "ax.set_title(\"GRPO Training Loss \u00e2\u20ac\u201d RhythmEnv Life Simulator\", fontsize=14)\n",
368
  "ax.grid(True, alpha=0.3)\n",
369
  "plt.tight_layout()\n",
370
  "plt.savefig(\"plots/training_loss.png\", dpi=150)\n",
 
380
  " import numpy as np\n",
381
  " r = np.array(rewards)\n",
382
  " s = np.array(reward_stds)\n",
383
+ " ax.fill_between(reward_steps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"\u00c2\u00b11 Std Dev\")\n",
384
  " ax.set_xlabel(\"Training Step\", fontsize=12)\n",
385
  " ax.set_ylabel(\"Mean Reward\", fontsize=12)\n",
386
+ " ax.set_title(\"GRPO Mean Reward \u00e2\u20ac\u201d RhythmEnv Life Simulator\", fontsize=14)\n",
387
  " ax.legend()\n",
388
  " ax.grid(True, alpha=0.3)\n",
389
  " plt.tight_layout()\n",
 
414
  "metadata": {},
415
  "outputs": [],
416
  "source": [
417
+ "OUTPUT_DIR = \"outputs/rhythmenv_meta_trained\"\n",
 
418
  "\n",
419
+ "# Full eval suite: discrete + continuous in-dist + continuous OOD\n",
420
+ "!python training/inference_eval.py --model_path {OUTPUT_DIR} --num_episodes 5 --output_file eval_results.json\n"
421
  ]
422
  },
423
  {
 
433
  "metadata": {},
434
  "outputs": [],
435
  "source": [
436
+ "# Comparison plot: 3 conditions x {random, heuristic, trained}\n",
437
+ "# Reads eval_results.json (written by inference_eval.py in cell 24)\n",
438
+ "import json, os\n",
439
  "import numpy as np\n",
440
+ "import matplotlib.pyplot as plt\n",
441
  "\n",
442
+ "with open(\"eval_results.json\") as f:\n",
443
+ " results = json.load(f)\n",
444
+ "\n",
445
+ "conditions = [\"discrete-3-profiles (legacy)\", \"continuous-in-distribution\", \"continuous-OOD (generalization)\"]\n",
446
+ "strategies = [\"random\", \"heuristic\", \"model\"]\n",
447
+ "\n",
448
+ "def avg(cond, strat, key=\"final_score\"):\n",
449
+ " rs = [r[key] for r in results if r[\"condition\"] == cond and r[\"strategy\"] == strat]\n",
450
+ " return float(np.mean(rs)) if rs else 0.0\n",
451
+ "\n",
452
+ "x = np.arange(len(conditions))\n",
453
+ "width = 0.27\n",
454
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
455
+ "\n",
456
+ "# Plot 1: final_score by condition\n",
457
+ "rand = [avg(c, \"random\") for c in conditions]\n",
458
+ "heur = [avg(c, \"heuristic\") for c in conditions]\n",
459
+ "trnd = [avg(c, \"model\") for c in conditions]\n",
460
+ "axes[0].bar(x - width, rand, width, label=\"Random\", color=\"#94a3b8\")\n",
461
+ "axes[0].bar(x, heur, width, label=\"Heuristic\", color=\"#60a5fa\")\n",
462
+ "axes[0].bar(x + width, trnd, width, label=\"Trained Qwen\", color=\"#22c55e\")\n",
463
+ "axes[0].set_ylabel(\"Final score (0-1)\")\n",
464
+ "axes[0].set_title(\"Final score by condition\")\n",
465
+ "axes[0].set_xticks(x)\n",
466
+ "axes[0].set_xticklabels([c.split(\" \")[0] for c in conditions], fontsize=10)\n",
467
+ "axes[0].legend()\n",
468
+ "axes[0].grid(axis=\"y\", alpha=0.3)\n",
469
+ "for xi, vals in enumerate(zip(rand, heur, trnd)):\n",
470
+ " for di, v in enumerate(vals):\n",
471
+ " axes[0].annotate(f\"{v:.3f}\", xy=(xi + (di-1)*width, v), xytext=(0, 3),\n",
472
+ " textcoords=\"offset points\", ha=\"center\", fontsize=8)\n",
473
+ "\n",
474
+ "# Plot 2: adaptation by condition (THE meta-RL signal)\n",
475
+ "rand_a = [avg(c, \"random\", \"adaptation\") for c in conditions]\n",
476
+ "heur_a = [avg(c, \"heuristic\", \"adaptation\") for c in conditions]\n",
477
+ "trnd_a = [avg(c, \"model\", \"adaptation\") for c in conditions]\n",
478
+ "axes[1].bar(x - width, rand_a, width, label=\"Random\", color=\"#94a3b8\")\n",
479
+ "axes[1].bar(x, heur_a, width, label=\"Heuristic\", color=\"#60a5fa\")\n",
480
+ "axes[1].bar(x + width, trnd_a, width, label=\"Trained Qwen\", color=\"#22c55e\")\n",
481
+ "axes[1].set_ylabel(\"Adaptation (late-half mean - early-half mean)\")\n",
482
+ "axes[1].set_title(\"Adaptation: did agent get better mid-episode?\")\n",
483
+ "axes[1].set_xticks(x)\n",
484
+ "axes[1].set_xticklabels([c.split(\" \")[0] for c in conditions], fontsize=10)\n",
485
+ "axes[1].axhline(0, color=\"k\", linewidth=0.5)\n",
486
+ "axes[1].legend()\n",
487
+ "axes[1].grid(axis=\"y\", alpha=0.3)\n",
488
  "\n",
489
  "plt.tight_layout()\n",
490
+ "os.makedirs(\"plots\", exist_ok=True)\n",
491
  "plt.savefig(\"plots/baseline_vs_trained.png\", dpi=150)\n",
492
  "plt.show()\n",
493
  "print(\"Saved: plots/baseline_vs_trained.png\")\n",
494
  "\n",
495
  "# Print summary table\n",
496
+ "print()\n",
497
+ "print(f\"{'Condition':<40} {'Random':>10} {'Heuristic':>10} {'Trained':>10} {'vs Heuristic':>14}\")\n",
498
+ "print(\"-\" * 90)\n",
499
+ "for c, r, h, t in zip(conditions, rand, heur, trnd):\n",
500
+ " print(f\"{c:<40} {r:>10.3f} {h:>10.3f} {t:>10.3f} {(t - h):>+14.3f}\")\n"
501
  ]
502
  },
503
  {
 
520
  "source": [
521
  "# List all generated artifacts\n",
522
  "print(\"Generated files:\")\n",
523
+ "for f in sorted(os.listdir(\"plots\")):\n",
524
  " size = os.path.getsize(f\"plots/{f}\")\n",
525
  " print(f\" plots/{f} ({size:,} bytes)\")\n",
526
  "\n",
527
+ "print(f\"\\nModel saved at: outputs/rhythmenv_meta_trained/\")\n",
528
  "print(\"\\nNext steps:\")\n",
529
  "print(\"1. Download plots/*.png and commit to your repo\")\n",
530
  "print(\"2. Embed plots in README.md\")\n",
531
+ "print(\"3. Push trained model to HF Space if you want it live\")\n"
532
  ]
533
  },
534
  {
 
538
  "outputs": [],
539
  "source": [
540
  "# Optional: zip everything for easy download\n",
541
+ "!zip -r /content/rhythmenv_training_artifacts.zip plots/ outputs/rhythmenv_meta_trained/training_config.json eval_results.json eval_baselines_meta.json\n",
542
+ "print(\"Download: /content/rhythmenv_training_artifacts.zip\")\n"
543
  ]
544
  }
545
  ],
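
Both env_reward and belief_accuracy (defined in training/reward_functions.py below) depend on seed-replay being exactly deterministic. A minimal self-check, assuming only the RhythmEnvironment API used in these diffs (reset(seed=..., profile_mode=...) and step(...).reward):

    from server.rhythm_environment import RhythmEnvironment
    from models import ActionType, RhythmAction

    def replay_rewards(seed: int, actions: list[str], profile_mode: str = "continuous") -> list[float]:
        # Rebuild the episode from scratch and record each step's reward.
        env = RhythmEnvironment()
        env.reset(seed=seed, profile_mode=profile_mode)
        return [env.step(RhythmAction(action_type=ActionType(a))).reward for a in actions]

    actions = [ActionType.DEEP_WORK.value, ActionType.SLEEP.value, ActionType.SOCIALIZE.value]
    assert replay_rewards(0, actions) == replay_rewards(0, actions)  # replays must be identical
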
training/dataset.py CHANGED
@@ -1,8 +1,17 @@
1
  """
2
- Dataset generator for RhythmEnv GRPO training.
3
 
4
  Generates observation prompts by playing episodes with random/heuristic
5
- strategies, capturing diverse states across all profiles and time slots.
 
 
 
 
 
 
 
 
 
6
  """
7
 
8
  import sys
@@ -18,32 +27,63 @@ SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
18
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
19
 
20
  SYSTEM_PROMPT = (
21
- "You are a life management agent. You balance 5 life meters over a 7-day week.\n"
22
- "Each day has 4 slots: Morning, Afternoon, Evening, Night. Total: 28 steps.\n\n"
23
- "Meters (0.0 to 1.0):\n"
24
- "- Vitality: physical energy, sleep quality\n"
25
- "- Cognition: mental clarity, focus\n"
26
- "- Progress: career/skill advancement\n"
27
- "- Serenity: inner peace, stress management\n"
28
- "- Connection: relationship health\n\n"
29
- "Available actions: DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, "
30
- "FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
31
- "Key rules:\n"
32
- "- Low Vitality reduces effectiveness of all positive actions.\n"
33
- "- If ANY meter drops below 0.1, you get a heavy penalty.\n"
34
- "- Connection decays passively if you don't socialize.\n"
35
- "- Different episodes may have different hidden dynamics - adapt!\n"
36
- "- Time of day matters for cognitive tasks.\n\n"
37
- "Respond with ONLY the action name, nothing else."
 
 
 
 
 
 
38
  )
39
 
40
 
41
- def format_observation_prompt(obs) -> str:
42
- """Format an observation into a user prompt for the LLM."""
 
 
 
 
 
43
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
44
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
45
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
46
 
 
 
 
47
  return (
48
  f"Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})\n"
49
  f"Remaining steps: {obs.remaining_steps}\n\n"
@@ -53,20 +93,38 @@ def format_observation_prompt(obs) -> str:
53
  f" Progress: {obs.progress:.2f}\n"
54
  f" Serenity: {obs.serenity:.2f}\n"
55
  f" Connection: {obs.connection:.2f}"
56
- f"{event_str}\n\n"
57
- f"Choose your action:"
 
 
58
  )
59
 
60
 
61
- def generate_episode_samples(seed: int, strategy: str = "random") -> list:
62
- """
63
- Play one episode and return a list of (prompt, seed, step_index, action_sequence) tuples.
 
 
 
 
 
 
 
 
64
 
65
- Each sample captures the state at a specific step, along with enough info
66
- to replay to that point for reward evaluation.
 
 
 
 
 
 
 
67
  """
68
  env = RhythmEnvironment()
69
- obs = env.reset(seed=seed)
 
70
  rng = random.Random(seed + 1000)
71
  actions_taken = []
72
  samples = []
@@ -76,10 +134,8 @@ def generate_episode_samples(seed: int, strategy: str = "random") -> list:
76
  if obs.done:
77
  break
78
 
79
- # Generate prompt from current observation
80
- prompt = format_observation_prompt(obs)
81
 
82
- # Record sample with replay info
83
  samples.append({
84
  "prompt": [
85
  {"role": "system", "content": SYSTEM_PROMPT},
@@ -87,10 +143,11 @@ def generate_episode_samples(seed: int, strategy: str = "random") -> list:
87
  ],
88
  "seed": seed,
89
  "step_index": step,
90
- "action_history": list(actions_taken), # copy for replay
 
 
91
  })
92
 
93
- # Choose action based on strategy
94
  if strategy == "random":
95
  action_type = rng.choice(all_actions)
96
  elif strategy == "heuristic":
@@ -126,7 +183,6 @@ def _heuristic_action(obs) -> ActionType:
126
  if p < 0.3 and v > 0.3:
127
  return ActionType.LEARN
128
  return ActionType.ADMIN_WORK
129
- # evening
130
  if cn < 0.4:
131
  return ActionType.SOCIALIZE
132
  if s < 0.5:
@@ -138,19 +194,24 @@ def generate_dataset(
138
  num_episodes: int = 200,
139
  strategy: str = "mixed",
140
  max_samples: int = 2000,
 
 
141
  ) -> list:
142
- """
143
- Generate a training dataset by playing multiple episodes.
 
 
 
144
 
145
  Args:
146
  num_episodes: Number of episodes to play.
147
  strategy: "random", "heuristic", or "mixed" (alternating).
148
  max_samples: Maximum samples to return.
149
-
150
- Returns:
151
- List of prompt dicts suitable for HF Dataset.from_list().
152
  """
153
  all_samples = []
 
154
 
155
  for i in range(num_episodes):
156
  seed = i
@@ -158,23 +219,40 @@ def generate_dataset(
158
  s = "heuristic" if i % 2 == 0 else "random"
159
  else:
160
  s = strategy
161
-
162
- episode_samples = generate_episode_samples(seed=seed, strategy=s)
 
 
 
 
 
 
163
  all_samples.extend(episode_samples)
164
 
165
  if len(all_samples) >= max_samples:
166
  break
167
 
168
- # Shuffle and truncate
169
  random.shuffle(all_samples)
170
  all_samples = all_samples[:max_samples]
171
 
172
- print(f"Generated {len(all_samples)} training samples from {min(i+1, num_episodes)} episodes")
 
 
 
 
173
  return all_samples
174
 
175
 
176
  if __name__ == "__main__":
177
- samples = generate_dataset(num_episodes=50, strategy="mixed", max_samples=500)
178
- print(f"\nFirst sample prompt:")
179
- print(samples[0]["prompt"][1]["content"])
180
- print(f"\nSeed: {samples[0]['seed']}, Step: {samples[0]['step_index']}")
 
 
 
 
 
 
 
 
1
  """
2
+ Dataset generator for RhythmEnv GRPO training (meta-RL version).
3
 
4
  Generates observation prompts by playing episodes with random/heuristic
5
+ strategies, capturing diverse states across continuously-sampled profiles.
6
+
7
+ Key changes vs single-task version:
8
+ - Default profile_mode is "continuous" (sampled profile per episode)
9
+ - System prompt asks for ACTION_NAME + 3 belief digits
10
+ - Curriculum support: a fraction of samples include the profile hint in the obs
11
+ (the training setup shows the hint early and masks it later, teaching the
12
+ *skill* of using profile signals before forcing inference)
13
+ - Dataset rows include seed, step_index, action_history, profile_mode
14
+ so env_reward and belief_accuracy can replay deterministically
15
  """
16
 
17
  import sys
 
27
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
28
 
29
  SYSTEM_PROMPT = (
30
+ "You are a life-management agent helping a person whose preferences are HIDDEN.\n"
31
+ "You see 5 life meters and a rolling 7-step history. The same action affects\n"
32
+ "different people differently — you must INFER who you're helping from the\n"
33
+ "rewards and meter changes you observe.\n\n"
34
+ "Each step, output ONE LINE in this exact format:\n"
35
+ " ACTION_NAME S M W\n\n"
36
+ "where ACTION_NAME is one of:\n"
37
+ " DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,\n"
38
+ " FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
39
+ "and S, M, W are single digits (0-9) representing your current belief:\n"
40
+ " S = social preference (0=hates being social, 9=loves being social)\n"
41
+ " M = morning preference (0=night owl, 9=morning person)\n"
42
+ " W = work preference (0=avoids work, 9=workaholic)\n\n"
43
+ "Example: DEEP_WORK 3 8 7\n\n"
44
+ "Tips:\n"
45
+ "- Update your belief based on rewards: if SOCIALIZE gave a big positive reward,\n"
46
+ " raise S; if it tanked vitality without a reward, lower S.\n"
47
+ "- Early in the week, PROBE different actions to learn the person.\n"
48
+ "- Late in the week, EXPLOIT what you've learned — pick actions matching\n"
49
+ " the person's preferences (use your belief to guide).\n"
50
+ "- Watch for crashes: any meter under 0.1 → big penalty.\n"
51
+ "- Connection decays passively — actively maintain it.\n"
52
+ "Respond with ONLY the format line, no other text."
53
  )
54
 
55
 
56
+ def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
57
+ """Format an observation into a user prompt for the LLM.
58
+
59
+ If profile_hint is provided (curriculum's "visible" phase), include it in
60
+ the prompt so the agent learns the *skill* of using profile signals
61
+ before having to infer them from scratch.
62
+ """
63
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
64
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
65
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
66
 
67
+ history_lines = []
68
+ for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
69
+ history_lines.append(
70
+ f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
71
+ f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
72
+ f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
73
+ )
74
+ history_str = ""
75
+ if history_lines:
76
+ history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
77
+
78
+ hint_str = ""
79
+ if profile_hint is not None:
80
+ hint_str = (
81
+ f"\n\nKnown about this person (training hint):\n"
82
+ f" social_pref={profile_hint['social_pref']:.2f}, "
83
+ f"morning_pref={profile_hint['morning_pref']:.2f}, "
84
+ f"work_pref={profile_hint['work_pref']:.2f}"
85
+ )
86
+
87
  return (
88
  f"Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})\n"
89
  f"Remaining steps: {obs.remaining_steps}\n\n"
 
93
  f" Progress: {obs.progress:.2f}\n"
94
  f" Serenity: {obs.serenity:.2f}\n"
95
  f" Connection: {obs.connection:.2f}"
96
+ f"{event_str}"
97
+ f"{history_str}"
98
+ f"{hint_str}\n\n"
99
+ f"Choose your action (format: ACTION S M W):"
100
  )
101
 
102
 
103
+ def generate_episode_samples(
104
+ seed: int,
105
+ strategy: str = "random",
106
+ profile_mode: str = "continuous",
107
+ show_profile_hint: bool = False,
108
+ ) -> list:
109
+ """Play one episode and return a list of training samples.
110
+
111
+ Each sample includes the prompt + replay metadata (seed, step_index,
112
+ action_history, profile_mode) so reward functions can deterministically
113
+ reconstruct the env state.
114
 
115
+ Args:
116
+ seed: Episode seed (also determines profile when profile_mode=continuous).
117
+ strategy: "random" or "heuristic" — used to roll out the episode for
118
+ state diversity. The agent's training generations replace these
119
+ actions; we only need the prefix history for replay.
120
+ profile_mode: "continuous" (sampled per seed) or "discrete" (1 of 3
121
+ hardcoded profiles).
122
+ show_profile_hint: If True, include the true belief vector in the prompt.
123
+ Use during the curriculum's "visible" warmup phase.
124
  """
125
  env = RhythmEnvironment()
126
+ obs = env.reset(seed=seed, profile_mode=profile_mode)
127
+ profile_hint = env.get_profile_hint() if show_profile_hint else None
128
  rng = random.Random(seed + 1000)
129
  actions_taken = []
130
  samples = []
 
134
  if obs.done:
135
  break
136
 
137
+ prompt = format_observation_prompt(obs, profile_hint=profile_hint)
 
138
 
 
139
  samples.append({
140
  "prompt": [
141
  {"role": "system", "content": SYSTEM_PROMPT},
 
143
  ],
144
  "seed": seed,
145
  "step_index": step,
146
+ "action_history": list(actions_taken),
147
+ "profile_mode": profile_mode,
148
+ "show_profile_hint": show_profile_hint,
149
  })
150
 
 
151
  if strategy == "random":
152
  action_type = rng.choice(all_actions)
153
  elif strategy == "heuristic":
 
183
  if p < 0.3 and v > 0.3:
184
  return ActionType.LEARN
185
  return ActionType.ADMIN_WORK
 
186
  if cn < 0.4:
187
  return ActionType.SOCIALIZE
188
  if s < 0.5:
 
194
  num_episodes: int = 200,
195
  strategy: str = "mixed",
196
  max_samples: int = 2000,
197
+ profile_mode: str = "continuous",
198
+ hint_fraction: float = 0.2,
199
  ) -> list:
200
+ """Generate a training dataset by playing multiple episodes.
201
+
202
+ Curriculum is baked into the dataset: hint_fraction of samples have the
203
+ true profile visible (visible-phase warmup). After shuffle, GRPOTrainer
204
+ sees a mix early on; we can sort to put hint samples first if needed.
205
 
206
  Args:
207
  num_episodes: Number of episodes to play.
208
  strategy: "random", "heuristic", or "mixed" (alternating).
209
  max_samples: Maximum samples to return.
210
+ profile_mode: "continuous" (default, meta-RL) or "discrete" (3 profiles).
211
+ hint_fraction: Fraction of episodes to play with profile hint visible.
 
212
  """
213
  all_samples = []
214
+ n_hint_episodes = int(num_episodes * hint_fraction)
215
 
216
  for i in range(num_episodes):
217
  seed = i
 
219
  s = "heuristic" if i % 2 == 0 else "random"
220
  else:
221
  s = strategy
222
+ show_hint = i < n_hint_episodes
223
+
224
+ episode_samples = generate_episode_samples(
225
+ seed=seed,
226
+ strategy=s,
227
+ profile_mode=profile_mode,
228
+ show_profile_hint=show_hint,
229
+ )
230
  all_samples.extend(episode_samples)
231
 
232
  if len(all_samples) >= max_samples:
233
  break
234
 
235
+ # Shuffle (curriculum is per-sample via show_profile_hint flag, not order)
236
  random.shuffle(all_samples)
237
  all_samples = all_samples[:max_samples]
238
 
239
+ n_hint = sum(1 for s in all_samples if s["show_profile_hint"])
240
+ print(
241
+ f"Generated {len(all_samples)} samples from {min(i+1, num_episodes)} episodes "
242
+ f"({n_hint} with profile hint, {len(all_samples) - n_hint} without)"
243
+ )
244
  return all_samples
245
 
246
 
247
  if __name__ == "__main__":
248
+ samples = generate_dataset(num_episodes=20, strategy="mixed", max_samples=80, hint_fraction=0.5)
249
+ print(f"\nFirst sample (with hint):")
250
+ hinted = next((s for s in samples if s["show_profile_hint"]), None)
251
+ if hinted:
252
+ print(hinted["prompt"][1]["content"])
253
+ print(f"\nseed={hinted['seed']}, step={hinted['step_index']}, mode={hinted['profile_mode']}")
254
+
255
+ print(f"\nFirst sample (without hint):")
256
+ plain = next((s for s in samples if not s["show_profile_hint"]), None)
257
+ if plain:
258
+ print(plain["prompt"][1]["content"])
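
The generate_dataset docstring above notes that hinted samples can be sorted first instead of shuffled in. A sketch of that ordering (hypothetical usage; sample dicts as produced above), relying on Python's stable sort:

    from dataset import generate_dataset

    samples = generate_dataset(num_episodes=100, strategy="mixed", max_samples=1000, hint_fraction=0.15)
    samples.sort(key=lambda s: not s["show_profile_hint"])  # False < True, so hinted samples come first
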
training/inference_eval.py CHANGED
@@ -1,38 +1,46 @@
1
  """
2
- RhythmEnv Inference Evaluation — Baseline vs Trained comparison.
3
 
4
- Runs both a heuristic baseline and a trained model on all 3 profiles,
5
- producing a comparison table and score summary.
 
 
 
 
 
6
 
7
  Usage:
8
- # Heuristic baseline only:
9
  python training/inference_eval.py
10
 
11
  # With trained model:
12
  python training/inference_eval.py --model_path outputs/rhythmenv_trained
13
-
14
- # With remote API:
15
- python training/inference_eval.py --api_url https://api.example.com/v1 --api_key xxx
16
  """
17
 
18
  import argparse
19
  import json
20
  import os
 
21
  import sys
22
  from typing import Optional
23
 
24
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
25
 
26
  from models import ActionType, RhythmAction
27
- from server.rhythm_environment import RhythmEnvironment, MAX_STEPS
28
 
29
- PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
30
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
31
  DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
32
 
 
 
 
 
 
33
 
34
  def heuristic_action(obs) -> ActionType:
35
- """Priority-based heuristic baseline."""
36
  slot = obs.slot
37
  v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
38
 
@@ -60,21 +68,55 @@ def heuristic_action(obs) -> ActionType:
60
 
61
 
62
  def random_action(rng) -> ActionType:
63
- """Random baseline."""
64
- import random
65
  return rng.choice(list(ActionType))
66
 
67
 
68
- def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=None) -> dict:
69
- """Run a single episode and return results."""
70
- import random
 
 
 
 
71
  rng = random.Random(seed + 500)
72
 
73
  env = RhythmEnvironment()
74
- obs = env.reset(seed=seed, profile=profile)
 
 
 
 
 
 
75
 
76
  total_reward = 0.0
 
77
  actions_taken = []
 
78
 
79
  for step in range(MAX_STEPS):
80
  if obs.done:
@@ -85,7 +127,8 @@ def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=No
85
  elif strategy == "random":
86
  action_type = random_action(rng)
87
  elif strategy == "model" and model is not None:
88
- action_type = model_action(obs, model, tokenizer)
 
89
  else:
90
  action_type = random_action(rng)
91
 
@@ -93,64 +136,99 @@ def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=No
93
  actions_taken.append(action_type.value)
94
  obs = env.step(action)
95
  total_reward += obs.reward
 
96
 
97
  final_score = obs.reward_breakdown.get("final_score", 0.0)
98
 
 
 
 
 
 
 
 
 
 
 
 
 
99
  return {
100
- "profile": profile,
 
101
  "strategy": strategy,
102
  "seed": seed,
103
  "final_score": round(final_score, 4),
104
  "total_reward": round(total_reward, 2),
 
105
  "vitality": round(obs.vitality, 2),
106
  "cognition": round(obs.cognition, 2),
107
  "progress": round(obs.progress, 2),
108
  "serenity": round(obs.serenity, 2),
109
  "connection": round(obs.connection, 2),
110
  "actions": actions_taken,
 
 
 
111
  }
112
 
113
 
114
- def model_action(obs, model, tokenizer) -> ActionType:
115
- """Get action from trained model."""
116
- from training.dataset import format_observation_prompt, SYSTEM_PROMPT
117
- from training.reward_functions import extract_action
118
-
119
- prompt = format_observation_prompt(obs)
120
- messages = [
121
- {"role": "system", "content": SYSTEM_PROMPT},
122
- {"role": "user", "content": prompt},
123
- ]
124
-
125
- text = tokenizer.apply_chat_template(
126
- messages,
127
- tokenize=False,
128
- add_generation_prompt=True,
129
- )
130
-
131
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
132
- outputs = model.generate(
133
- **inputs,
134
- max_new_tokens=32,
135
- temperature=0.7,
136
- do_sample=True,
137
- )
138
- response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
139
 
140
- action_type = extract_action(response)
141
- return action_type if action_type is not None else ActionType.SLEEP
 
 
 
 
142
 
143
 
144
  def main():
145
- parser = argparse.ArgumentParser(description="Evaluate RhythmEnv agent")
146
  parser.add_argument("--model_path", type=str, default=None,
147
  help="Path to trained model (skip for baseline only)")
148
  parser.add_argument("--num_episodes", type=int, default=5,
149
- help="Episodes per profile per strategy")
150
- parser.add_argument("--output_file", type=str, default="eval_results.json",
151
- help="Output file for results")
 
 
 
152
  args = parser.parse_args()
153
 
 
 
 
 
 
 
 
 
 
154
  model, tokenizer = None, None
155
  strategies = ["heuristic", "random"]
156
 
@@ -169,54 +247,54 @@ def main():
169
  print(f"Warning: Could not load model: {e}")
170
  print("Running baseline-only evaluation.")
171
 
172
- # ---------------------------------------------------------------
173
- # Run evaluations
174
- # ---------------------------------------------------------------
175
  all_results = []
176
 
177
- for strategy in strategies:
178
- print(f"\n{'=' * 50}")
179
- print(f"Strategy: {strategy.upper()}")
180
- print(f"{'=' * 50}")
181
-
182
- for profile in PROFILES:
183
- scores = []
184
- for ep in range(args.num_episodes):
185
- seed = ep
186
- result = run_episode(
187
- profile=profile,
188
- seed=seed,
189
- strategy=strategy,
190
- model=model,
191
- tokenizer=tokenizer,
192
- )
193
- all_results.append(result)
194
- scores.append(result["final_score"])
195
-
196
- avg_score = sum(scores) / len(scores)
197
- print(f" {profile:25s} avg_score={avg_score:.3f} scores={[round(s,3) for s in scores]}")
198
-
199
- # ---------------------------------------------------------------
200
- # Summary table
201
- # ---------------------------------------------------------------
 
 
 
202
  print(f"\n{'=' * 70}")
203
- print("SUMMARY")
204
  print(f"{'=' * 70}")
205
- print(f"{'Profile':<25} {'Heuristic':>10} {'Random':>10}", end="")
206
- if "model" in strategies:
207
- print(f" {'Trained':>10}", end="")
208
  print()
209
  print("-" * 70)
210
-
211
- for profile in PROFILES:
212
- row = f"{profile:<25}"
213
- for strategy in strategies:
214
- profile_results = [r for r in all_results if r["profile"] == profile and r["strategy"] == strategy]
215
- avg = sum(r["final_score"] for r in profile_results) / len(profile_results)
216
- row += f" {avg:>10.3f}"
217
  print(row)
218
 
219
- # Save results
220
  with open(args.output_file, "w") as f:
221
  json.dump(all_results, f, indent=2)
222
  print(f"\nResults saved to: {args.output_file}")
 
1
  """
2
+ RhythmEnv Inference Evaluation — Baseline vs Trained, with meta-RL eval suite.
3
 
4
+ Three evaluation conditions:
5
+ 1. discrete-3-profiles: Original 3 hardcoded profiles for legacy comparison
6
+ (compare against eval_results_v1.json from the single-task version)
7
+ 2. continuous-in-distribution: Sampled profiles from the training distribution
8
+ (was the agent able to learn the meta-policy?)
9
+ 3. continuous-OOD: Profiles from a held-out region of the parameter space
10
+ (does the meta-policy generalize, or did the agent memorize?)
11
 
12
  Usage:
13
+ # Baselines only (no trained model):
14
  python training/inference_eval.py
15
 
16
  # With trained model:
17
  python training/inference_eval.py --model_path outputs/rhythmenv_trained
 
 
 
18
  """
19
 
20
  import argparse
21
  import json
22
  import os
23
+ import random
24
  import sys
25
  from typing import Optional
26
 
27
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
28
 
29
  from models import ActionType, RhythmAction
30
+ from server.rhythm_environment import RhythmEnvironment, MAX_STEPS, sample_profile, profile_to_belief_vector
31
 
32
+ DISCRETE_PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
33
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
34
  DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
35
 
36
+ # Seed ranges: training samples profiles from seeds [0, 200); the OOD region is [10000, 10030)
37
+ # (the 10000 offset pushes sample_profile into a held-out, statistically distinct parameter region)
38
+ IN_DIST_SEEDS_DEFAULT = list(range(100, 110)) # 10 in-distribution seeds (same profile distribution as training)
39
+ OOD_SEEDS_DEFAULT = list(range(10000, 10010)) # first 10 seeds of the held-out OOD region
40
+
41
 
42
  def heuristic_action(obs) -> ActionType:
43
+ """Priority-based heuristic baseline (profile-blind)."""
44
  slot = obs.slot
45
  v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
46
 
 
68
 
69
 
70
  def random_action(rng) -> ActionType:
 
 
71
  return rng.choice(list(ActionType))
72
 
73
 
74
+ def model_action(obs, model, tokenizer, return_belief: bool = False):
75
+ """Get action (and optionally belief) from trained model."""
76
+ from training.dataset import format_observation_prompt, SYSTEM_PROMPT
77
+ from training.reward_functions import extract_action_and_belief
78
+
79
+ prompt = format_observation_prompt(obs)
80
+ messages = [
81
+ {"role": "system", "content": SYSTEM_PROMPT},
82
+ {"role": "user", "content": prompt},
83
+ ]
84
+
85
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
86
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
87
+ outputs = model.generate(**inputs, max_new_tokens=20, temperature=0.7, do_sample=True)
88
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
89
+
90
+ action_type, belief, _ = extract_action_and_belief(response)
91
+ if action_type is None:
92
+ action_type = ActionType.SLEEP
93
+ return (action_type, belief) if return_belief else action_type
94
+
95
+
96
+ def run_episode(
97
+ seed: int,
98
+ strategy: str,
99
+ profile_mode: str = "continuous",
100
+ profile: Optional[str] = None,
101
+ model=None,
102
+ tokenizer=None,
103
+ ) -> dict:
104
+ """Run a single episode and return per-episode metrics."""
105
  rng = random.Random(seed + 500)
106
 
107
  env = RhythmEnvironment()
108
+ if profile is not None:
109
+ obs = env.reset(seed=seed, profile=profile)
110
+ else:
111
+ obs = env.reset(seed=seed, profile_mode=profile_mode)
112
+
113
+ true_belief = env.get_belief_target()
114
+ profile_name = env.state.profile_name
115
 
116
  total_reward = 0.0
117
+ step_rewards = []
118
  actions_taken = []
119
+ beliefs_seen = [] # for trained model
120
 
121
  for step in range(MAX_STEPS):
122
  if obs.done:
 
127
  elif strategy == "random":
128
  action_type = random_action(rng)
129
  elif strategy == "model" and model is not None:
130
+ action_type, belief = model_action(obs, model, tokenizer, return_belief=True)
131
+ beliefs_seen.append(belief)
132
  else:
133
  action_type = random_action(rng)
134
 
 
136
  actions_taken.append(action_type.value)
137
  obs = env.step(action)
138
  total_reward += obs.reward
139
+ step_rewards.append(obs.reward)
140
 
141
  final_score = obs.reward_breakdown.get("final_score", 0.0)
142
 
143
+ # Adaptation: late-half mean minus early-half mean
144
+ half = max(len(step_rewards) // 2, 1)
145
+ early = step_rewards[:half]
146
+ late = step_rewards[half:]
147
+ adaptation = (sum(late) / len(late) - sum(early) / len(early)) if (early and late) else 0.0
148
+
149
+ # Belief tracking (only for trained model)
150
+ final_belief = beliefs_seen[-1] if beliefs_seen else None
151
+ belief_mae = None
152
+ if final_belief is not None:
153
+ belief_mae = sum(abs(b - t) for b, t in zip(final_belief, true_belief)) / 3.0
154
+
155
  return {
156
+ "profile_name": profile_name,
157
+ "profile_mode": profile_mode if profile is None else "discrete",
158
  "strategy": strategy,
159
  "seed": seed,
160
  "final_score": round(final_score, 4),
161
  "total_reward": round(total_reward, 2),
162
+ "adaptation": round(adaptation, 3),
163
  "vitality": round(obs.vitality, 2),
164
  "cognition": round(obs.cognition, 2),
165
  "progress": round(obs.progress, 2),
166
  "serenity": round(obs.serenity, 2),
167
  "connection": round(obs.connection, 2),
168
  "actions": actions_taken,
169
+ "true_belief": [round(x, 3) for x in true_belief],
170
+ "final_belief": [round(x, 3) for x in final_belief] if final_belief is not None else None,
171
+ "belief_mae": round(belief_mae, 3) if belief_mae is not None else None,
172
  }
173
 
174
 
175
+ def eval_condition(
176
+ name: str,
177
+ strategies: list[str],
178
+ runs: list[dict],
179
+ model=None,
180
+ tokenizer=None,
181
+ ) -> list[dict]:
182
+ """Run an eval condition and print summary."""
183
+ print(f"\n{'=' * 60}")
184
+ print(f"Condition: {name}")
185
+ print(f"{'=' * 60}")
 
 
 
 
 
186
 
187
+ results = []
188
+ for strategy in strategies:
189
+ print(f"\n Strategy: {strategy.upper()}")
190
+ scores = []
191
+ adaptations = []
192
+ belief_maes = []
193
+ for run in runs:
194
+ r = run_episode(strategy=strategy, model=model, tokenizer=tokenizer, **run)
195
+ results.append({"condition": name, **r})
196
+ scores.append(r["final_score"])
197
+ adaptations.append(r["adaptation"])
198
+ if r["belief_mae"] is not None:
199
+ belief_maes.append(r["belief_mae"])
200
+ avg_score = sum(scores) / len(scores) if scores else 0.0
201
+ avg_adapt = sum(adaptations) / len(adaptations) if adaptations else 0.0
202
+ avg_mae = sum(belief_maes) / len(belief_maes) if belief_maes else None
203
+ line = f" avg_score={avg_score:.3f} avg_adaptation={avg_adapt:+.3f}"
204
+ if avg_mae is not None:
205
+ line += f" avg_belief_mae={avg_mae:.3f}"
206
+ print(line)
207
+ return results
208
 
209
 
210
  def main():
211
+ parser = argparse.ArgumentParser(description="Evaluate RhythmEnv agent (meta-RL eval suite)")
212
  parser.add_argument("--model_path", type=str, default=None,
213
  help="Path to trained model (skip for baseline only)")
214
  parser.add_argument("--num_episodes", type=int, default=5,
215
+ help="Episodes per condition per strategy (for discrete: per-profile)")
216
+ parser.add_argument("--output_file", type=str, default="eval_results.json")
217
+ parser.add_argument("--in_dist_seeds", type=str, default=None,
218
+ help="Comma-separated seeds for in-distribution eval")
219
+ parser.add_argument("--ood_seeds", type=str, default=None,
220
+ help="Comma-separated seeds for OOD eval")
221
  args = parser.parse_args()
222
 
223
+ in_dist_seeds = (
224
+ [int(s) for s in args.in_dist_seeds.split(",")] if args.in_dist_seeds
225
+ else IN_DIST_SEEDS_DEFAULT[:args.num_episodes * 2]
226
+ )
227
+ ood_seeds = (
228
+ [int(s) for s in args.ood_seeds.split(",")] if args.ood_seeds
229
+ else OOD_SEEDS_DEFAULT[:args.num_episodes * 2]
230
+ )
231
+
232
  model, tokenizer = None, None
233
  strategies = ["heuristic", "random"]
234
 
 
247
  print(f"Warning: Could not load model: {e}")
248
  print("Running baseline-only evaluation.")
249
 
 
 
 
250
  all_results = []
251
 
252
+ # Condition 1: Discrete 3 hardcoded profiles (legacy comparison)
253
+ discrete_runs = [
254
+ {"seed": ep, "profile": p, "profile_mode": "discrete"}
255
+ for p in DISCRETE_PROFILES for ep in range(args.num_episodes)
256
+ ]
257
+ all_results += eval_condition(
258
+ "discrete-3-profiles (legacy)",
259
+ strategies, discrete_runs,
260
+ model=model, tokenizer=tokenizer,
261
+ )
262
+
263
+ # Condition 2: In-distribution sampled profiles
264
+ in_dist_runs = [{"seed": s, "profile_mode": "continuous"} for s in in_dist_seeds]
265
+ all_results += eval_condition(
266
+ "continuous-in-distribution",
267
+ strategies, in_dist_runs,
268
+ model=model, tokenizer=tokenizer,
269
+ )
270
+
271
+ # Condition 3: OOD sampled profiles (the meta-learning generalization test)
272
+ ood_runs = [{"seed": s, "profile_mode": "continuous"} for s in ood_seeds]
273
+ all_results += eval_condition(
274
+ "continuous-OOD (generalization)",
275
+ strategies, ood_runs,
276
+ model=model, tokenizer=tokenizer,
277
+ )
278
+
279
+ # Per-profile breakdown for discrete (preserves legacy table format)
280
  print(f"\n{'=' * 70}")
281
+ print("DISCRETE-3-PROFILE BREAKDOWN")
282
  print(f"{'=' * 70}")
283
+ print(f"{'Profile':<25} ", end="")
284
+ for s in strategies:
285
+ print(f"{s:>10}", end="")
286
  print()
287
  print("-" * 70)
288
+ discrete = [r for r in all_results if r["condition"] == "discrete-3-profiles (legacy)"]
289
+ for profile in DISCRETE_PROFILES:
290
+ row = f"{profile:<25} "
291
+ for s in strategies:
292
+ rs = [r for r in discrete if r["profile_name"] == profile and r["strategy"] == s]
293
+ avg = sum(r["final_score"] for r in rs) / len(rs) if rs else 0.0
294
+ row += f"{avg:>10.3f}"
295
  print(row)
296
 
297
+ # Save
298
  with open(args.output_file, "w") as f:
299
  json.dump(all_results, f, indent=2)
300
  print(f"\nResults saved to: {args.output_file}")
training/reward_functions.py CHANGED
@@ -1,17 +1,29 @@
1
  """
2
- Reward functions for RhythmEnv GRPO training.
3
 
4
- Three-layer reward stack:
5
- 1. format_valid is the LLM output a parseable action name?
6
- 2. action_legal — is it one of the 10 valid actions?
7
- 3. env_reward — actual reward from stepping the environment
 
 
 
 
 
 
 
 
 
 
 
8
 
9
  Each function returns a list of floats (one per completion).
10
  """
11
 
12
- import sys
13
  import os
14
  import re
 
15
 
16
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
17
 
@@ -21,39 +33,107 @@ from server.rhythm_environment import RhythmEnvironment
21
  # All valid action names (uppercase for matching)
22
  VALID_ACTIONS = {at.value.upper(): at for at in ActionType}
23
 
 
 
24
 
25
- def extract_action(text: str) -> ActionType | None:
26
- """Parse LLM output into an ActionType. Returns None if unparseable."""
 
 
 
 
 
 
 
 
27
  if not text:
28
- return None
 
 
 
 
29
 
30
- # Clean up: take first line, strip whitespace, uppercase
31
- text = text.strip().split("\n")[0].strip().upper().replace(" ", "_")
32
 
33
- # Direct match
34
- if text in VALID_ACTIONS:
35
- return VALID_ACTIONS[text]
36
 
37
- # Partial match (e.g., "I choose DEEP_WORK" → DEEP_WORK)
38
- for name, at in VALID_ACTIONS.items():
39
- if name in text:
40
- return at
41
 
42
- return None
 
 
 
43
 
44
 
45
  def format_valid(completions, **kwargs) -> list[float]:
46
  """
47
- Layer 1: Does the LLM output contain a parseable action?
48
 
49
- +1.0 if valid action found, -2.0 if not.
50
- This teaches the model to output clean action names.
 
51
  """
52
  scores = []
53
  for completion in completions:
54
  response = completion[0]["content"] if isinstance(completion, list) else completion
55
- action = extract_action(response)
56
- scores.append(1.0 if action is not None else -2.0)
 
 
 
 
 
57
  return scores
58
 
59
 
@@ -61,9 +141,7 @@ def action_legal(completions, **kwargs) -> list[float]:
61
  """
62
  Layer 2: Is the parsed action one of the 10 valid actions?
63
 
64
- For our env, all 10 actions are always legal (no state-dependent validity).
65
- This is simpler than the workday scheduler where START_TASK required idle state.
66
-
67
  +0.5 if legal, -1.0 if not parseable.
68
  """
69
  scores = []
@@ -74,24 +152,32 @@ def action_legal(completions, **kwargs) -> list[float]:
74
  return scores
75
 
76
 
 
 
 
 
 
 
 
 
 
77
  def env_reward(
78
  completions,
79
  prompts=None,
80
  seed=None,
81
  step_index=None,
82
  action_history=None,
 
83
  **kwargs,
84
  ) -> list[float]:
85
  """
86
  Layer 3: Actual environment reward from stepping with the chosen action.
87
 
88
  Replays the episode to the correct state, then steps with the model's action.
89
- Returns the raw env reward (scaled).
90
-
91
- This is the core training signal — the model learns which actions
92
- produce higher rewards for different states and hidden profiles.
93
 
94
  seed/step_index/action_history are passed as dataset columns by GRPOTrainer.
 
95
  """
96
  scores = []
97
 
@@ -103,48 +189,87 @@ def env_reward(
103
  scores.append(-3.0)
104
  continue
105
 
106
- # Prefer direct dataset column kwargs (passed by GRPOTrainer)
107
  if seed is not None and i < len(seed):
108
  ep_seed = seed[i]
109
  ep_history = action_history[i] if action_history is not None else []
 
110
  else:
111
- # Legacy: try to read from prompt dict (fallback for old-style prompts)
112
  prompt_data = prompts[i] if prompts and i < len(prompts) else None
113
  if prompt_data and isinstance(prompt_data, dict) and "seed" in prompt_data:
114
  ep_seed = prompt_data["seed"]
115
  ep_history = prompt_data.get("action_history", [])
 
116
  else:
117
- ep_seed = i % 50 # vary seed so training sees diverse states
118
  ep_history = []
 
119
 
120
  try:
121
- # Replay to the correct state
122
- env = RhythmEnvironment()
123
- env.reset(seed=ep_seed)
 
 
124
 
125
- for past_action_name in ep_history:
126
- past_action = RhythmAction(action_type=ActionType(past_action_name))
127
- env.step(past_action)
128
 
129
- # Step with the model's chosen action
130
- action = RhythmAction(action_type=action_type)
131
- obs = env.step(action)
132
 
133
- # Return the env reward (already in reasonable range from REWARD_SCALE)
134
- scores.append(obs.reward)
 
 
 
 
 
 
135
  except Exception:
136
- scores.append(-3.0)
137
 
138
  return scores
139
 
140
 
141
  def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
142
  """
143
- State-aware heuristic reward (no env replay).
144
-
145
- Parses meter values and time-of-day from the prompt text,
146
- then scores whether the chosen action addresses the most pressing need.
147
- Faster than env_reward but less precise.
148
  """
149
  scores = []
150
  for i, completion in enumerate(completions):
@@ -155,7 +280,6 @@ def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
155
  scores.append(-2.0)
156
  continue
157
 
158
- # Parse prompt for state info
159
  prompt_text = ""
160
  if prompts and i < len(prompts):
161
  p = prompts[i]
@@ -166,24 +290,20 @@ def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
166
  else:
167
  prompt_text = str(p)
168
 
169
- # Extract meter values with regex
170
  meters = {}
171
  for meter_name in ["Vitality", "Cognition", "Progress", "Serenity", "Connection"]:
172
  match = re.search(rf"{meter_name}:\s*([\d.]+)", prompt_text)
173
  if match:
174
  meters[meter_name.lower()] = float(match.group(1))
175
 
176
- # Extract time slot
177
  is_morning = "Morning" in prompt_text
178
  is_night = "Night" in prompt_text
179
 
180
  score = 0.0
181
-
182
  v = meters.get("vitality", 0.5)
183
  s = meters.get("serenity", 0.5)
184
  cn = meters.get("connection", 0.5)
185
 
186
- # Reward actions that address critically low meters
187
  if v < 0.2 and action_type in (ActionType.SLEEP, ActionType.EXERCISE):
188
  score += 0.5
189
  elif v < 0.2:
@@ -199,13 +319,11 @@ def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
199
  elif cn < 0.2:
200
  score -= 0.1
201
 
202
- # Time-appropriate bonuses
203
  if is_morning and action_type in (ActionType.DEEP_WORK, ActionType.LEARN):
204
  score += 0.2
205
  if is_night and action_type == ActionType.SLEEP:
206
  score += 0.3
207
 
208
- # Penalize binge
209
  if action_type == ActionType.BINGE_WATCH:
210
  score -= 0.3
211
 
 
1
  """
2
+ Reward functions for RhythmEnv GRPO training (meta-RL version).
3
 
4
+ Four-layer reward stack:
5
+ 1. format_valid does the LLM output have a parseable action + belief format?
6
+ 2. action_legal — is the action one of the 10 valid actions?
7
+ 3. env_reward — actual environment reward (seed-replay) for the chosen action
8
+ 4. belief_accuracy — how close is the belief vector to the hidden profile's true vector?
9
+
10
+ Action output format: "ACTION_NAME S M W"
11
+ - ACTION_NAME: one of 10 valid actions
12
+ - S, M, W: single digits 0-9 representing the agent's belief about the user
13
+ S = social preference (0=hates social, 9=loves social)
14
+ M = morning preference (0=night owl, 9=morning person)
15
+ W = work preference (0=avoids work, 9=workaholic)
16
+
17
+ Example: "DEEP_WORK 3 8 7"
18
+ → action=DEEP_WORK, belief=[0.33, 0.89, 0.78]
19
 
20
  Each function returns a list of floats (one per completion).
21
  """
22
 
23
+ import math
24
  import os
25
  import re
26
+ import sys
27
 
28
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
29
 
 
33
  # All valid action names (uppercase for matching)
34
  VALID_ACTIONS = {at.value.upper(): at for at in ActionType}
35
 
36
+ # Default belief returned when the LLM doesn't provide one (neutral)
37
+ DEFAULT_BELIEF = [0.5, 0.5, 0.5]
38
 
39
+
40
+ def extract_action_and_belief(text: str) -> tuple[ActionType | None, list[float], bool]:
41
+ """Parse 'ACTION_NAME S M W' format.
42
+
43
+ Returns:
44
+ (action, belief, belief_provided):
45
+ action: parsed ActionType or None
46
+ belief: 3-dim vector in [0, 1], DEFAULT_BELIEF if not parseable
47
+ belief_provided: True iff at least one belief digit was parsed from output
48
+ """
49
  if not text:
50
+ return None, list(DEFAULT_BELIEF), False
51
+
52
+ line = text.strip().split("\n")[0].strip()
53
+ if not line:
54
+ return None, list(DEFAULT_BELIEF), False
55
+
56
+ # Normalize: uppercase, comma → space
57
+ parts = line.upper().replace(",", " ").split()
58
+ if not parts:
59
+ return None, list(DEFAULT_BELIEF), False
60
+
61
+ # Find action and its index in parts (try first token, then any token)
62
+ action: ActionType | None = None
63
+ action_idx = -1
64
+ for idx, p in enumerate(parts):
65
+ if p in VALID_ACTIONS:
66
+ action = VALID_ACTIONS[p]
67
+ action_idx = idx
68
+ break
69
+ if action is None:
70
+ for idx, p in enumerate(parts):
71
+ for name, at in VALID_ACTIONS.items():
72
+ if name in p:
73
+ action = at
74
+ action_idx = idx
75
+ break
76
+ if action is not None:
77
+ break
78
+
79
+ # Parse next 3 tokens AFTER the action as belief digits/floats
80
+ belief: list[float] = []
81
+ belief_provided = False
82
+ if action_idx >= 0:
83
+ for i in range(3):
84
+ j = action_idx + 1 + i
85
+ if j < len(parts):
86
+ p = parts[j].strip().rstrip(".")
87
+ if not p:
88
+ belief.append(0.5)
89
+ continue
90
+ try:
91
+ if len(p) == 1 and p.isdigit():
92
+ belief.append(int(p) / 9.0)
93
+ belief_provided = True
94
+ else:
95
+ val = float(p)
96
+ if val > 1.0:
97
+ val = val / 9.0 # interpret as 0-9 scale
98
+ belief.append(max(0.0, min(1.0, val)))
99
+ belief_provided = True
100
+ except (ValueError, IndexError):
101
+ belief.append(0.5)
102
+ else:
103
+ belief.append(0.5)
104
+ else:
105
+ belief = list(DEFAULT_BELIEF)
106
 
107
+ if not belief:
108
+ belief = list(DEFAULT_BELIEF)
109
 
110
+ return action, belief, belief_provided
 
 
111
 
 
 
 
 
112
 
113
+ def extract_action(text: str) -> ActionType | None:
114
+ """Backward-compatible action-only extractor (used by env_reward replay)."""
115
+ action, _, _ = extract_action_and_belief(text)
116
+ return action
117
 
118
 
119
  def format_valid(completions, **kwargs) -> list[float]:
120
  """
121
+ Layer 1: Is the output a parseable action + (optionally) belief digits?
122
 
123
+ +1.0 if action + 3 belief digits, +0.5 if action only, -2.0 if no action.
124
+ Graduated reward pushes the model toward emitting belief without harshly
125
+ punishing action-only output during early training.
126
  """
127
  scores = []
128
  for completion in completions:
129
  response = completion[0]["content"] if isinstance(completion, list) else completion
130
+ action, _, belief_provided = extract_action_and_belief(response)
131
+ if action is None:
132
+ scores.append(-2.0)
133
+ elif belief_provided:
134
+ scores.append(1.0)
135
+ else:
136
+ scores.append(0.5)
137
  return scores
138
 
139
 
 
141
  """
142
  Layer 2: Is the parsed action one of the 10 valid actions?
143
 
144
+ All 10 actions are always legal in this env (no state-dependent validity).
 
 
145
  +0.5 if legal, -1.0 if not parseable.
146
  """
147
  scores = []
 
152
  return scores
153
 
154
 
155
+ def _replay_env(seed: int, history: list, profile_mode: str = "continuous") -> RhythmEnvironment:
156
+ """Build a replayed env at the given seed/history/mode."""
157
+ env = RhythmEnvironment()
158
+ env.reset(seed=seed, profile_mode=profile_mode)
159
+ for past_action_name in history:
160
+ env.step(RhythmAction(action_type=ActionType(past_action_name)))
161
+ return env
162
+
163
+
164
  def env_reward(
165
  completions,
166
  prompts=None,
167
  seed=None,
168
  step_index=None,
169
  action_history=None,
170
+ profile_mode=None,
171
  **kwargs,
172
  ) -> list[float]:
173
  """
174
  Layer 3: Actual environment reward from stepping with the chosen action.
175
 
176
  Replays the episode to the correct state, then steps with the model's action.
177
+ Returns the raw env reward (already scaled by REWARD_SCALE).
 
 
 
178
 
179
  seed/step_index/action_history are passed as dataset columns by GRPOTrainer.
180
+ profile_mode column controls whether replay uses continuous or discrete profile.
181
  """
182
  scores = []
183
 
 
189
  scores.append(-3.0)
190
  continue
191
 
192
+ # Per-row dataset values (preferred path)
193
  if seed is not None and i < len(seed):
194
  ep_seed = seed[i]
195
  ep_history = action_history[i] if action_history is not None else []
196
+ ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
197
  else:
 
198
  prompt_data = prompts[i] if prompts and i < len(prompts) else None
199
  if prompt_data and isinstance(prompt_data, dict) and "seed" in prompt_data:
200
  ep_seed = prompt_data["seed"]
201
  ep_history = prompt_data.get("action_history", [])
202
+ ep_mode = prompt_data.get("profile_mode", "continuous")
203
  else:
204
+ ep_seed = i % 50
205
  ep_history = []
206
+ ep_mode = "continuous"
207
 
208
  try:
209
+ env = _replay_env(ep_seed, ep_history, ep_mode)
210
+ obs = env.step(RhythmAction(action_type=action_type))
211
+ scores.append(obs.reward)
212
+ except Exception:
213
+ scores.append(-3.0)
214
 
215
+ return scores
 
 
216
 
 
 
 
217
 
218
+ def belief_accuracy(
219
+ completions,
220
+ prompts=None,
221
+ seed=None,
222
+ step_index=None,
223
+ action_history=None,
224
+ profile_mode=None,
225
+ **kwargs,
226
+ ) -> list[float]:
227
+ """
228
+ Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
229
+
230
+ Compares the agent's emitted [social, morning, work] belief vector to the
231
+ hidden profile's true belief vector. Reward in [-0.5, +0.5]:
232
+ perfect match → +0.5
233
+ neutral [0.5,0.5,0.5] → 0.0 (zero-effort baseline)
234
+ max wrong → -0.5
235
+
236
+ Mean-absolute-error based (cleaner than cosine for [0,1] vectors).
237
+ Skipped (returns 0) if no seed available — keeps reward conservative.
238
+ """
239
+ scores = []
240
+ for i, completion in enumerate(completions):
241
+ response = completion[0]["content"] if isinstance(completion, list) else completion
242
+ _, belief, belief_provided = extract_action_and_belief(response)
243
+
244
+ if not belief_provided:
245
+ scores.append(-0.2) # weak push toward emitting belief
246
+ continue
247
+
248
+ # Resolve seed/mode for replay
249
+ if seed is not None and i < len(seed):
250
+ ep_seed = seed[i]
251
+ ep_history = action_history[i] if action_history is not None else []
252
+ ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
253
+ else:
254
+ scores.append(0.0)
255
+ continue
256
+
257
+ try:
258
+ env = _replay_env(ep_seed, ep_history, ep_mode)
259
+ true_belief = env.get_belief_target()
260
+ mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
261
+ similarity = 1.0 - mae # in [0, 1]
262
+ scores.append(similarity - 0.5) # in [-0.5, +0.5]
263
  except Exception:
264
+ scores.append(0.0)
265
 
266
  return scores


 def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
     """
+    State-aware heuristic reward (no env replay), used for fast smoke-training.
+    Identical to the original simple reward — the format change is handled by extract_action.
     """
     scores = []
     for i, completion in enumerate(completions):

             scores.append(-2.0)
             continue

         prompt_text = ""
         if prompts and i < len(prompts):
             p = prompts[i]

             else:
                 prompt_text = str(p)

         meters = {}
         for meter_name in ["Vitality", "Cognition", "Progress", "Serenity", "Connection"]:
             match = re.search(rf"{meter_name}:\s*([\d.]+)", prompt_text)
             if match:
                 meters[meter_name.lower()] = float(match.group(1))

         is_morning = "Morning" in prompt_text
         is_night = "Night" in prompt_text

         score = 0.0
         v = meters.get("vitality", 0.5)
         s = meters.get("serenity", 0.5)
         cn = meters.get("connection", 0.5)

         if v < 0.2 and action_type in (ActionType.SLEEP, ActionType.EXERCISE):
             score += 0.5
         elif v < 0.2:

         elif cn < 0.2:
             score -= 0.1

         if is_morning and action_type in (ActionType.DEEP_WORK, ActionType.LEARN):
             score += 0.2
         if is_night and action_type == ActionType.SLEEP:
             score += 0.3

         if action_type == ActionType.BINGE_WATCH:
             score -= 0.3
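Both replay-based layers call extract_action_and_belief, which is defined near the top of reward_functions.py and outside this hunk. A hypothetical sketch of the parsing contract for the one-line "ACTION_NAME D D D" completion format (the regex and the digit-to-[0,1] rescaling are illustrative assumptions, not the repo's exact rules):

    import re

    # Hypothetical sketch; the real parser lives earlier in this file.
    def extract_action_and_belief(text):
        m = re.match(r"([A-Z_]+)(?:\s+(\d)\s+(\d)\s+(\d))?", text.strip().upper())
        if not m:
            return None, None, False            # unparseable completion
        action = m.group(1)
        if m.group(2) is None:
            return action, None, False          # action present, belief digits missing
        belief = [int(d) / 9.0 for d in m.groups()[1:]]  # digit 0-9 -> [0, 1]
        return action, belief, True

The third element of the triple is what belief_accuracy keys on: it only scores completions whose belief digits were actually present.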
training/train.py CHANGED
@@ -1,15 +1,20 @@
 """
-RhythmEnv GRPO Training Script
+RhythmEnv GRPO Training Script (Meta-RL version).

-Trains an LLM agent to balance life meters using Group Relative Policy
-Optimization (GRPO) with Unsloth for efficient fine-tuning.
+Trains an LLM agent to BOTH (a) balance life meters AND (b) infer the hidden
+personality of the person it's helping. Four-layer reward stack:

-Usage (Colab with GPU):
+    format_valid    — output parseable as ACTION + 3 belief digits
+    action_legal    — action is one of 10 valid types
+    env_reward      — actual env reward for the chosen action (seed replay)
+    belief_accuracy — how close the belief vector is to the hidden profile
+
+Usage (Colab T4):
     !pip install unsloth transformers trl datasets
-    !python training/train.py
+    !python training/train.py --max_steps 1500

-Usage (local):
-    python training/train.py --max_steps 100 --model_name "unsloth/Qwen2.5-3B-Instruct"
+Usage (local — for setup checks; needs CUDA for actual training):
+    python training/train.py --max_steps 50 --use_simple_reward
 """

 import argparse
@@ -21,32 +26,37 @@ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


 def main():
-    parser = argparse.ArgumentParser(description="Train RhythmEnv agent with GRPO")
-    parser.add_argument("--model_name", type=str, default="unsloth/Qwen2.5-3B-Instruct",
-                        help="Model to fine-tune (HF model ID)")
-    parser.add_argument("--max_steps", type=int, default=500,
-                        help="Number of GRPO training steps")
-    parser.add_argument("--num_episodes", type=int, default=200,
-                        help="Number of episodes for dataset generation")
-    parser.add_argument("--max_samples", type=int, default=2000,
+    parser = argparse.ArgumentParser(description="Train RhythmEnv agent with GRPO (meta-RL)")
+    parser.add_argument("--model_name", type=str, default="unsloth/Qwen2.5-3B-Instruct")
+    parser.add_argument("--max_steps", type=int, default=1500,
+                        help="Number of GRPO training steps (1500 recommended for meta-RL)")
+    parser.add_argument("--num_episodes", type=int, default=300,
+                        help="Number of episodes for dataset generation (more diversity = better meta-RL)")
+    parser.add_argument("--max_samples", type=int, default=3000,
                         help="Maximum training samples")
     parser.add_argument("--num_generations", type=int, default=4,
-                        help="Number of completions per prompt for GRPO")
-    parser.add_argument("--learning_rate", type=float, default=5e-5,
-                        help="Learning rate")
-    parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_trained",
-                        help="Output directory for model and logs")
+                        help="Completions per prompt for GRPO (higher = lower variance, more compute)")
+    parser.add_argument("--learning_rate", type=float, default=5e-5)
+    parser.add_argument("--beta", type=float, default=0.1,
+                        help="KL penalty (raise to 0.2 if training is unstable)")
+    parser.add_argument("--lora_rank", type=int, default=8,
+                        help="LoRA rank (8 = more capacity than original 4 for meta-RL)")
+    parser.add_argument("--hint_fraction", type=float, default=0.15,
+                        help="Fraction of dataset with profile hint visible (curriculum warmup)")
+    parser.add_argument("--profile_mode", type=str, default="continuous",
+                        choices=["continuous", "discrete"],
+                        help="continuous = sampled per-episode (meta-RL); discrete = 3 hardcoded profiles")
+    parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_meta_trained")
     parser.add_argument("--use_simple_reward", action="store_true",
-                        help="Use simplified reward (no env replay, faster)")
-    parser.add_argument("--report_to", type=str, default="none",
-                        help="Logging: 'wandb', 'trackio', or 'none'")
+                        help="Use heuristic reward instead of env-replay (smoke testing)")
+    parser.add_argument("--report_to", type=str, default="none")
     args = parser.parse_args()

     # ---------------------------------------------------------------
     # 1. Generate dataset
     # ---------------------------------------------------------------
     print("=" * 60)
-    print("Step 1: Generating training dataset...")
+    print("Step 1: Generating training dataset (continuous profiles)")
     print("=" * 60)

     from dataset import generate_dataset
@@ -56,15 +66,18 @@ def main():
         num_episodes=args.num_episodes,
         strategy="mixed",
         max_samples=args.max_samples,
+        profile_mode=args.profile_mode,
+        hint_fraction=args.hint_fraction,
     )

-    # Include replay metadata so env_reward can reconstruct the exact state
+    # Replay metadata so env_reward + belief_accuracy can reconstruct state
     dataset = Dataset.from_list([
         {
             "prompt": sample["prompt"],
             "seed": sample["seed"],
             "step_index": sample["step_index"],
             "action_history": sample["action_history"],
+            "profile_mode": sample["profile_mode"],
         }
         for sample in raw_samples
     ])
@@ -74,13 +87,12 @@ def main():
     # 2. Load model with Unsloth
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 2: Loading model...")
+    print(f"Step 2: Loading model {args.model_name}")
     print("=" * 60)

     from unsloth import FastLanguageModel

-    max_seq_length = 768
-    lora_rank = 4
+    max_seq_length = 1024  # bumped from 768 to fit longer prompts with history

     model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=args.model_name,
@@ -90,51 +102,53 @@ def main():

     model = FastLanguageModel.get_peft_model(
         model,
-        r=lora_rank,
+        r=args.lora_rank,
         target_modules=[
             "q_proj", "k_proj", "v_proj", "o_proj",
             "gate_proj", "up_proj", "down_proj",
         ],
-        lora_alpha=lora_rank * 2,
+        lora_alpha=args.lora_rank * 2,
         use_gradient_checkpointing="unsloth",
         random_state=3407,
     )

-    print(f"Model loaded: {args.model_name}")
+    print(f"LoRA rank: {args.lora_rank}, alpha: {args.lora_rank * 2}")

     # ---------------------------------------------------------------
-    # 3. Set up reward functions
+    # 3. Reward functions (4-layer stack including belief_accuracy)
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 3: Setting up reward functions...")
+    print("Step 3: Setting up reward functions")
     print("=" * 60)

-    from reward_functions import format_valid, action_legal, env_reward, env_reward_simple
+    from reward_functions import (
+        format_valid, action_legal, env_reward, env_reward_simple, belief_accuracy
+    )

     if args.use_simple_reward:
-        reward_funcs = [format_valid, action_legal, env_reward_simple]
-        print("Using: format_valid + action_legal + env_reward_simple (fast mode)")
+        reward_funcs = [format_valid, action_legal, env_reward_simple, belief_accuracy]
+        print("Using: format_valid + action_legal + env_reward_simple + belief_accuracy")
     else:
-        reward_funcs = [format_valid, action_legal, env_reward]
-        print("Using: format_valid + action_legal + env_reward (full replay)")
+        reward_funcs = [format_valid, action_legal, env_reward, belief_accuracy]
+        print("Using: format_valid + action_legal + env_reward + belief_accuracy")

     # ---------------------------------------------------------------
-    # 4. Configure GRPO trainer
+    # 4. GRPO trainer config
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 4: Configuring GRPO trainer...")
+    print("Step 4: Configuring GRPO trainer")
     print("=" * 60)

     from trl import GRPOConfig, GRPOTrainer

-    max_prompt_length = 400
-    max_completion_length = 16  # Action names are 3-15 chars; cap prevents verbose drift
+    max_prompt_length = 600     # bumped from 400 for longer prompts (history + hint)
+    max_completion_length = 20  # bumped from 16 for "ACTION_NAME D D D" format

     training_args = GRPOConfig(
         temperature=1.0,
         learning_rate=args.learning_rate,
-        beta=0.1,           # KL penalty — higher = more conservative, prevents policy drift
-        max_grad_norm=0.5,  # Gradient clipping prevents large destabilizing updates
+        beta=args.beta,
+        max_grad_norm=0.5,
         weight_decay=0.001,
         warmup_ratio=0.1,
         lr_scheduler_type="linear",
@@ -146,20 +160,21 @@ def main():
         max_prompt_length=max_prompt_length,
         max_completion_length=max_completion_length,
         max_steps=args.max_steps,
-        save_steps=100,
+        save_steps=250,  # checkpoint every 250 (was 100)
         report_to=args.report_to,
         output_dir=args.output_dir,
     )

-    print(f"Max steps: {args.max_steps}")
-    print(f"Num generations: {args.num_generations}")
-    print(f"Learning rate: {args.learning_rate}")
+    print(f"max_steps={args.max_steps}, num_generations={args.num_generations}, "
+          f"lr={args.learning_rate}, beta={args.beta}")
+    print(f"max_prompt_length={max_prompt_length}, max_completion_length={max_completion_length}")
+    print(f"hint_fraction={args.hint_fraction} (curriculum warmup)")

     # ---------------------------------------------------------------
     # 5. Train
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 5: Starting GRPO training...")
+    print("Step 5: Starting GRPO training")
     print("=" * 60)

     trainer = GRPOTrainer(
@@ -173,31 +188,26 @@ def main():
     trainer.train()

     # ---------------------------------------------------------------
-    # 6. Save model
+    # 6. Save merged model
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 6: Saving model...")
+    print("Step 6: Saving model")
     print("=" * 60)

-    # Save merged model (correct way — don't upcast 4-bit naively)
     model.save_pretrained_merged(
         args.output_dir,
         tokenizer,
         save_method="merged_16bit",
     )
-    print(f"Model saved to: {args.output_dir}")

-    # Also save training config for reference
     config_path = os.path.join(args.output_dir, "training_config.json")
     with open(config_path, "w") as f:
         json.dump(vars(args), f, indent=2)
-    print(f"Training config saved to: {config_path}")

-    print("\n" + "=" * 60)
-    print("Training complete!")
-    print(f"Model: {args.output_dir}")
-    print("Next: run inference_eval.py to compare baseline vs trained")
-    print("=" * 60)
+    print(f"Model saved to: {args.output_dir}")
+    print(f"Training config saved to: {config_path}")
+    print("\nNext: run inference_eval.py to compare baseline vs trained")
+    print("      python training/inference_eval.py --model_path " + args.output_dir)


 if __name__ == "__main__":
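Typical invocations under the new defaults (flags exactly as declared in the argparse block above; the output path in the second command is illustrative):

    # Full meta-RL run: continuous profiles, env replay, belief reward
    python training/train.py --max_steps 1500

    # Legacy comparison against the 3 hardcoded profiles
    python training/train.py --profile_mode discrete --output_dir outputs/rhythmenv_discrete

    # Fast smoke run: heuristic reward, no env replay
    python training/train.py --max_steps 50 --use_simple_reward

On how the four layers combine: TRL's GRPOTrainer evaluates every function in reward_funcs on each completion and sums the results (equally weighted unless reward_weights is set in GRPOConfig), so belief_accuracy's bounded ±0.5 band nudges the policy without being able to dominate the env reward.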