InosLihka Claude Opus 4.7 (1M context) committed on
Commit ecbe0d8 · 1 Parent(s): 0a15ab5

env: meta-RL refactor (continuous profiles, action+belief, adaptation grader)


Transforms rhythm_env from a 3-profile single-task RL benchmark into a
meta-RL environment where the agent must infer the user's hidden personality
from observations alone. Original 500-step training regressed vs heuristic
on all 3 profiles; this refactor addresses the four root causes:

1. server/rhythm_environment.py
- Continuous profile sampling: sample_profile(seed) draws 14 personality
parameters from bounded distributions per episode (memorization
impossible, agent must learn the *skill* of inference)
- Discrete 3-profile mode preserved via profile_mode='discrete' or
profile=<name> for legacy comparison
- profile_to_belief_vector(): reduces a profile to 3-dim ground truth
[social, morning, work] for belief-accuracy reward
- get_belief_target() / get_profile_hint(): privileged accessors for
training-side reward computation and curriculum hints
- Grader rewrite: dropped meter_balance (was rewarding random behavior),
added adaptation_score (0.30 weight) computed as late-half mean reward
minus early-half, gated by absolute late-half quality. This is the
direct meta-learning signal: only positive if the agent got better
mid-episode
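
   A minimal sketch of that adaptation term (the names and the gating
   constant are illustrative, not the shipped grader code):

   ```python
   def adaptation_score(step_rewards: list[float], quality_floor: float = 0.0) -> float:
       """Late-half mean reward minus early-half mean, gated on absolute
       late-half quality. Sketch of the grader term described above; the
       quality_floor value is an assumption."""
       if len(step_rewards) < 2:
           return 0.0
       half = len(step_rewards) // 2
       early = sum(step_rewards[:half]) / half
       late = sum(step_rewards[half:]) / (len(step_rewards) - half)
       if late <= quality_floor:  # gate: improving from terrible to mediocre earns nothing
           return 0.0
       return late - early  # positive only if the agent got better mid-episode
   ```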

2. training/reward_functions.py
- 4-layer reward stack adds belief_accuracy as the meta-RL gradient
- extract_action_and_belief() parses the 'ACTION_NAME S M W' format, where
  S/M/W are 0-9 digits encoding the agent's belief about the user (see the
  parser sketch after this list)
- format_valid graduated: action+belief +1.0, action only +0.5, no action -2.0
- env_reward and belief_accuracy use seed+history+profile_mode replay
for deterministic, non-fabricable rewards
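
   A sketch of how such a parser could look (the regex and return
   convention are assumptions; the real implementation lives in
   training/reward_functions.py):

   ```python
   import re

   # 'ACTION_NAME S M W': an upper-case action token followed by three 0-9 digits.
   ACTION_BELIEF_RE = re.compile(r"^([A-Z_]+)\s+(\d)\s+(\d)\s+(\d)\s*$")

   def extract_action_and_belief(completion: str):
       """Return (action, belief) with the S/M/W digits scaled to [0, 1],
       or (None, None) if the completion is not parseable."""
       match = ACTION_BELIEF_RE.match(completion.strip())
       if match is None:
           return None, None
       action = match.group(1)
       belief = [int(match.group(i)) / 9.0 for i in (2, 3, 4)]
       return action, belief
   ```

   Under the graduated scheme above, 'DEEP_WORK 3 8 7' parses fully (+1.0),
   a bare action name earns +0.5, and anything else -2.0.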

3. training/dataset.py
- profile_mode parameter (default 'continuous')
- hint_fraction parameter for curriculum: a fraction of training samples
  include the true profile vector in the prompt as a warmup (see the
  curriculum sketch after this list)
- System prompt updated to teach action+belief format with examples
- Observation prompt now includes recent step history
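
   A sketch of the curriculum mixing (the prompt wording and RNG handling
   are assumptions):

   ```python
   import random
   from typing import Optional

   def maybe_add_hint(prompt: str, belief_target: tuple,
                      hint_fraction: float = 0.15,
                      rng: Optional[random.Random] = None) -> str:
       """With probability hint_fraction, prepend the true 3-dim profile
       vector as a warmup hint; otherwise the model must infer it from
       observations alone."""
       rng = rng or random.Random()
       if rng.random() < hint_fraction:
           s, m, w = belief_target
           return f"HINT: user profile S={s:.2f} M={m:.2f} W={w:.2f}\n{prompt}"
       return prompt
   ```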

4. training/train.py + RhythmEnv_GRPO_Training.ipynb
- max_steps default 1500 (was 500), LoRA rank 8 (was 4)
- max_prompt_length 600, max_completion_length 20 (room for belief)
- hint_fraction=0.15 default (full config sketched after this list)
- Notebook: comparison plot uses eval_results.json schema, plots both
final_score and adaptation across 3 conditions (discrete/in-dist/OOD)
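
   The key trainer settings, sketched as a TRL GRPOConfig (these argument
   names exist in recent TRL releases; the values come from the text above):

   ```python
   from trl import GRPOConfig

   config = GRPOConfig(
       output_dir="outputs",
       max_steps=1500,             # was 500 in v1
       learning_rate=5e-5,
       beta=0.1,                   # KL penalty
       max_grad_norm=0.5,
       num_generations=4,
       max_prompt_length=600,
       max_completion_length=20,   # room for 'ACTION_NAME S M W'
   )
   ```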

5. training/inference_eval.py
- 3-condition eval suite: discrete-3-profiles (legacy comparison),
continuous-in-distribution, continuous-OOD (generalization test)
- Tracks adaptation and belief_mae per episode (belief_mae sketched after this list)
- OOD test is the meta-learning proof: trained agent must beat heuristic
on profiles drawn from a held-out parameter region
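
   belief_mae is presumably a plain mean absolute error between the agent's
   final belief and the hidden ground truth; a minimal sketch:

   ```python
   def belief_mae(predicted: list[float], true: list[float]) -> float:
       """Mean absolute error between the belief digits (scaled to [0, 1])
       and the hidden [social, morning, work] ground-truth vector."""
       return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)
   ```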

6. inference.py
- System prompt updated for action+belief format
- parse_llm_action delegates to training parser (consistent with training)
- Observation prompt includes step history

7. New diagnostics + baselines
- scripts/diagnostic_replay.py: verifies env_reward replay is deterministic
  (passes 280/280 steps with 0 mismatches; see the sketch after this list)
- scripts/pipeline_dryrun.py: validates 4-layer reward ordering on
synthetic completions (perfect=+2.60 > good=+2.21 > wrong_belief=+2.20
> action_only=+1.41 > garbage=-6.20)
- eval_baselines_meta.json: new grader baselines across 3 conditions
- eval_results_v1.json: preserved old grader baselines for comparison
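
   The determinism check presumably boils down to replaying the same
   (seed, action history) twice and diffing the per-step rewards; a
   hypothetical sketch (env_factory and the step return shape are
   assumptions):

   ```python
   def replay_is_deterministic(env_factory, seed: int, actions: list) -> bool:
       """Replay one action sequence twice from the same seed and compare
       the per-step reward streams; any mismatch would make the replay-based
       env_reward (and hence the training signal) untrustworthy."""
       runs = []
       for _ in range(2):
           env = env_factory(seed=seed, profile_mode="continuous")
           runs.append([env.step(a)[1] for a in actions])  # assumes (obs, reward, ...) tuples
       return runs[0] == runs[1]
   ```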

8. README.md rewritten with meta-RL framing, parameter distribution table,
action+belief format docs, new baseline numbers (heuristic in-dist
0.587, OOD 0.580 — the bars to beat). All 31 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -9,24 +9,25 @@ tags:
  - openenv
  ---
 
- # RhythmEnv — Life Simulator RL Environment
+ # RhythmEnv — Meta-RL Life Simulator
 
- An OpenEnv environment where AI agents learn to balance 5 life meters across a 7-day week while discovering hidden personality profiles through reward signals.
+ An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space — the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.
+
+ This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life; it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.
 
  ## Links
 
  - **Live Environment**: [HF Space — InosLihka/rhythm_env](https://huggingface.co/spaces/InosLihka/rhythm_env)
  - **Training Notebook**: [Colab — GRPO Training](training/RhythmEnv_GRPO_Training.ipynb)
  - **Blog Post**: [Teaching an AI to Know You (Without Asking)](docs/blog_post.md)
- <!-- - **Training Results**: See [Results](#training-results) section below -->
 
  ## Why a Life Simulator?
 
- Personal AI assistants give generic advice. They don't know *you*. RhythmEnv is an environment where an agent learns your specific patterns through experience, not configuration.
+ Personal AI assistants give generic advice. They don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference: it personalizes in ~5 interactions instead of 50+.
 
- Every person has a hidden "DNA" that determines how activities affect their well-being: introverts drain faster from socializing, workaholics find peace through progress, night owls peak in the evening. The agent must discover these hidden dynamics by observing how the same actions produce different rewards across episodes.
+ Every sampled person has a hidden "DNA": a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.
 
- This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent must plan across a full week while adapting to a hidden personality it must infer.
+ This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent plans across a full week while inferring a hidden personality from observation alone.
 
  ## Quick Start
 
@@ -90,91 +91,142 @@ All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receiv
  - **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
  - **Deterministic** given seed: same seed → same episode trajectory
 
- ## The Hidden Variable System (Core Innovation)
+ ## The Meta-Learning Setup (Core Innovation)
 
- ### What the Agent Sees
- - All 5 meter values
+ ### What the Agent Sees Each Step
+ - All 5 meter values + per-meter deltas from the last action
  - Current day, slot, timestep
- - Active event (if any)
- - Total scalar reward per step
+ - Active random event (if any)
+ - Rolling 7-step history of (action, reward, deltas)
+ - Total scalar reward
 
  ### What the Agent Does NOT See
- - **Personality profile name** — assigned per episode, controls everything below
- - **Reward weight decomposition** — same meter changes produce different rewards for different profiles
- - **Action modifiers** — social drain, cognitive bonuses, shame spirals vary by profile
+ - **The hidden personality vector** — sampled per episode, controls everything below
+ - **Reward weight decomposition** — same meter changes produce different rewards for different people
+ - **Action modifiers** — social drain, cognitive bonuses, shame spirals vary continuously
 
- ### 3 Hidden Personality Profiles
+ ### Continuous Personality Space
 
- | Profile | Hidden Trait Values | "Good Day" (Reward Weights) |
- |---------|----------------|-------------------|
- | **Introvert Morning Person** | Social drains 3× Vitality, Morning cognition ×2.0, Binge shame spiral, Solo activities +0.10 Serenity | **Serenity 60%**, Progress 20% |
- | **Extrovert Night Owl** | Social drains only 0.2× Vitality, Evening cognition ×1.8, Morning penalty ×0.4, Social gives 2× Connection + Serenity bonus | **Connection 75%** |
- | **Workaholic Stoic** | Progress gives +0.10 Serenity, Idle drains -0.10 Serenity, Work gives +0.06 Vitality recovery, 1.5× Vitality decay | **Progress 70%** |
+ Each `reset()` samples a fresh personality from parameter distributions:
+
+ | Parameter | Distribution | Effect |
+ |---|---|---|
+ | `social_vitality_multiplier` | U(0.2, 3.0) | low = extrovert, high = introvert |
+ | `morning_cognition_bonus` | U(0.4, 2.0) or none | morning person trait |
+ | `evening_night_cognition_bonus` | U(0.6, 1.8) or none | night owl trait |
+ | `binge_shame` | Bernoulli(0.5) | conscientiousness |
+ | `progress_serenity_bonus` | U(0, 0.10) | workaholic peace-from-work |
+ | `work_vitality_recovery` | U(0, 0.06) | work energizes vs drains |
+ | `solo_serenity_bonus` | U(0, 0.10) | introvert solo recharge |
+ | `social_connection_multiplier` | U(1, 2) | extrovert social fluency |
+ | `social_serenity_bonus` | U(0, 0.06) | extrovert peace-from-people |
+ | `connection_decay_rate` | U(0.005, 0.02) | how fast relationships fade |
+ | `vitality_decay_rate` | U(0, 0.04) | base energy decline |
+ | `event_impact_multiplier` | U(0.5, 1.0) | resilience to setbacks |
+ | `stress_tolerance` | U(0.15, 0.30) | when the stress spiral kicks in |
+ | `reward_weights` | Dirichlet biased to non-vit/cog | what THIS person values |
+
+ This produces an effectively infinite personality space — memorization is impossible; the agent must learn the *skill* of inference.
+
+ ### The 3 Discrete Profiles (legacy / eval anchor)
+
+ For comparison with the original benchmark, the env also exposes 3 hardcoded personalities reachable via `profile_mode='discrete'` or `profile=<name>`:
+
+ - **Introvert Morning Person** → belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
+ - **Extrovert Night Owl** → belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
+ - **Workaholic Stoic** → belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`
+
+ These exist as anchor points in the continuous space.
+
+ ### The Action+Belief Output Format
+
+ Each step the agent outputs ONE LINE: `ACTION_NAME S M W`
+
+ where `S M W` are single digits (0-9) representing the agent's CURRENT belief about the user:
+ - **S** = social preference (0 = hates social, 9 = loves social)
+ - **M** = morning preference (0 = night owl, 9 = morning person)
+ - **W** = work preference (0 = avoids work, 9 = workaholic)
+
+ Example: `DEEP_WORK 3 8 7` means *do deep work, AND I think this person is moderately introverted, a strong morning person, and somewhat work-oriented*.
+
+ The belief output is the meta-learning lever. The agent receives a separate reward for belief accuracy (cosine-vs-MAE against the hidden true vector), forcing it to *explicitly model the user* rather than just pattern-match action→reward.
 
  ### The Discovery Challenge
 
  The agent must:
- 1. **Probe** — try different actions to see unexpected meter changes
- 2. **Infer** — correlate reward patterns to discover the hidden personality
- 3. **Adapt** — change its entire strategy based on what it learned
-
- A trained agent for the Introvert learns to: front-load deep work in the morning, avoid socializing, never binge watch. The same agent for the Extrovert learns to: socialize freely, work in the evening, skip mornings.
+ 1. **Probe** — try different actions in early steps to see how the person responds
+ 2. **Infer** — update its belief vector each step based on observed rewards
+ 3. **Adapt** — late in the episode, exploit the belief by choosing actions matching the inferred personality
 
- ## Reward Architecture
+ ## Reward Architecture (4-layer training stack)
 
- **Per-step reward** = `sum(meter_delta × hidden_weight) × 15`
+ | Layer | Function | Range | Purpose |
+ |---|---|---|---|
+ | 1 | `format_valid` | -2 to +1 | parseable as ACTION + 3 belief digits |
+ | 2 | `action_legal` | -1 to +0.5 | action is one of 10 valid types |
+ | 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
+ | 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs true profile vector |
 
- The reward weights are hidden and differ per profile. The agent sees only the total scalar reward.
+ **Per-step env reward** = `sum(meter_delta × hidden_weight) × 15`, where the weights are sampled per profile.
 
- **Critical threshold**: If any meter drops below 0.1, a -0.30 penalty is applied.
+ **Critical threshold**: any meter < 0.1 → -0.30 penalty.
 
- **Final grader** (0-1 score at episode end):
+ **Final grader (meta-RL aligned)**, `final_score ∈ [0, 1]`:
  ```
- score = 0.30 × meter_balance + 0.25 × crash_free + 0.20 × progress + 0.15 × connection + 0.10 × efficiency
+ score = 0.20 × crash_free + 0.25 × progress + 0.15 × connection
+       + 0.30 × adaptation_score + 0.10 × efficiency
  ```
 
- ## Baseline Scores
+ **`adaptation_score` is the meta-learning signal**: late-half mean reward minus early-half mean, gated by absolute late-half quality so a "terrible-then-mediocre" exploit cannot win. It directly rewards "the agent got better as it learned the user." Random and heuristic baselines have *negative* adaptation; only an agent that actually learns mid-episode can win this term.
 
+ ## Baseline Scores (new grader, 5 episodes per condition)
+
+ ### Discrete-3-profiles (legacy comparison)
  | Profile | Heuristic | Random |
  |---------|-----------|--------|
- | Introvert Morning | **0.765** | 0.693 |
- | Extrovert Night Owl | **0.819** | 0.863 |
- | Workaholic Stoic | **0.761** | 0.602 |
-
- The heuristic uses fixed rules (sleep at night, work in morning, socialize when connection is low). It doesn't adapt to profiles — a trained agent that discovers the hidden personality should significantly outperform it.
+ | Introvert Morning | **0.574** | 0.535 |
+ | Extrovert Night Owl | 0.594 | **0.641** |
+ | Workaholic Stoic | **0.585** | 0.485 |
 
- ## Training Results
+ ### Continuous in-distribution (the meta-RL training surface)
+ | Strategy | avg_score | avg_adaptation |
+ |---|---|---|
+ | Heuristic | **0.587** | -0.242 |
+ | Random | 0.516 | -0.304 |
 
- Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https://github.com/unslothai/unsloth) for efficient 4-bit fine-tuning on Qwen 2.5-3B (4-bit + LoRA rank 4, free Colab T4, ~3 hours for 500 steps).
+ ### Continuous OOD (held-out parameter region generalization test)
+ | Strategy | avg_score | avg_adaptation |
+ |---|---|---|
+ | Heuristic | **0.580** | -0.083 |
+ | Random | 0.508 | -0.218 |
 
- **Three-layer reward stack:**
- 1. `format_valid` — did the model output a parseable action name? (+1.0 / -2.0)
- 2. `action_legal` — is it one of the 10 valid actions? (+0.5 / -1.0)
- 3. `env_reward` — actual reward from seed-based episode replay (cannot be fabricated)
+ The trained agent must beat the heuristic on continuous in-distribution AND on the OOD region — beating only in-distribution would suggest memorization, not meta-learning. Positive `adaptation` is direct evidence of mid-episode learning.
 
- **Key config:** `beta=0.1` (KL penalty), `lr=5e-5`, `max_grad_norm=0.5`, `max_completion_length=16`
+ ## Training
 
- ### Training Curves
+ Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https://github.com/unslothai/unsloth) for efficient 4-bit fine-tuning of Qwen 2.5-3B + LoRA rank 8 (free Colab T4, ~6 hours for 1500 steps).
 
- ![Training Loss](plots/training_loss.png)
- *GRPO training loss over 500 steps.*
+ **Run:**
+ ```bash
+ python training/train.py --max_steps 1500
+ ```
 
- ![Reward Curve](plots/reward_curve.png)
- *Mean reward with ±1 std band. Upward trend indicates the agent learning to balance meters.*
+ Defaults are tuned for meta-RL on a T4: continuous profile mode, hint_fraction=0.15 (curriculum warmup), num_generations=4, lr=5e-5, beta=0.1, max_prompt_length=600, max_completion_length=20.
 
- ### Baseline vs Trained Comparison
+ **Curriculum:** ~15% of training samples include the true profile vector in the prompt as a *training hint*, teaching the model the *skill* of using profile signals before forcing inference. The remaining 85% require inference from observation alone.
 
- ![Baseline vs Trained](plots/baseline_vs_trained.png)
+ ### Why this should beat the original 500-step result
 
- | Profile | Random | Heuristic | Trained (500 steps) | vs Heuristic |
- |---------|--------|-----------|---------------------|--------------|
- | Introvert Morning | 0.693 | 0.765 | **0.617** | -0.148 |
- | Extrovert Night Owl | 0.863 | 0.819 | **0.725** | -0.094 |
- | Workaholic Stoic | 0.602 | 0.761 | **0.539** | -0.222 |
+ The original v1 single-task training (500 steps, LoRA rank 4) **regressed vs the heuristic on all 3 profiles**. Root-cause analysis identified four issues that this version fixes:
 
- **What the agent learned**: Serenity and Connection management improved substantially — the introvert's serenity maxed at 1.0, the extrovert's connection reached 1.0. The model correctly learned the *sign* of personality differences (introvert avoids socialising; extrovert embraces it).
+ | Original problem | Fix |
+ |---|---|
+ | Grader's `meter_balance` (0.30) rewarded random behavior | Replaced with `adaptation_score` — only positive if the agent learned mid-episode |
+ | Only 3 hardcoded profiles → memorizable | Continuous parameter sampling per episode → memorization impossible |
+ | No explicit "model the user" output → no inference gradient | Action+belief format with `belief_accuracy` reward → forces user-modeling |
+ | 500 steps + LoRA rank 4 too small | 1500 steps + LoRA rank 8, curriculum warmup |
 
- **What still needs training**: The workaholic profile's key hidden modifier — that `DEEP_WORK` *recovers* vitality instead of draining it (+0.06 instead of -0.05) — was not discovered in 500 steps. Without this insight, the trained agent under-uses deep work for the workaholic. A longer run (1000+ steps) with higher DEEP_WORK sampling density in the workaholic training episodes is the clear next step.
+ Verification: run both `discrete-3-profiles` (legacy comparison vs the old eval_results.json) and `continuous-OOD` (the meta-learning generalization test).
 
  ## Setup Instructions
 
@@ -210,9 +262,9 @@ python inference.py
 
  | Method | Endpoint | Description |
  |--------|----------|-------------|
- | `POST` | `/reset` | Start a new episode (`seed`, `profile` optional) |
+ | `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (legacy 3-profile), `profile_mode='discrete'\|'continuous'` (default continuous = sampled meta-RL distribution) |
  | `POST` | `/step` | Execute an action (`action_type`) |
- | `GET` | `/state` | Get current state (includes hidden profile for debugging) |
+ | `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
  | `GET` | `/health` | Health check |
  | `GET` | `/metadata` | Environment metadata |
  | `GET` | `/schema` | Action/observation JSON schemas |
eval_baselines_meta.json ADDED
@@ -0,0 +1,3642 @@
+ [
+   {
+     "condition": "discrete-3-profiles (legacy)",
+     "profile_name": "introvert_morning",
+     "profile_mode": "discrete",
+     "strategy": "heuristic",
+     "seed": 0,
+     "final_score": 0.5745,
+     "total_reward": 5.84,
+     "adaptation": -0.28,
+     "vitality": 0.76,
+     "cognition": 0.41,
+     "progress": 1.0,
+     "serenity": 1.0,
+     "connection": 0.43,
+     "actions": ["deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "socialize", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep",
+                 "deep_work", "admin_work", "socialize", "sleep",
+                 "deep_work", "admin_work", "meditate", "sleep"],
+     "true_belief": [0.0, 1.0, 0.069],
+     "final_belief": null,
+     "belief_mae": null
+   },
+   …

(View truncated; the remaining records follow the same schema, one per condition × strategy × seed.)
1546
+ "learn",
1547
+ "learn",
1548
+ "admin_work",
1549
+ "family_time",
1550
+ "admin_work",
1551
+ "deep_work",
1552
+ "socialize"
1553
+ ],
1554
+ "true_belief": [
1555
+ 0.357,
1556
+ 0.5,
1557
+ 1.0
1558
+ ],
1559
+ "final_belief": null,
1560
+ "belief_mae": null
1561
+ },
1562
+ {
1563
+ "condition": "continuous-in-distribution",
1564
+ "profile_name": "sampled_100",
1565
+ "profile_mode": "continuous",
1566
+ "strategy": "heuristic",
1567
+ "seed": 100,
1568
+ "final_score": 0.5861,
1569
+ "total_reward": 10.74,
1570
+ "adaptation": -0.456,
1571
+ "vitality": 0.47,
1572
+ "cognition": 0.26,
1573
+ "progress": 1.0,
1574
+ "serenity": 1.0,
1575
+ "connection": 0.45,
1576
+ "actions": [
1577
+ "deep_work",
1578
+ "learn",
1579
+ "meditate",
1580
+ "sleep",
1581
+ "deep_work",
1582
+ "admin_work",
1583
+ "meditate",
1584
+ "sleep",
1585
+ "deep_work",
1586
+ "admin_work",
1587
+ "socialize",
1588
+ "sleep",
1589
+ "deep_work",
1590
+ "admin_work",
1591
+ "socialize",
1592
+ "sleep",
1593
+ "exercise",
1594
+ "admin_work",
1595
+ "meditate",
1596
+ "sleep",
1597
+ "deep_work",
1598
+ "admin_work",
1599
+ "socialize",
1600
+ "sleep",
1601
+ "deep_work",
1602
+ "admin_work",
1603
+ "meditate",
1604
+ "sleep"
1605
+ ],
1606
+ "true_belief": [
1607
+ 0.613,
1608
+ 0.5,
1609
+ 0.503
1610
+ ],
1611
+ "final_belief": null,
1612
+ "belief_mae": null
1613
+ },
1614
+ {
1615
+ "condition": "continuous-in-distribution",
1616
+ "profile_name": "sampled_101",
1617
+ "profile_mode": "continuous",
1618
+ "strategy": "heuristic",
1619
+ "seed": 101,
1620
+ "final_score": 0.5802,
1621
+ "total_reward": 3.85,
1622
+ "adaptation": -0.041,
1623
+ "vitality": 1.0,
1624
+ "cognition": 0.51,
1625
+ "progress": 1.0,
1626
+ "serenity": 1.0,
1627
+ "connection": 0.5,
1628
+ "actions": [
1629
+ "deep_work",
1630
+ "learn",
1631
+ "meditate",
1632
+ "sleep",
1633
+ "deep_work",
1634
+ "learn",
1635
+ "meditate",
1636
+ "sleep",
1637
+ "deep_work",
1638
+ "admin_work",
1639
+ "meditate",
1640
+ "sleep",
1641
+ "deep_work",
1642
+ "admin_work",
1643
+ "socialize",
1644
+ "sleep",
1645
+ "deep_work",
1646
+ "admin_work",
1647
+ "meditate",
1648
+ "sleep",
1649
+ "deep_work",
1650
+ "admin_work",
1651
+ "meditate",
1652
+ "sleep",
1653
+ "deep_work",
1654
+ "admin_work",
1655
+ "meditate",
1656
+ "sleep"
1657
+ ],
1658
+ "true_belief": [
1659
+ 0.545,
1660
+ 0.277,
1661
+ 0.461
1662
+ ],
1663
+ "final_belief": null,
1664
+ "belief_mae": null
1665
+ },
1666
+ {
1667
+ "condition": "continuous-in-distribution",
1668
+ "profile_name": "sampled_102",
1669
+ "profile_mode": "continuous",
1670
+ "strategy": "heuristic",
1671
+ "seed": 102,
1672
+ "final_score": 0.5912,
1673
+ "total_reward": 10.16,
1674
+ "adaptation": -0.432,
1675
+ "vitality": 0.44,
1676
+ "cognition": 0.24,
1677
+ "progress": 1.0,
1678
+ "serenity": 0.98,
1679
+ "connection": 0.49,
1680
+ "actions": [
1681
+ "deep_work",
1682
+ "learn",
1683
+ "meditate",
1684
+ "sleep",
1685
+ "deep_work",
1686
+ "admin_work",
1687
+ "meditate",
1688
+ "sleep",
1689
+ "deep_work",
1690
+ "admin_work",
1691
+ "socialize",
1692
+ "sleep",
1693
+ "deep_work",
1694
+ "admin_work",
1695
+ "meditate",
1696
+ "sleep",
1697
+ "deep_work",
1698
+ "admin_work",
1699
+ "meditate",
1700
+ "sleep",
1701
+ "deep_work",
1702
+ "admin_work",
1703
+ "meditate",
1704
+ "sleep",
1705
+ "deep_work",
1706
+ "admin_work",
1707
+ "socialize",
1708
+ "sleep"
1709
+ ],
1710
+ "true_belief": [
1711
+ 0.252,
1712
+ 0.5,
1713
+ 0.558
1714
+ ],
1715
+ "final_belief": null,
1716
+ "belief_mae": null
1717
+ },
1718
+ {
1719
+ "condition": "continuous-in-distribution",
1720
+ "profile_name": "sampled_103",
1721
+ "profile_mode": "continuous",
1722
+ "strategy": "heuristic",
1723
+ "seed": 103,
1724
+ "final_score": 0.59,
1725
+ "total_reward": 9.67,
1726
+ "adaptation": -0.465,
1727
+ "vitality": 0.32,
1728
+ "cognition": 0.83,
1729
+ "progress": 1.0,
1730
+ "serenity": 1.0,
1731
+ "connection": 0.48,
1732
+ "actions": [
1733
+ "deep_work",
1734
+ "learn",
1735
+ "meditate",
1736
+ "sleep",
1737
+ "deep_work",
1738
+ "admin_work",
1739
+ "meditate",
1740
+ "sleep",
1741
+ "deep_work",
1742
+ "admin_work",
1743
+ "socialize",
1744
+ "sleep",
1745
+ "deep_work",
1746
+ "admin_work",
1747
+ "meditate",
1748
+ "sleep",
1749
+ "exercise",
1750
+ "admin_work",
1751
+ "meditate",
1752
+ "sleep",
1753
+ "deep_work",
1754
+ "admin_work",
1755
+ "meditate",
1756
+ "sleep",
1757
+ "deep_work",
1758
+ "admin_work",
1759
+ "socialize",
1760
+ "sleep"
1761
+ ],
1762
+ "true_belief": [
1763
+ 0.629,
1764
+ 0.951,
1765
+ 0.487
1766
+ ],
1767
+ "final_belief": null,
1768
+ "belief_mae": null
1769
+ },
1770
+ {
1771
+ "condition": "continuous-in-distribution",
1772
+ "profile_name": "sampled_104",
1773
+ "profile_mode": "continuous",
1774
+ "strategy": "heuristic",
1775
+ "seed": 104,
1776
+ "final_score": 0.5856,
1777
+ "total_reward": 9.49,
1778
+ "adaptation": -0.184,
1779
+ "vitality": 0.72,
1780
+ "cognition": 0.63,
1781
+ "progress": 1.0,
1782
+ "serenity": 1.0,
1783
+ "connection": 0.46,
1784
+ "actions": [
1785
+ "deep_work",
1786
+ "learn",
1787
+ "meditate",
1788
+ "sleep",
1789
+ "deep_work",
1790
+ "admin_work",
1791
+ "meditate",
1792
+ "sleep",
1793
+ "deep_work",
1794
+ "family_time",
1795
+ "meditate",
1796
+ "sleep",
1797
+ "deep_work",
1798
+ "admin_work",
1799
+ "socialize",
1800
+ "sleep",
1801
+ "deep_work",
1802
+ "admin_work",
1803
+ "socialize",
1804
+ "sleep",
1805
+ "deep_work",
1806
+ "admin_work",
1807
+ "meditate",
1808
+ "sleep",
1809
+ "deep_work",
1810
+ "admin_work",
1811
+ "socialize",
1812
+ "sleep"
1813
+ ],
1814
+ "true_belief": [
1815
+ 0.302,
1816
+ 0.5,
1817
+ 0.693
1818
+ ],
1819
+ "final_belief": null,
1820
+ "belief_mae": null
1821
+ },
1822
+ {
1823
+ "condition": "continuous-in-distribution",
1824
+ "profile_name": "sampled_105",
1825
+ "profile_mode": "continuous",
1826
+ "strategy": "heuristic",
1827
+ "seed": 105,
1828
+ "final_score": 0.5866,
1829
+ "total_reward": 5.04,
1830
+ "adaptation": -0.353,
1831
+ "vitality": 0.32,
1832
+ "cognition": 0.5,
1833
+ "progress": 1.0,
1834
+ "serenity": 0.96,
1835
+ "connection": 0.53,
1836
+ "actions": [
1837
+ "deep_work",
1838
+ "learn",
1839
+ "meditate",
1840
+ "sleep",
1841
+ "deep_work",
1842
+ "admin_work",
1843
+ "meditate",
1844
+ "sleep",
1845
+ "deep_work",
1846
+ "admin_work",
1847
+ "meditate",
1848
+ "sleep",
1849
+ "deep_work",
1850
+ "admin_work",
1851
+ "socialize",
1852
+ "sleep",
1853
+ "exercise",
1854
+ "admin_work",
1855
+ "meditate",
1856
+ "sleep",
1857
+ "deep_work",
1858
+ "admin_work",
1859
+ "meditate",
1860
+ "sleep",
1861
+ "exercise",
1862
+ "admin_work",
1863
+ "socialize",
1864
+ "sleep"
1865
+ ],
1866
+ "true_belief": [
1867
+ 0.458,
1868
+ 0.5,
1869
+ 0.315
1870
+ ],
1871
+ "final_belief": null,
1872
+ "belief_mae": null
1873
+ },
1874
+ {
1875
+ "condition": "continuous-in-distribution",
1876
+ "profile_name": "sampled_106",
1877
+ "profile_mode": "continuous",
1878
+ "strategy": "heuristic",
1879
+ "seed": 106,
1880
+ "final_score": 0.5881,
1881
+ "total_reward": 6.6,
1882
+ "adaptation": -0.196,
1883
+ "vitality": 0.46,
1884
+ "cognition": 0.22,
1885
+ "progress": 1.0,
1886
+ "serenity": 0.98,
1887
+ "connection": 0.51,
1888
+ "actions": [
1889
+ "deep_work",
1890
+ "learn",
1891
+ "meditate",
1892
+ "sleep",
1893
+ "deep_work",
1894
+ "admin_work",
1895
+ "meditate",
1896
+ "sleep",
1897
+ "deep_work",
1898
+ "admin_work",
1899
+ "meditate",
1900
+ "sleep",
1901
+ "deep_work",
1902
+ "admin_work",
1903
+ "meditate",
1904
+ "sleep",
1905
+ "deep_work",
1906
+ "admin_work",
1907
+ "meditate",
1908
+ "sleep",
1909
+ "deep_work",
1910
+ "admin_work",
1911
+ "meditate",
1912
+ "sleep",
1913
+ "deep_work",
1914
+ "admin_work",
1915
+ "socialize",
1916
+ "sleep"
1917
+ ],
1918
+ "true_belief": [
1919
+ 0.337,
1920
+ 0.5,
1921
+ 0.467
1922
+ ],
1923
+ "final_belief": null,
1924
+ "belief_mae": null
1925
+ },
1926
+ {
1927
+ "condition": "continuous-in-distribution",
1928
+ "profile_name": "sampled_107",
1929
+ "profile_mode": "continuous",
1930
+ "strategy": "heuristic",
1931
+ "seed": 107,
1932
+ "final_score": 0.5826,
1933
+ "total_reward": 8.01,
1934
+ "adaptation": -0.072,
1935
+ "vitality": 0.31,
1936
+ "cognition": 0.61,
1937
+ "progress": 1.0,
1938
+ "serenity": 0.76,
1939
+ "connection": 0.46,
1940
+ "actions": [
1941
+ "deep_work",
1942
+ "learn",
1943
+ "meditate",
1944
+ "sleep",
1945
+ "deep_work",
1946
+ "admin_work",
1947
+ "meditate",
1948
+ "sleep",
1949
+ "deep_work",
1950
+ "admin_work",
1951
+ "meditate",
1952
+ "sleep",
1953
+ "deep_work",
1954
+ "admin_work",
1955
+ "meditate",
1956
+ "sleep",
1957
+ "deep_work",
1958
+ "admin_work",
1959
+ "socialize",
1960
+ "sleep",
1961
+ "deep_work",
1962
+ "admin_work",
1963
+ "socialize",
1964
+ "sleep",
1965
+ "deep_work",
1966
+ "admin_work",
1967
+ "socialize",
1968
+ "sleep"
1969
+ ],
1970
+ "true_belief": [
1971
+ 0.315,
1972
+ 0.376,
1973
+ 0.501
1974
+ ],
1975
+ "final_belief": null,
1976
+ "belief_mae": null
1977
+ },
1978
+ {
1979
+ "condition": "continuous-in-distribution",
1980
+ "profile_name": "sampled_108",
1981
+ "profile_mode": "continuous",
1982
+ "strategy": "heuristic",
1983
+ "seed": 108,
1984
+ "final_score": 0.5881,
1985
+ "total_reward": 7.99,
1986
+ "adaptation": -0.055,
1987
+ "vitality": 0.28,
1988
+ "cognition": 0.37,
1989
+ "progress": 1.0,
1990
+ "serenity": 1.0,
1991
+ "connection": 0.49,
1992
+ "actions": [
1993
+ "deep_work",
1994
+ "learn",
1995
+ "meditate",
1996
+ "sleep",
1997
+ "deep_work",
1998
+ "admin_work",
1999
+ "socialize",
2000
+ "sleep",
2001
+ "deep_work",
2002
+ "admin_work",
2003
+ "meditate",
2004
+ "sleep",
2005
+ "deep_work",
2006
+ "admin_work",
2007
+ "meditate",
2008
+ "sleep",
2009
+ "exercise",
2010
+ "admin_work",
2011
+ "socialize",
2012
+ "sleep",
2013
+ "exercise",
2014
+ "admin_work",
2015
+ "meditate",
2016
+ "sleep",
2017
+ "deep_work",
2018
+ "admin_work",
2019
+ "socialize",
2020
+ "sleep"
2021
+ ],
2022
+ "true_belief": [
2023
+ 0.741,
2024
+ 0.5,
2025
+ 0.143
2026
+ ],
2027
+ "final_belief": null,
2028
+ "belief_mae": null
2029
+ },
2030
+ {
2031
+ "condition": "continuous-in-distribution",
2032
+ "profile_name": "sampled_109",
2033
+ "profile_mode": "continuous",
2034
+ "strategy": "heuristic",
2035
+ "seed": 109,
2036
+ "final_score": 0.5936,
2037
+ "total_reward": 8.46,
2038
+ "adaptation": -0.162,
2039
+ "vitality": 0.37,
2040
+ "cognition": 0.36,
2041
+ "progress": 1.0,
2042
+ "serenity": 0.98,
2043
+ "connection": 0.52,
2044
+ "actions": [
2045
+ "deep_work",
2046
+ "learn",
2047
+ "meditate",
2048
+ "sleep",
2049
+ "deep_work",
2050
+ "admin_work",
2051
+ "meditate",
2052
+ "sleep",
2053
+ "deep_work",
2054
+ "admin_work",
2055
+ "meditate",
2056
+ "sleep",
2057
+ "deep_work",
2058
+ "admin_work",
2059
+ "socialize",
2060
+ "sleep",
2061
+ "deep_work",
2062
+ "admin_work",
2063
+ "meditate",
2064
+ "sleep",
2065
+ "exercise",
2066
+ "admin_work",
2067
+ "meditate",
2068
+ "sleep",
2069
+ "exercise",
2070
+ "admin_work",
2071
+ "socialize",
2072
+ "sleep"
2073
+ ],
2074
+ "true_belief": [
2075
+ 0.46,
2076
+ 0.718,
2077
+ 0.273
2078
+ ],
2079
+ "final_belief": null,
2080
+ "belief_mae": null
2081
+ },
2082
+ {
2083
+ "condition": "continuous-in-distribution",
2084
+ "profile_name": "sampled_100",
2085
+ "profile_mode": "continuous",
2086
+ "strategy": "random",
2087
+ "seed": 100,
2088
+ "final_score": 0.5315,
2089
+ "total_reward": 0.26,
2090
+ "adaptation": -0.375,
2091
+ "vitality": 0.0,
2092
+ "cognition": 0.0,
2093
+ "progress": 0.81,
2094
+ "serenity": 1.0,
2095
+ "connection": 0.84,
2096
+ "actions": [
2097
+ "exercise",
2098
+ "meditate",
2099
+ "socialize",
2100
+ "learn",
2101
+ "binge_watch",
2102
+ "admin_work",
2103
+ "family_time",
2104
+ "family_time",
2105
+ "socialize",
2106
+ "family_time",
2107
+ "deep_work",
2108
+ "admin_work",
2109
+ "me_time",
2110
+ "deep_work",
2111
+ "family_time",
2112
+ "meditate",
2113
+ "meditate",
2114
+ "admin_work",
2115
+ "learn",
2116
+ "deep_work",
2117
+ "admin_work",
2118
+ "learn",
2119
+ "learn",
2120
+ "deep_work",
2121
+ "family_time",
2122
+ "me_time",
2123
+ "me_time",
2124
+ "learn"
2125
+ ],
2126
+ "true_belief": [
2127
+ 0.613,
2128
+ 0.5,
2129
+ 0.503
2130
+ ],
2131
+ "final_belief": null,
2132
+ "belief_mae": null
2133
+ },
2134
+ {
2135
+ "condition": "continuous-in-distribution",
2136
+ "profile_name": "sampled_101",
2137
+ "profile_mode": "continuous",
2138
+ "strategy": "random",
2139
+ "seed": 101,
2140
+ "final_score": 0.5347,
2141
+ "total_reward": 6.42,
2142
+ "adaptation": -0.236,
2143
+ "vitality": 0.75,
2144
+ "cognition": 0.48,
2145
+ "progress": 0.51,
2146
+ "serenity": 1.0,
2147
+ "connection": 0.99,
2148
+ "actions": [
2149
+ "sleep",
2150
+ "binge_watch",
2151
+ "me_time",
2152
+ "deep_work",
2153
+ "admin_work",
2154
+ "socialize",
2155
+ "family_time",
2156
+ "socialize",
2157
+ "deep_work",
2158
+ "deep_work",
2159
+ "meditate",
2160
+ "socialize",
2161
+ "binge_watch",
2162
+ "me_time",
2163
+ "meditate",
2164
+ "sleep",
2165
+ "sleep",
2166
+ "socialize",
2167
+ "meditate",
2168
+ "me_time",
2169
+ "exercise",
2170
+ "family_time",
2171
+ "sleep",
2172
+ "sleep",
2173
+ "socialize",
2174
+ "deep_work",
2175
+ "family_time",
2176
+ "sleep"
2177
+ ],
2178
+ "true_belief": [
2179
+ 0.545,
2180
+ 0.277,
2181
+ 0.461
2182
+ ],
2183
+ "final_belief": null,
2184
+ "belief_mae": null
2185
+ },
2186
+ {
2187
+ "condition": "continuous-in-distribution",
2188
+ "profile_name": "sampled_102",
2189
+ "profile_mode": "continuous",
2190
+ "strategy": "random",
2191
+ "seed": 102,
2192
+ "final_score": 0.4745,
2193
+ "total_reward": -0.52,
2194
+ "adaptation": 0.19,
2195
+ "vitality": 0.2,
2196
+ "cognition": 0.13,
2197
+ "progress": 0.45,
2198
+ "serenity": 0.98,
2199
+ "connection": 0.74,
2200
+ "actions": [
2201
+ "family_time",
2202
+ "learn",
2203
+ "meditate",
2204
+ "me_time",
2205
+ "exercise",
2206
+ "socialize",
2207
+ "binge_watch",
2208
+ "socialize",
2209
+ "family_time",
2210
+ "admin_work",
2211
+ "me_time",
2212
+ "binge_watch",
2213
+ "deep_work",
2214
+ "meditate",
2215
+ "socialize",
2216
+ "binge_watch",
2217
+ "socialize",
2218
+ "sleep",
2219
+ "meditate",
2220
+ "sleep",
2221
+ "me_time",
2222
+ "learn",
2223
+ "learn",
2224
+ "learn",
2225
+ "me_time",
2226
+ "deep_work",
2227
+ "socialize",
2228
+ "sleep"
2229
+ ],
2230
+ "true_belief": [
2231
+ 0.252,
2232
+ 0.5,
2233
+ 0.558
2234
+ ],
2235
+ "final_belief": null,
2236
+ "belief_mae": null
2237
+ },
2238
+ {
2239
+ "condition": "continuous-in-distribution",
2240
+ "profile_name": "sampled_103",
2241
+ "profile_mode": "continuous",
2242
+ "strategy": "random",
2243
+ "seed": 103,
2244
+ "final_score": 0.5636,
2245
+ "total_reward": 1.31,
2246
+ "adaptation": -0.591,
2247
+ "vitality": 0.0,
2248
+ "cognition": 0.0,
2249
+ "progress": 1.0,
2250
+ "serenity": 1.0,
2251
+ "connection": 0.69,
2252
+ "actions": [
2253
+ "learn",
2254
+ "me_time",
2255
+ "binge_watch",
2256
+ "meditate",
2257
+ "socialize",
2258
+ "deep_work",
2259
+ "binge_watch",
2260
+ "me_time",
2261
+ "deep_work",
2262
+ "admin_work",
2263
+ "socialize",
2264
+ "deep_work",
2265
+ "me_time",
2266
+ "socialize",
2267
+ "learn",
2268
+ "family_time",
2269
+ "learn",
2270
+ "meditate",
2271
+ "admin_work",
2272
+ "sleep",
2273
+ "learn",
2274
+ "socialize",
2275
+ "admin_work",
2276
+ "me_time",
2277
+ "deep_work",
2278
+ "deep_work",
2279
+ "admin_work",
2280
+ "learn"
2281
+ ],
2282
+ "true_belief": [
2283
+ 0.629,
2284
+ 0.951,
2285
+ 0.487
2286
+ ],
2287
+ "final_belief": null,
2288
+ "belief_mae": null
2289
+ },
2290
+ {
2291
+ "condition": "continuous-in-distribution",
2292
+ "profile_name": "sampled_104",
2293
+ "profile_mode": "continuous",
2294
+ "strategy": "random",
2295
+ "seed": 104,
2296
+ "final_score": 0.5435,
2297
+ "total_reward": 6.85,
2298
+ "adaptation": 0.231,
2299
+ "vitality": 0.05,
2300
+ "cognition": 0.68,
2301
+ "progress": 0.46,
2302
+ "serenity": 1.0,
2303
+ "connection": 0.91,
2304
+ "actions": [
2305
+ "admin_work",
2306
+ "sleep",
2307
+ "family_time",
2308
+ "meditate",
2309
+ "binge_watch",
2310
+ "meditate",
2311
+ "family_time",
2312
+ "socialize",
2313
+ "exercise",
2314
+ "me_time",
2315
+ "sleep",
2316
+ "admin_work",
2317
+ "me_time",
2318
+ "meditate",
2319
+ "socialize",
2320
+ "meditate",
2321
+ "exercise",
2322
+ "meditate",
2323
+ "family_time",
2324
+ "deep_work",
2325
+ "deep_work",
2326
+ "socialize",
2327
+ "sleep",
2328
+ "me_time",
2329
+ "socialize",
2330
+ "family_time",
2331
+ "socialize",
2332
+ "meditate"
2333
+ ],
2334
+ "true_belief": [
2335
+ 0.302,
2336
+ 0.5,
2337
+ 0.693
2338
+ ],
2339
+ "final_belief": null,
2340
+ "belief_mae": null
2341
+ },
2342
+ {
2343
+ "condition": "continuous-in-distribution",
2344
+ "profile_name": "sampled_105",
2345
+ "profile_mode": "continuous",
2346
+ "strategy": "random",
2347
+ "seed": 105,
2348
+ "final_score": 0.468,
2349
+ "total_reward": -6.09,
2350
+ "adaptation": -0.579,
2351
+ "vitality": 0.12,
2352
+ "cognition": 0.01,
2353
+ "progress": 0.53,
2354
+ "serenity": 1.0,
2355
+ "connection": 0.99,
2356
+ "actions": [
2357
+ "learn",
2358
+ "binge_watch",
2359
+ "family_time",
2360
+ "me_time",
2361
+ "socialize",
2362
+ "admin_work",
2363
+ "deep_work",
2364
+ "meditate",
2365
+ "sleep",
2366
+ "socialize",
2367
+ "binge_watch",
2368
+ "learn",
2369
+ "binge_watch",
2370
+ "family_time",
2371
+ "admin_work",
2372
+ "deep_work",
2373
+ "socialize",
2374
+ "admin_work",
2375
+ "deep_work",
2376
+ "binge_watch",
2377
+ "socialize",
2378
+ "socialize",
2379
+ "learn",
2380
+ "meditate",
2381
+ "me_time",
2382
+ "family_time",
2383
+ "socialize",
2384
+ "exercise"
2385
+ ],
2386
+ "true_belief": [
2387
+ 0.458,
2388
+ 0.5,
2389
+ 0.315
2390
+ ],
2391
+ "final_belief": null,
2392
+ "belief_mae": null
2393
+ },
2394
+ {
2395
+ "condition": "continuous-in-distribution",
2396
+ "profile_name": "sampled_106",
2397
+ "profile_mode": "continuous",
2398
+ "strategy": "random",
2399
+ "seed": 106,
2400
+ "final_score": 0.527,
2401
+ "total_reward": -3.21,
2402
+ "adaptation": -0.268,
2403
+ "vitality": 0.24,
2404
+ "cognition": 0.0,
2405
+ "progress": 0.73,
2406
+ "serenity": 1.0,
2407
+ "connection": 0.92,
2408
+ "actions": [
2409
+ "deep_work",
2410
+ "meditate",
2411
+ "family_time",
2412
+ "meditate",
2413
+ "learn",
2414
+ "socialize",
2415
+ "socialize",
2416
+ "socialize",
2417
+ "learn",
2418
+ "learn",
2419
+ "exercise",
2420
+ "deep_work",
2421
+ "meditate",
2422
+ "family_time",
2423
+ "meditate",
2424
+ "admin_work",
2425
+ "family_time",
2426
+ "admin_work",
2427
+ "family_time",
2428
+ "me_time",
2429
+ "admin_work",
2430
+ "socialize",
2431
+ "sleep",
2432
+ "me_time",
2433
+ "deep_work",
2434
+ "me_time",
2435
+ "sleep",
2436
+ "learn"
2437
+ ],
2438
+ "true_belief": [
2439
+ 0.337,
2440
+ 0.5,
2441
+ 0.467
2442
+ ],
2443
+ "final_belief": null,
2444
+ "belief_mae": null
2445
+ },
2446
+ {
2447
+ "condition": "continuous-in-distribution",
2448
+ "profile_name": "sampled_107",
2449
+ "profile_mode": "continuous",
2450
+ "strategy": "random",
2451
+ "seed": 107,
2452
+ "final_score": 0.5088,
2453
+ "total_reward": -3.12,
2454
+ "adaptation": -0.481,
2455
+ "vitality": 0.03,
2456
+ "cognition": 0.0,
2457
+ "progress": 0.98,
2458
+ "serenity": 0.34,
2459
+ "connection": 0.36,
2460
+ "actions": [
2461
+ "admin_work",
2462
+ "admin_work",
2463
+ "learn",
2464
+ "meditate",
2465
+ "deep_work",
2466
+ "learn",
2467
+ "me_time",
2468
+ "sleep",
2469
+ "deep_work",
2470
+ "learn",
2471
+ "admin_work",
2472
+ "socialize",
2473
+ "deep_work",
2474
+ "socialize",
2475
+ "socialize",
2476
+ "exercise",
2477
+ "binge_watch",
2478
+ "binge_watch",
2479
+ "admin_work",
2480
+ "admin_work",
2481
+ "family_time",
2482
+ "sleep",
2483
+ "admin_work",
2484
+ "family_time",
2485
+ "deep_work",
2486
+ "deep_work",
2487
+ "me_time",
2488
+ "binge_watch"
2489
+ ],
2490
+ "true_belief": [
2491
+ 0.315,
2492
+ 0.376,
2493
+ 0.501
2494
+ ],
2495
+ "final_belief": null,
2496
+ "belief_mae": null
2497
+ },
2498
+ {
2499
+ "condition": "continuous-in-distribution",
2500
+ "profile_name": "sampled_108",
2501
+ "profile_mode": "continuous",
2502
+ "strategy": "random",
2503
+ "seed": 108,
2504
+ "final_score": 0.4869,
2505
+ "total_reward": 1.19,
2506
+ "adaptation": -0.472,
2507
+ "vitality": 0.0,
2508
+ "cognition": 0.0,
2509
+ "progress": 0.48,
2510
+ "serenity": 1.0,
2511
+ "connection": 1.0,
2512
+ "actions": [
2513
+ "learn",
2514
+ "sleep",
2515
+ "socialize",
2516
+ "binge_watch",
2517
+ "socialize",
2518
+ "me_time",
2519
+ "meditate",
2520
+ "learn",
2521
+ "meditate",
2522
+ "deep_work",
2523
+ "socialize",
2524
+ "family_time",
2525
+ "admin_work",
2526
+ "meditate",
2527
+ "family_time",
2528
+ "socialize",
2529
+ "binge_watch",
2530
+ "family_time",
2531
+ "binge_watch",
2532
+ "binge_watch",
2533
+ "learn",
2534
+ "socialize",
2535
+ "exercise",
2536
+ "learn",
2537
+ "family_time",
2538
+ "family_time",
2539
+ "deep_work",
2540
+ "socialize"
2541
+ ],
2542
+ "true_belief": [
2543
+ 0.741,
2544
+ 0.5,
2545
+ 0.143
2546
+ ],
2547
+ "final_belief": null,
2548
+ "belief_mae": null
2549
+ },
2550
+ {
2551
+ "condition": "continuous-in-distribution",
2552
+ "profile_name": "sampled_109",
2553
+ "profile_mode": "continuous",
2554
+ "strategy": "random",
2555
+ "seed": 109,
2556
+ "final_score": 0.5165,
2557
+ "total_reward": 2.09,
2558
+ "adaptation": -0.463,
2559
+ "vitality": 0.4,
2560
+ "cognition": 0.03,
2561
+ "progress": 0.76,
2562
+ "serenity": 0.83,
2563
+ "connection": 0.65,
2564
+ "actions": [
2565
+ "binge_watch",
2566
+ "exercise",
2567
+ "deep_work",
2568
+ "learn",
2569
+ "sleep",
2570
+ "sleep",
2571
+ "learn",
2572
+ "exercise",
2573
+ "deep_work",
2574
+ "deep_work",
2575
+ "me_time",
2576
+ "meditate",
2577
+ "socialize",
2578
+ "meditate",
2579
+ "socialize",
2580
+ "socialize",
2581
+ "sleep",
2582
+ "learn",
2583
+ "admin_work",
2584
+ "socialize",
2585
+ "socialize",
2586
+ "admin_work",
2587
+ "me_time",
2588
+ "exercise",
2589
+ "exercise",
2590
+ "exercise",
2591
+ "sleep",
2592
+ "binge_watch"
2593
+ ],
2594
+ "true_belief": [
2595
+ 0.46,
2596
+ 0.718,
2597
+ 0.273
2598
+ ],
2599
+ "final_belief": null,
2600
+ "belief_mae": null
2601
+ },
2602
+ {
2603
+ "condition": "continuous-OOD (generalization)",
2604
+ "profile_name": "sampled_10000",
2605
+ "profile_mode": "continuous",
2606
+ "strategy": "heuristic",
2607
+ "seed": 10000,
2608
+ "final_score": 0.5738,
2609
+ "total_reward": 7.69,
2610
+ "adaptation": -0.184,
2611
+ "vitality": 0.39,
2612
+ "cognition": 0.58,
2613
+ "progress": 1.0,
2614
+ "serenity": 1.0,
2615
+ "connection": 0.4,
2616
+ "actions": [
2617
+ "deep_work",
2618
+ "learn",
2619
+ "meditate",
2620
+ "sleep",
2621
+ "deep_work",
2622
+ "admin_work",
2623
+ "socialize",
2624
+ "sleep",
2625
+ "deep_work",
2626
+ "admin_work",
2627
+ "meditate",
2628
+ "sleep",
2629
+ "deep_work",
2630
+ "admin_work",
2631
+ "socialize",
2632
+ "sleep",
2633
+ "exercise",
2634
+ "admin_work",
2635
+ "socialize",
2636
+ "sleep",
2637
+ "deep_work",
2638
+ "admin_work",
2639
+ "socialize",
2640
+ "sleep",
2641
+ "exercise",
2642
+ "admin_work",
2643
+ "socialize",
2644
+ "sleep"
2645
+ ],
2646
+ "true_belief": [
2647
+ 0.304,
2648
+ 0.844,
2649
+ 0.388
2650
+ ],
2651
+ "final_belief": null,
2652
+ "belief_mae": null
2653
+ },
2654
+ {
2655
+ "condition": "continuous-OOD (generalization)",
2656
+ "profile_name": "sampled_10001",
2657
+ "profile_mode": "continuous",
2658
+ "strategy": "heuristic",
2659
+ "seed": 10001,
2660
+ "final_score": 0.5621,
2661
+ "total_reward": 3.59,
2662
+ "adaptation": -0.04,
2663
+ "vitality": 1.0,
2664
+ "cognition": 1.0,
2665
+ "progress": 1.0,
2666
+ "serenity": 0.93,
2667
+ "connection": 0.38,
2668
+ "actions": [
2669
+ "deep_work",
2670
+ "learn",
2671
+ "meditate",
2672
+ "sleep",
2673
+ "deep_work",
2674
+ "learn",
2675
+ "meditate",
2676
+ "sleep",
2677
+ "deep_work",
2678
+ "admin_work",
2679
+ "meditate",
2680
+ "sleep",
2681
+ "deep_work",
2682
+ "admin_work",
2683
+ "meditate",
2684
+ "sleep",
2685
+ "deep_work",
2686
+ "admin_work",
2687
+ "socialize",
2688
+ "sleep",
2689
+ "deep_work",
2690
+ "admin_work",
2691
+ "meditate",
2692
+ "sleep",
2693
+ "deep_work",
2694
+ "admin_work",
2695
+ "meditate",
2696
+ "sleep"
2697
+ ],
2698
+ "true_belief": [
2699
+ 0.272,
2700
+ 0.291,
2701
+ 0.323
2702
+ ],
2703
+ "final_belief": null,
2704
+ "belief_mae": null
2705
+ },
2706
+ {
2707
+ "condition": "continuous-OOD (generalization)",
2708
+ "profile_name": "sampled_10002",
2709
+ "profile_mode": "continuous",
2710
+ "strategy": "heuristic",
2711
+ "seed": 10002,
2712
+ "final_score": 0.6146,
2713
+ "total_reward": 7.08,
2714
+ "adaptation": 0.158,
2715
+ "vitality": 0.79,
2716
+ "cognition": 0.33,
2717
+ "progress": 1.0,
2718
+ "serenity": 1.0,
2719
+ "connection": 0.47,
2720
+ "actions": [
2721
+ "deep_work",
2722
+ "learn",
2723
+ "meditate",
2724
+ "sleep",
2725
+ "deep_work",
2726
+ "admin_work",
2727
+ "meditate",
2728
+ "sleep",
2729
+ "deep_work",
2730
+ "admin_work",
2731
+ "socialize",
2732
+ "sleep",
2733
+ "exercise",
2734
+ "admin_work",
2735
+ "meditate",
2736
+ "sleep",
2737
+ "deep_work",
2738
+ "admin_work",
2739
+ "socialize",
2740
+ "sleep",
2741
+ "exercise",
2742
+ "admin_work",
2743
+ "meditate",
2744
+ "sleep",
2745
+ "exercise",
2746
+ "admin_work",
2747
+ "socialize",
2748
+ "sleep"
2749
+ ],
2750
+ "true_belief": [
2751
+ 0.369,
2752
+ 0.899,
2753
+ 0.069
2754
+ ],
2755
+ "final_belief": null,
2756
+ "belief_mae": null
2757
+ },
2758
+ {
2759
+ "condition": "continuous-OOD (generalization)",
2760
+ "profile_name": "sampled_10003",
2761
+ "profile_mode": "continuous",
2762
+ "strategy": "heuristic",
2763
+ "seed": 10003,
2764
+ "final_score": 0.5761,
2765
+ "total_reward": 11.98,
2766
+ "adaptation": -0.053,
2767
+ "vitality": 1.0,
2768
+ "cognition": 0.43,
2769
+ "progress": 0.98,
2770
+ "serenity": 1.0,
2771
+ "connection": 0.41,
2772
+ "actions": [
2773
+ "deep_work",
2774
+ "learn",
2775
+ "meditate",
2776
+ "sleep",
2777
+ "deep_work",
2778
+ "learn",
2779
+ "meditate",
2780
+ "sleep",
2781
+ "deep_work",
2782
+ "admin_work",
2783
+ "socialize",
2784
+ "sleep",
2785
+ "deep_work",
2786
+ "admin_work",
2787
+ "meditate",
2788
+ "sleep",
2789
+ "deep_work",
2790
+ "admin_work",
2791
+ "meditate",
2792
+ "sleep",
2793
+ "exercise",
2794
+ "admin_work",
2795
+ "meditate",
2796
+ "sleep",
2797
+ "deep_work",
2798
+ "admin_work",
2799
+ "meditate",
2800
+ "sleep"
2801
+ ],
2802
+ "true_belief": [
2803
+ 0.388,
2804
+ 0.532,
2805
+ 0.55
2806
+ ],
2807
+ "final_belief": null,
2808
+ "belief_mae": null
2809
+ },
2810
+ {
2811
+ "condition": "continuous-OOD (generalization)",
2812
+ "profile_name": "sampled_10004",
2813
+ "profile_mode": "continuous",
2814
+ "strategy": "heuristic",
2815
+ "seed": 10004,
2816
+ "final_score": 0.5789,
2817
+ "total_reward": 5.48,
2818
+ "adaptation": -0.104,
2819
+ "vitality": 0.4,
2820
+ "cognition": 0.2,
2821
+ "progress": 1.0,
2822
+ "serenity": 1.0,
2823
+ "connection": 0.46,
2824
+ "actions": [
2825
+ "deep_work",
2826
+ "learn",
2827
+ "meditate",
2828
+ "sleep",
2829
+ "deep_work",
2830
+ "admin_work",
2831
+ "meditate",
2832
+ "sleep",
2833
+ "deep_work",
2834
+ "admin_work",
2835
+ "meditate",
2836
+ "sleep",
2837
+ "deep_work",
2838
+ "admin_work",
2839
+ "socialize",
2840
+ "sleep",
2841
+ "deep_work",
2842
+ "admin_work",
2843
+ "meditate",
2844
+ "sleep",
2845
+ "deep_work",
2846
+ "admin_work",
2847
+ "meditate",
2848
+ "sleep",
2849
+ "deep_work",
2850
+ "admin_work",
2851
+ "socialize",
2852
+ "sleep"
2853
+ ],
2854
+ "true_belief": [
2855
+ 0.586,
2856
+ 0.812,
2857
+ 0.436
2858
+ ],
2859
+ "final_belief": null,
2860
+ "belief_mae": null
2861
+ },
2862
+ {
2863
+ "condition": "continuous-OOD (generalization)",
2864
+ "profile_name": "sampled_10005",
2865
+ "profile_mode": "continuous",
2866
+ "strategy": "heuristic",
2867
+ "seed": 10005,
2868
+ "final_score": 0.5806,
2869
+ "total_reward": 9.41,
2870
+ "adaptation": 0.055,
2871
+ "vitality": 0.58,
2872
+ "cognition": 0.44,
2873
+ "progress": 0.95,
2874
+ "serenity": 0.97,
2875
+ "connection": 0.43,
2876
+ "actions": [
2877
+ "deep_work",
2878
+ "learn",
2879
+ "meditate",
2880
+ "sleep",
2881
+ "deep_work",
2882
+ "admin_work",
2883
+ "meditate",
2884
+ "sleep",
2885
+ "deep_work",
2886
+ "admin_work",
2887
+ "socialize",
2888
+ "sleep",
2889
+ "exercise",
2890
+ "admin_work",
2891
+ "socialize",
2892
+ "sleep",
2893
+ "exercise",
2894
+ "admin_work",
2895
+ "meditate",
2896
+ "sleep",
2897
+ "deep_work",
2898
+ "admin_work",
2899
+ "socialize",
2900
+ "sleep",
2901
+ "exercise",
2902
+ "admin_work",
2903
+ "meditate",
2904
+ "sleep"
2905
+ ],
2906
+ "true_belief": [
2907
+ 0.273,
2908
+ 0.5,
2909
+ 0.524
2910
+ ],
2911
+ "final_belief": null,
2912
+ "belief_mae": null
2913
+ },
2914
+ {
2915
+ "condition": "continuous-OOD (generalization)",
2916
+ "profile_name": "sampled_10006",
2917
+ "profile_mode": "continuous",
2918
+ "strategy": "heuristic",
2919
+ "seed": 10006,
2920
+ "final_score": 0.5823,
2921
+ "total_reward": 7.59,
2922
+ "adaptation": -0.385,
2923
+ "vitality": 0.32,
2924
+ "cognition": 0.13,
2925
+ "progress": 1.0,
2926
+ "serenity": 0.96,
2927
+ "connection": 0.47,
2928
+ "actions": [
2929
+ "deep_work",
2930
+ "learn",
2931
+ "meditate",
2932
+ "sleep",
2933
+ "deep_work",
2934
+ "admin_work",
2935
+ "meditate",
2936
+ "sleep",
2937
+ "deep_work",
2938
+ "admin_work",
2939
+ "meditate",
2940
+ "sleep",
2941
+ "deep_work",
2942
+ "admin_work",
2943
+ "meditate",
2944
+ "sleep",
2945
+ "deep_work",
2946
+ "admin_work",
2947
+ "meditate",
2948
+ "sleep",
2949
+ "deep_work",
2950
+ "admin_work",
2951
+ "meditate",
2952
+ "sleep",
2953
+ "deep_work",
2954
+ "admin_work",
2955
+ "socialize",
2956
+ "sleep"
2957
+ ],
2958
+ "true_belief": [
2959
+ 0.546,
2960
+ 0.798,
2961
+ 0.413
2962
+ ],
2963
+ "final_belief": null,
2964
+ "belief_mae": null
2965
+ },
2966
+ {
2967
+ "condition": "continuous-OOD (generalization)",
2968
+ "profile_name": "sampled_10007",
2969
+ "profile_mode": "continuous",
2970
+ "strategy": "heuristic",
2971
+ "seed": 10007,
2972
+ "final_score": 0.5894,
2973
+ "total_reward": 6.29,
2974
+ "adaptation": -0.086,
2975
+ "vitality": 0.67,
2976
+ "cognition": 0.78,
2977
+ "progress": 1.0,
2978
+ "serenity": 1.0,
2979
+ "connection": 0.52,
2980
+ "actions": [
2981
+ "deep_work",
2982
+ "learn",
2983
+ "meditate",
2984
+ "sleep",
2985
+ "deep_work",
2986
+ "admin_work",
2987
+ "meditate",
2988
+ "sleep",
2989
+ "deep_work",
2990
+ "admin_work",
2991
+ "socialize",
2992
+ "sleep",
2993
+ "deep_work",
2994
+ "admin_work",
2995
+ "meditate",
2996
+ "sleep",
2997
+ "deep_work",
2998
+ "admin_work",
2999
+ "meditate",
3000
+ "sleep",
3001
+ "deep_work",
3002
+ "admin_work",
3003
+ "meditate",
3004
+ "sleep",
3005
+ "deep_work",
3006
+ "admin_work",
3007
+ "socialize",
3008
+ "sleep"
3009
+ ],
3010
+ "true_belief": [
3011
+ 0.567,
3012
+ 0.5,
3013
+ 0.172
3014
+ ],
3015
+ "final_belief": null,
3016
+ "belief_mae": null
3017
+ },
3018
+ {
3019
+ "condition": "continuous-OOD (generalization)",
3020
+ "profile_name": "sampled_10008",
3021
+ "profile_mode": "continuous",
3022
+ "strategy": "heuristic",
3023
+ "seed": 10008,
3024
+ "final_score": 0.5752,
3025
+ "total_reward": 7.39,
3026
+ "adaptation": -0.164,
3027
+ "vitality": 1.0,
3028
+ "cognition": 0.44,
3029
+ "progress": 1.0,
3030
+ "serenity": 1.0,
3031
+ "connection": 0.41,
3032
+ "actions": [
3033
+ "deep_work",
3034
+ "learn",
3035
+ "meditate",
3036
+ "sleep",
3037
+ "deep_work",
3038
+ "admin_work",
3039
+ "socialize",
3040
+ "sleep",
3041
+ "deep_work",
3042
+ "admin_work",
3043
+ "meditate",
3044
+ "sleep",
3045
+ "deep_work",
3046
+ "admin_work",
3047
+ "meditate",
3048
+ "sleep",
3049
+ "deep_work",
3050
+ "admin_work",
3051
+ "socialize",
3052
+ "sleep",
3053
+ "exercise",
3054
+ "admin_work",
3055
+ "meditate",
3056
+ "sleep",
3057
+ "deep_work",
3058
+ "admin_work",
3059
+ "meditate",
3060
+ "sleep"
3061
+ ],
3062
+ "true_belief": [
3063
+ 0.844,
3064
+ 0.307,
3065
+ 0.464
3066
+ ],
3067
+ "final_belief": null,
3068
+ "belief_mae": null
3069
+ },
3070
+ {
3071
+ "condition": "continuous-OOD (generalization)",
3072
+ "profile_name": "sampled_10009",
3073
+ "profile_mode": "continuous",
3074
+ "strategy": "heuristic",
3075
+ "seed": 10009,
3076
+ "final_score": 0.5636,
3077
+ "total_reward": 4.99,
3078
+ "adaptation": -0.023,
3079
+ "vitality": 0.89,
3080
+ "cognition": 0.43,
3081
+ "progress": 0.98,
3082
+ "serenity": 1.0,
3083
+ "connection": 0.4,
3084
+ "actions": [
3085
+ "deep_work",
3086
+ "learn",
3087
+ "meditate",
3088
+ "sleep",
3089
+ "deep_work",
3090
+ "admin_work",
3091
+ "meditate",
3092
+ "sleep",
3093
+ "deep_work",
3094
+ "admin_work",
3095
+ "socialize",
3096
+ "sleep",
3097
+ "deep_work",
3098
+ "admin_work",
3099
+ "meditate",
3100
+ "sleep",
3101
+ "deep_work",
3102
+ "admin_work",
3103
+ "meditate",
3104
+ "sleep",
3105
+ "deep_work",
3106
+ "admin_work",
3107
+ "meditate",
3108
+ "sleep",
3109
+ "exercise",
3110
+ "admin_work",
3111
+ "meditate",
3112
+ "sleep"
3113
+ ],
3114
+ "true_belief": [
3115
+ 0.333,
3116
+ 0.383,
3117
+ 0.498
3118
+ ],
3119
+ "final_belief": null,
3120
+ "belief_mae": null
3121
+ },
3122
+ {
3123
+ "condition": "continuous-OOD (generalization)",
3124
+ "profile_name": "sampled_10000",
3125
+ "profile_mode": "continuous",
3126
+ "strategy": "random",
3127
+ "seed": 10000,
3128
+ "final_score": 0.5259,
3129
+ "total_reward": 3.27,
3130
+ "adaptation": -0.098,
3131
+ "vitality": 0.3,
3132
+ "cognition": 0.29,
3133
+ "progress": 1.0,
3134
+ "serenity": 1.0,
3135
+ "connection": 0.23,
3136
+ "actions": [
3137
+ "binge_watch",
3138
+ "binge_watch",
3139
+ "deep_work",
3140
+ "learn",
3141
+ "meditate",
3142
+ "deep_work",
3143
+ "socialize",
3144
+ "sleep",
3145
+ "meditate",
3146
+ "admin_work",
3147
+ "admin_work",
3148
+ "deep_work",
3149
+ "me_time",
3150
+ "binge_watch",
3151
+ "me_time",
3152
+ "exercise",
3153
+ "sleep",
3154
+ "exercise",
3155
+ "binge_watch",
3156
+ "meditate",
3157
+ "learn",
3158
+ "sleep",
3159
+ "exercise",
3160
+ "socialize",
3161
+ "socialize",
3162
+ "sleep",
3163
+ "family_time",
3164
+ "deep_work"
3165
+ ],
3166
+ "true_belief": [
3167
+ 0.304,
3168
+ 0.844,
3169
+ 0.388
3170
+ ],
3171
+ "final_belief": null,
3172
+ "belief_mae": null
3173
+ },
3174
+ {
3175
+ "condition": "continuous-OOD (generalization)",
3176
+ "profile_name": "sampled_10001",
3177
+ "profile_mode": "continuous",
3178
+ "strategy": "random",
3179
+ "seed": 10001,
3180
+ "final_score": 0.5465,
3181
+ "total_reward": 3.05,
3182
+ "adaptation": -0.102,
3183
+ "vitality": 0.87,
3184
+ "cognition": 0.2,
3185
+ "progress": 0.95,
3186
+ "serenity": 0.98,
3187
+ "connection": 0.39,
3188
+ "actions": [
3189
+ "me_time",
3190
+ "me_time",
3191
+ "exercise",
3192
+ "learn",
3193
+ "socialize",
3194
+ "family_time",
3195
+ "family_time",
3196
+ "me_time",
3197
+ "binge_watch",
3198
+ "binge_watch",
3199
+ "binge_watch",
3200
+ "learn",
3201
+ "sleep",
3202
+ "deep_work",
3203
+ "sleep",
3204
+ "binge_watch",
3205
+ "meditate",
3206
+ "admin_work",
3207
+ "me_time",
3208
+ "admin_work",
3209
+ "sleep",
3210
+ "deep_work",
3211
+ "meditate",
3212
+ "deep_work",
3213
+ "exercise",
3214
+ "admin_work",
3215
+ "binge_watch",
3216
+ "exercise"
3217
+ ],
3218
+ "true_belief": [
3219
+ 0.272,
3220
+ 0.291,
3221
+ 0.323
3222
+ ],
3223
+ "final_belief": null,
3224
+ "belief_mae": null
3225
+ },
3226
+ {
3227
+ "condition": "continuous-OOD (generalization)",
3228
+ "profile_name": "sampled_10002",
3229
+ "profile_mode": "continuous",
3230
+ "strategy": "random",
3231
+ "seed": 10002,
3232
+ "final_score": 0.4923,
3233
+ "total_reward": -2.3,
3234
+ "adaptation": -0.662,
3235
+ "vitality": 0.15,
3236
+ "cognition": 0.02,
3237
+ "progress": 0.87,
3238
+ "serenity": 1.0,
3239
+ "connection": 0.43,
3240
+ "actions": [
3241
+ "socialize",
3242
+ "exercise",
3243
+ "deep_work",
3244
+ "learn",
3245
+ "exercise",
3246
+ "exercise",
3247
+ "meditate",
3248
+ "admin_work",
3249
+ "meditate",
3250
+ "exercise",
3251
+ "admin_work",
3252
+ "exercise",
3253
+ "binge_watch",
3254
+ "family_time",
3255
+ "admin_work",
3256
+ "deep_work",
3257
+ "deep_work",
3258
+ "socialize",
3259
+ "socialize",
3260
+ "meditate",
3261
+ "deep_work",
3262
+ "me_time",
3263
+ "admin_work",
3264
+ "me_time",
3265
+ "deep_work",
3266
+ "learn",
3267
+ "me_time",
3268
+ "exercise"
3269
+ ],
3270
+ "true_belief": [
3271
+ 0.369,
3272
+ 0.899,
3273
+ 0.069
3274
+ ],
3275
+ "final_belief": null,
3276
+ "belief_mae": null
3277
+ },
3278
+ {
3279
+ "condition": "continuous-OOD (generalization)",
3280
+ "profile_name": "sampled_10003",
3281
+ "profile_mode": "continuous",
3282
+ "strategy": "random",
3283
+ "seed": 10003,
3284
+ "final_score": 0.3943,
3285
+ "total_reward": -7.11,
3286
+ "adaptation": -0.261,
3287
+ "vitality": 0.13,
3288
+ "cognition": 0.0,
3289
+ "progress": 0.25,
3290
+ "serenity": 0.86,
3291
+ "connection": 0.95,
3292
+ "actions": [
3293
+ "socialize",
3294
+ "family_time",
3295
+ "socialize",
3296
+ "deep_work",
3297
+ "socialize",
3298
+ "socialize",
3299
+ "me_time",
3300
+ "admin_work",
3301
+ "deep_work",
3302
+ "family_time",
3303
+ "admin_work",
3304
+ "family_time",
3305
+ "me_time",
3306
+ "sleep",
3307
+ "family_time",
3308
+ "family_time",
3309
+ "family_time",
3310
+ "learn",
3311
+ "exercise",
3312
+ "binge_watch",
3313
+ "me_time",
3314
+ "family_time",
3315
+ "binge_watch",
3316
+ "family_time",
3317
+ "learn",
3318
+ "binge_watch",
3319
+ "sleep",
3320
+ "learn"
3321
+ ],
3322
+ "true_belief": [
3323
+ 0.388,
3324
+ 0.532,
3325
+ 0.55
3326
+ ],
3327
+ "final_belief": null,
3328
+ "belief_mae": null
3329
+ },
3330
+ {
3331
+ "condition": "continuous-OOD (generalization)",
3332
+ "profile_name": "sampled_10004",
3333
+ "profile_mode": "continuous",
3334
+ "strategy": "random",
3335
+ "seed": 10004,
3336
+ "final_score": 0.527,
3337
+ "total_reward": 6.44,
3338
+ "adaptation": -0.077,
3339
+ "vitality": 0.84,
3340
+ "cognition": 0.93,
3341
+ "progress": 0.59,
3342
+ "serenity": 1.0,
3343
+ "connection": 0.83,
3344
+ "actions": [
3345
+ "meditate",
3346
+ "sleep",
3347
+ "me_time",
3348
+ "exercise",
3349
+ "me_time",
3350
+ "deep_work",
3351
+ "socialize",
3352
+ "me_time",
3353
+ "me_time",
3354
+ "family_time",
3355
+ "sleep",
3356
+ "admin_work",
3357
+ "exercise",
3358
+ "admin_work",
3359
+ "family_time",
3360
+ "me_time",
3361
+ "sleep",
3362
+ "meditate",
3363
+ "exercise",
3364
+ "admin_work",
3365
+ "me_time",
3366
+ "admin_work",
3367
+ "exercise",
3368
+ "family_time",
3369
+ "sleep",
3370
+ "meditate",
3371
+ "deep_work",
3372
+ "meditate"
3373
+ ],
3374
+ "true_belief": [
3375
+ 0.586,
3376
+ 0.812,
3377
+ 0.436
3378
+ ],
3379
+ "final_belief": null,
3380
+ "belief_mae": null
3381
+ },
3382
+ {
3383
+ "condition": "continuous-OOD (generalization)",
3384
+ "profile_name": "sampled_10005",
3385
+ "profile_mode": "continuous",
3386
+ "strategy": "random",
3387
+ "seed": 10005,
3388
+ "final_score": 0.4318,
3389
+ "total_reward": -4.79,
3390
+ "adaptation": -0.17,
3391
+ "vitality": 0.03,
3392
+ "cognition": 0.02,
3393
+ "progress": 0.65,
3394
+ "serenity": 1.0,
3395
+ "connection": 0.44,
3396
+ "actions": [
3397
+ "sleep",
3398
+ "family_time",
3399
+ "meditate",
3400
+ "family_time",
3401
+ "meditate",
3402
+ "binge_watch",
3403
+ "admin_work",
3404
+ "deep_work",
3405
+ "learn",
3406
+ "socialize",
3407
+ "exercise",
3408
+ "admin_work",
3409
+ "family_time",
3410
+ "deep_work",
3411
+ "me_time",
3412
+ "meditate",
3413
+ "exercise",
3414
+ "me_time",
3415
+ "deep_work",
3416
+ "exercise",
3417
+ "me_time",
3418
+ "deep_work",
3419
+ "admin_work",
3420
+ "binge_watch",
3421
+ "binge_watch",
3422
+ "binge_watch",
3423
+ "deep_work",
3424
+ "meditate"
3425
+ ],
3426
+ "true_belief": [
3427
+ 0.273,
3428
+ 0.5,
3429
+ 0.524
3430
+ ],
3431
+ "final_belief": null,
3432
+ "belief_mae": null
3433
+ },
3434
+ {
3435
+ "condition": "continuous-OOD (generalization)",
3436
+ "profile_name": "sampled_10006",
3437
+ "profile_mode": "continuous",
3438
+ "strategy": "random",
3439
+ "seed": 10006,
3440
+ "final_score": 0.4614,
3441
+ "total_reward": -5.54,
3442
+ "adaptation": -0.802,
3443
+ "vitality": 0.04,
3444
+ "cognition": 0.0,
3445
+ "progress": 0.6,
3446
+ "serenity": 0.26,
3447
+ "connection": 0.72,
3448
+ "actions": [
3449
+ "me_time",
3450
+ "socialize",
3451
+ "me_time",
3452
+ "learn",
3453
+ "learn",
3454
+ "sleep",
3455
+ "deep_work",
3456
+ "family_time",
3457
+ "binge_watch",
3458
+ "me_time",
3459
+ "deep_work",
3460
+ "admin_work",
3461
+ "exercise",
3462
+ "learn",
3463
+ "meditate",
3464
+ "binge_watch",
3465
+ "binge_watch",
3466
+ "socialize",
3467
+ "deep_work",
3468
+ "deep_work",
3469
+ "socialize",
3470
+ "deep_work",
3471
+ "binge_watch",
3472
+ "binge_watch",
3473
+ "family_time",
3474
+ "binge_watch",
3475
+ "exercise",
3476
+ "socialize"
3477
+ ],
3478
+ "true_belief": [
3479
+ 0.546,
3480
+ 0.798,
3481
+ 0.413
3482
+ ],
3483
+ "final_belief": null,
3484
+ "belief_mae": null
3485
+ },
3486
+ {
3487
+ "condition": "continuous-OOD (generalization)",
3488
+ "profile_name": "sampled_10007",
3489
+ "profile_mode": "continuous",
3490
+ "strategy": "random",
3491
+ "seed": 10007,
3492
+ "final_score": 0.6107,
3493
+ "total_reward": 3.38,
3494
+ "adaptation": 0.0,
3495
+ "vitality": 0.01,
3496
+ "cognition": 0.09,
3497
+ "progress": 0.87,
3498
+ "serenity": 1.0,
3499
+ "connection": 1.0,
3500
+ "actions": [
3501
+ "family_time",
3502
+ "me_time",
3503
+ "binge_watch",
3504
+ "me_time",
3505
+ "socialize",
3506
+ "sleep",
3507
+ "family_time",
3508
+ "admin_work",
3509
+ "exercise",
3510
+ "learn",
3511
+ "deep_work",
3512
+ "me_time",
3513
+ "binge_watch",
3514
+ "admin_work",
3515
+ "admin_work",
3516
+ "learn",
3517
+ "me_time",
3518
+ "learn",
3519
+ "exercise",
3520
+ "me_time",
3521
+ "family_time",
3522
+ "family_time",
3523
+ "sleep",
3524
+ "me_time",
3525
+ "deep_work",
3526
+ "family_time",
3527
+ "family_time",
3528
+ "socialize"
3529
+ ],
3530
+ "true_belief": [
3531
+ 0.567,
3532
+ 0.5,
3533
+ 0.172
3534
+ ],
3535
+ "final_belief": null,
3536
+ "belief_mae": null
3537
+ },
3538
+ {
3539
+ "condition": "continuous-OOD (generalization)",
3540
+ "profile_name": "sampled_10008",
3541
+ "profile_mode": "continuous",
3542
+ "strategy": "random",
3543
+ "seed": 10008,
3544
+ "final_score": 0.5869,
3545
+ "total_reward": 6.77,
3546
+ "adaptation": 0.138,
3547
+ "vitality": 0.54,
3548
+ "cognition": 0.16,
3549
+ "progress": 0.7,
3550
+ "serenity": 1.0,
3551
+ "connection": 0.85,
3552
+ "actions": [
3553
+ "exercise",
3554
+ "exercise",
3555
+ "exercise",
3556
+ "learn",
3557
+ "admin_work",
3558
+ "learn",
3559
+ "deep_work",
3560
+ "me_time",
3561
+ "binge_watch",
3562
+ "exercise",
3563
+ "socialize",
3564
+ "meditate",
3565
+ "socialize",
3566
+ "exercise",
3567
+ "meditate",
3568
+ "meditate",
3569
+ "exercise",
3570
+ "socialize",
3571
+ "learn",
3572
+ "socialize",
3573
+ "meditate",
3574
+ "binge_watch",
3575
+ "sleep",
3576
+ "meditate",
3577
+ "deep_work",
3578
+ "family_time",
3579
+ "deep_work",
3580
+ "exercise"
3581
+ ],
3582
+ "true_belief": [
3583
+ 0.844,
3584
+ 0.307,
3585
+ 0.464
3586
+ ],
3587
+ "final_belief": null,
3588
+ "belief_mae": null
3589
+ },
3590
+ {
3591
+ "condition": "continuous-OOD (generalization)",
3592
+ "profile_name": "sampled_10009",
3593
+ "profile_mode": "continuous",
3594
+ "strategy": "random",
3595
+ "seed": 10009,
3596
+ "final_score": 0.5031,
3597
+ "total_reward": 0.74,
3598
+ "adaptation": -0.142,
3599
+ "vitality": 0.06,
3600
+ "cognition": 0.26,
3601
+ "progress": 0.53,
3602
+ "serenity": 1.0,
3603
+ "connection": 0.9,
3604
+ "actions": [
3605
+ "sleep",
3606
+ "socialize",
3607
+ "sleep",
3608
+ "admin_work",
3609
+ "me_time",
3610
+ "learn",
3611
+ "exercise",
3612
+ "deep_work",
3613
+ "learn",
3614
+ "exercise",
3615
+ "socialize",
3616
+ "me_time",
3617
+ "learn",
3618
+ "family_time",
3619
+ "sleep",
3620
+ "sleep",
3621
+ "socialize",
3622
+ "deep_work",
3623
+ "socialize",
3624
+ "exercise",
3625
+ "me_time",
3626
+ "me_time",
3627
+ "exercise",
3628
+ "family_time",
3629
+ "binge_watch",
3630
+ "admin_work",
3631
+ "me_time",
3632
+ "me_time"
3633
+ ],
3634
+ "true_belief": [
3635
+ 0.333,
3636
+ 0.383,
3637
+ 0.498
3638
+ ],
3639
+ "final_belief": null,
3640
+ "belief_mae": null
3641
+ }
3642
+ ]
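
Each record above is one evaluation episode with a shared schema: condition, strategy, seed, final_score, adaptation, per-meter scores, the action trace, and the belief fields. A minimal sketch of rolling these records up per (condition, strategy) to recover aggregate baselines such as the heuristic in-dist 0.587 / OOD 0.580 bars; standard library only, and the relative file path is an assumption:

```python
import json
from collections import defaultdict
from statistics import mean

# One episode per record; group by evaluation condition and baseline strategy.
with open("eval_baselines_meta.json") as f:
    episodes = json.load(f)

groups = defaultdict(list)
for ep in episodes:
    groups[(ep["condition"], ep["strategy"])].append(ep)

for (condition, strategy), eps in sorted(groups.items()):
    score = mean(ep["final_score"] for ep in eps)
    adapt = mean(ep["adaptation"] for ep in eps)
    # belief_mae is null for the scripted baselines, so only average it when present
    maes = [ep["belief_mae"] for ep in eps if ep["belief_mae"] is not None]
    mae = f"{mean(maes):.3f}" if maes else "n/a"
    print(f"{condition:35s} {strategy:9s} score={score:.3f} "
          f"adaptation={adapt:+.3f} belief_mae={mae} n={len(eps)}")
```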
eval_results_v1.json ADDED
@@ -0,0 +1,758 @@
+ [
+   {"profile": "introvert_morning", "strategy": "heuristic", "seed": 0, "final_score": 0.7696, "total_reward": 5.84, "vitality": 0.76, "cognition": 0.41, "progress": 1.0, "serenity": 1.0, "connection": 0.43, "actions": ["deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "heuristic", "seed": 1, "final_score": 0.7526, "total_reward": 6.88, "vitality": 0.64, "cognition": 0.33, "progress": 1.0, "serenity": 0.87, "connection": 0.43, "actions": ["deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "heuristic", "seed": 2, "final_score": 0.7723, "total_reward": 8.97, "vitality": 0.84, "cognition": 0.44, "progress": 1.0, "serenity": 1.0, "connection": 0.39, "actions": ["deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "exercise", "admin_work", "meditate", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "heuristic", "seed": 0, "final_score": 0.8197, "total_reward": 4.21, "vitality": 1.0, "cognition": 1.0, "progress": 1.0, "serenity": 1.0, "connection": 0.43, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "heuristic", "seed": 1, "final_score": 0.8209, "total_reward": 4.59, "vitality": 1.0, "cognition": 1.0, "progress": 0.97, "serenity": 1.0, "connection": 0.46, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "heuristic", "seed": 2, "final_score": 0.8164, "total_reward": 5.85, "vitality": 0.97, "cognition": 0.82, "progress": 0.93, "serenity": 1.0, "connection": 0.53, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "meditate", "sleep", "deep_work", "learn", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep"]},
+   {"profile": "workaholic_stoic", "strategy": "heuristic", "seed": 0, "final_score": 0.7461, "total_reward": 11.98, "vitality": 0.59, "cognition": 0.23, "progress": 1.0, "serenity": 0.95, "connection": 0.41, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep"]},
+   {"profile": "workaholic_stoic", "strategy": "heuristic", "seed": 1, "final_score": 0.7585, "total_reward": 12.67, "vitality": 0.72, "cognition": 0.32, "progress": 1.0, "serenity": 0.95, "connection": 0.38, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep"]},
+   {"profile": "workaholic_stoic", "strategy": "heuristic", "seed": 2, "final_score": 0.7782, "total_reward": 13.16, "vitality": 0.74, "cognition": 0.34, "progress": 1.0, "serenity": 0.95, "connection": 0.44, "actions": ["deep_work", "learn", "meditate", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "deep_work", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep", "exercise", "admin_work", "socialize", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "random", "seed": 0, "final_score": 0.7141, "total_reward": 1.82, "vitality": 0.45, "cognition": 0.56, "progress": 0.66, "serenity": 1.0, "connection": 0.71, "actions": ["socialize", "me_time", "binge_watch", "socialize", "exercise", "family_time", "sleep", "admin_work", "meditate", "binge_watch", "admin_work", "deep_work", "sleep", "meditate", "sleep", "family_time", "exercise", "deep_work", "admin_work", "meditate", "socialize", "binge_watch", "exercise", "meditate", "learn", "socialize", "admin_work", "sleep"]},
+   {"profile": "introvert_morning", "strategy": "random", "seed": 1, "final_score": 0.6924, "total_reward": 4.02, "vitality": 0.08, "cognition": 0.5, "progress": 0.76, "serenity": 0.97, "connection": 0.68, "actions": ["exercise", "meditate", "sleep", "meditate", "meditate", "me_time", "meditate", "learn", "meditate", "meditate", "socialize", "socialize", "deep_work", "meditate", "socialize", "deep_work", "meditate", "sleep", "learn", "socialize", "deep_work", "socialize", "learn", "sleep", "family_time", "meditate", "meditate", "admin_work"]},
+   {"profile": "introvert_morning", "strategy": "random", "seed": 2, "final_score": 0.6715, "total_reward": 6.12, "vitality": 0.61, "cognition": 0.22, "progress": 1.0, "serenity": 0.86, "connection": 0.17, "actions": ["me_time", "meditate", "learn", "meditate", "learn", "family_time", "deep_work", "family_time", "me_time", "admin_work", "sleep", "meditate", "sleep", "admin_work", "meditate", "me_time", "sleep", "sleep", "binge_watch", "admin_work", "deep_work", "admin_work", "admin_work", "binge_watch", "learn", "sleep", "me_time", "deep_work"]},
+   {"profile": "extrovert_night_owl", "strategy": "random", "seed": 0, "final_score": 0.9368, "total_reward": 8.5, "vitality": 1.0, "cognition": 0.83, "progress": 1.0, "serenity": 1.0, "connection": 0.98, "actions": ["socialize", "me_time", "binge_watch", "socialize", "exercise", "family_time", "sleep", "admin_work", "meditate", "binge_watch", "admin_work", "deep_work", "sleep", "meditate", "sleep", "family_time", "exercise", "deep_work", "admin_work", "meditate", "socialize", "binge_watch", "exercise", "meditate", "learn", "socialize", "admin_work", "sleep"]},
+   {"profile": "extrovert_night_owl", "strategy": "random", "seed": 1, "final_score": 0.9054, "total_reward": 8.2, "vitality": 0.75, "cognition": 0.69, "progress": 1.0, "serenity": 0.97, "connection": 0.97, "actions": ["exercise", "meditate", "sleep", "meditate", "meditate",
565
+ "me_time",
566
+ "meditate",
567
+ "learn",
568
+ "meditate",
569
+ "meditate",
570
+ "socialize",
571
+ "socialize",
572
+ "deep_work",
573
+ "meditate",
574
+ "socialize",
575
+ "deep_work",
576
+ "meditate",
577
+ "sleep",
578
+ "learn",
579
+ "socialize",
580
+ "deep_work",
581
+ "socialize",
582
+ "learn",
583
+ "sleep",
584
+ "family_time",
585
+ "meditate",
586
+ "meditate",
587
+ "admin_work"
588
+ ]
589
+ },
590
+ {
591
+ "profile": "extrovert_night_owl",
592
+ "strategy": "random",
593
+ "seed": 2,
594
+ "final_score": 0.7462,
595
+ "total_reward": 4.12,
596
+ "vitality": 0.75,
597
+ "cognition": 0.32,
598
+ "progress": 1.0,
599
+ "serenity": 0.95,
600
+ "connection": 0.4,
601
+ "actions": [
602
+ "me_time",
603
+ "meditate",
604
+ "learn",
605
+ "meditate",
606
+ "learn",
607
+ "family_time",
608
+ "deep_work",
609
+ "family_time",
610
+ "me_time",
611
+ "admin_work",
612
+ "sleep",
613
+ "meditate",
614
+ "sleep",
615
+ "admin_work",
616
+ "meditate",
617
+ "me_time",
618
+ "sleep",
619
+ "sleep",
620
+ "binge_watch",
621
+ "admin_work",
622
+ "deep_work",
623
+ "admin_work",
624
+ "admin_work",
625
+ "binge_watch",
626
+ "learn",
627
+ "sleep",
628
+ "me_time",
629
+ "deep_work"
630
+ ]
631
+ },
632
+ {
633
+ "profile": "workaholic_stoic",
634
+ "strategy": "random",
635
+ "seed": 0,
636
+ "final_score": 0.6185,
637
+ "total_reward": 4.07,
638
+ "vitality": 0.4,
639
+ "cognition": 0.52,
640
+ "progress": 0.55,
641
+ "serenity": 0.95,
642
+ "connection": 0.41,
643
+ "actions": [
644
+ "socialize",
645
+ "me_time",
646
+ "binge_watch",
647
+ "socialize",
648
+ "exercise",
649
+ "family_time",
650
+ "sleep",
651
+ "admin_work",
652
+ "meditate",
653
+ "binge_watch",
654
+ "admin_work",
655
+ "deep_work",
656
+ "sleep",
657
+ "meditate",
658
+ "sleep",
659
+ "family_time",
660
+ "exercise",
661
+ "deep_work",
662
+ "admin_work",
663
+ "meditate",
664
+ "socialize",
665
+ "binge_watch",
666
+ "exercise",
667
+ "meditate",
668
+ "learn",
669
+ "socialize",
670
+ "admin_work",
671
+ "sleep"
672
+ ]
673
+ },
674
+ {
675
+ "profile": "workaholic_stoic",
676
+ "strategy": "random",
677
+ "seed": 1,
678
+ "final_score": 0.6094,
679
+ "total_reward": 5.39,
680
+ "vitality": 0.04,
681
+ "cognition": 0.55,
682
+ "progress": 0.6,
683
+ "serenity": 1.0,
684
+ "connection": 0.44,
685
+ "actions": [
686
+ "exercise",
687
+ "meditate",
688
+ "sleep",
689
+ "meditate",
690
+ "meditate",
691
+ "me_time",
692
+ "meditate",
693
+ "learn",
694
+ "meditate",
695
+ "meditate",
696
+ "socialize",
697
+ "socialize",
698
+ "deep_work",
699
+ "meditate",
700
+ "socialize",
701
+ "deep_work",
702
+ "meditate",
703
+ "sleep",
704
+ "learn",
705
+ "socialize",
706
+ "deep_work",
707
+ "socialize",
708
+ "learn",
709
+ "sleep",
710
+ "family_time",
711
+ "meditate",
712
+ "meditate",
713
+ "admin_work"
714
+ ]
715
+ },
716
+ {
717
+ "profile": "workaholic_stoic",
718
+ "strategy": "random",
719
+ "seed": 2,
720
+ "final_score": 0.5782,
721
+ "total_reward": 7.33,
722
+ "vitality": 0.4,
723
+ "cognition": 0.23,
724
+ "progress": 0.88,
725
+ "serenity": 0.99,
726
+ "connection": 0.0,
727
+ "actions": [
728
+ "me_time",
729
+ "meditate",
730
+ "learn",
731
+ "meditate",
732
+ "learn",
733
+ "family_time",
734
+ "deep_work",
735
+ "family_time",
736
+ "me_time",
737
+ "admin_work",
738
+ "sleep",
739
+ "meditate",
740
+ "sleep",
741
+ "admin_work",
742
+ "meditate",
743
+ "me_time",
744
+ "sleep",
745
+ "sleep",
746
+ "binge_watch",
747
+ "admin_work",
748
+ "deep_work",
749
+ "admin_work",
750
+ "admin_work",
751
+ "binge_watch",
752
+ "learn",
753
+ "sleep",
754
+ "me_time",
755
+ "deep_work"
756
+ ]
757
+ }
758
+ ]
inference.py CHANGED
@@ -74,30 +74,32 @@ SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
74
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
75
 
76
  SYSTEM_PROMPT = textwrap.dedent("""\
77
- You are a life management agent. You balance 5 life meters over a 7-day week.
78
- Each day has 4 slots: Morning (0), Afternoon (1), Evening (2), Night (3).
79
- Total: 28 steps.
80
-
81
- Meters (0.0 to 1.0):
82
- - Vitality: physical energy, sleep quality
83
- - Cognition: mental clarity, focus
84
- - Progress: career/skill advancement
85
- - Serenity: inner peace, stress management
86
- - Connection: relationship health
87
-
88
- Available actions (respond with EXACTLY the action name):
89
  DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
90
  FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH
91
 
92
- Key rules:
93
- - Low Vitality reduces effectiveness of all positive actions.
94
- - If ANY meter drops below 0.1, you get a heavy penalty.
95
- - Connection decays passively if you don't socialize.
96
- - Different episodes may have different hidden dynamics — adapt!
97
- - Time of day matters: cognition is generally best in the morning.
98
- - Balance all meters — neglecting any one will hurt your score.
99
 
100
- Respond with ONLY the action name, nothing else.""")
101
 
102
 
103
  # ---------------------------------------------------------------------------
@@ -183,11 +185,22 @@ def choose_action_heuristic(obs) -> RhythmAction:
183
 
184
 
185
  def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
186
- """Use LLM to pick an action, fall back to heuristic on failure."""
187
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
188
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
189
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
190
 
191
  user_prompt = textwrap.dedent(f"""\
192
  Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
193
  Remaining steps: {obs.remaining_steps}
@@ -197,9 +210,9 @@ Meters:
197
  Cognition: {obs.cognition:.2f}
198
  Progress: {obs.progress:.2f}
199
  Serenity: {obs.serenity:.2f}
200
- Connection: {obs.connection:.2f}{event_str}
201
 
202
- Choose your action:""")
203
 
204
  try:
205
  completion = llm_client.chat.completions.create(
@@ -209,7 +222,7 @@ Choose your action:""")
209
  {"role": "user", "content": user_prompt},
210
  ],
211
  temperature=0.3,
212
- max_tokens=30,
213
  stream=False,
214
  )
215
  text = (completion.choices[0].message.content or "").strip()
@@ -219,20 +232,28 @@ Choose your action:""")
219
 
220
 
221
  def parse_llm_action(text: str) -> RhythmAction:
222
- """Parse LLM response text into a RhythmAction."""
223
- text = text.strip().upper().replace(" ", "_")
224
 
225
- # Try direct match
226
  for action_type in ActionType:
227
  if action_type.value.upper() == text:
228
  return RhythmAction(action_type=action_type)
229
-
230
- # Try partial match
231
  for action_type in ActionType:
232
  if action_type.value.upper() in text:
233
  return RhythmAction(action_type=action_type)
234
-
235
- # Default fallback
236
  return RhythmAction(action_type=ActionType.SLEEP)
237
 
238
 
 
74
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
75
 
76
  SYSTEM_PROMPT = textwrap.dedent("""\
77
+ You are a life-management agent helping a person with HIDDEN preferences.
78
+ You see 5 life meters and a rolling history. The same action affects different
79
+ people differently — you must INFER who you're helping from the rewards and
80
+ meter changes you observe.
81
+
82
+ Each step, output ONE LINE in this exact format:
83
+ ACTION_NAME S M W
84
+
85
+ where ACTION_NAME is one of:
 
  DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
87
  FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH
88
 
89
+ and S, M, W are single digits (0-9) representing your current belief:
90
+ S = social preference (0=hates being social, 9=loves being social)
91
+ M = morning preference (0=night owl, 9=morning person)
92
+ W = work preference (0=avoids work, 9=workaholic)
93
+
94
+ Example: DEEP_WORK 3 8 7
 
95
 
96
+ Tips:
97
+ - Update your belief from rewards: SOCIALIZE giving big reward → raise S.
98
+ - Early in the week, PROBE different actions to learn the person.
99
+ - Late in the week, EXPLOIT what you've learned.
100
+ - Watch for crashes: any meter under 0.1 → big penalty.
101
+ - Connection decays passively — actively maintain it.
102
+ Respond with ONLY the format line, no other text.""")
103
 
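Since this format line is the entire interface between the model and the environment, here is a minimal round-trip sketch of the S/M/W digit codec the prompt describes. The helper names are hypothetical (illustration only, not functions from this repo); the `round(b * 9)` mapping matches how `scripts/pipeline_dryrun.py` builds its "perfect" completions below.

```python
# Hypothetical helpers (not part of the repo) showing the S M W digit codec:
# a belief in [0, 1] per trait maps to a single digit 0-9 and back via d / 9.

def belief_to_digits(belief):
    """Encode a 3-dim belief vector in [0, 1] as 'S M W' digits."""
    return " ".join(str(round(b * 9)) for b in belief)

def digits_to_belief(digits):
    """Decode 0-9 digits back into [0, 1]; quantization error is at most 1/18."""
    return [d / 9.0 for d in digits]

print(belief_to_digits([0.3, 0.8, 0.7]))  # -> "3 7 6"
print(digits_to_belief([3, 8, 7]))        # -> [0.33, 0.89, 0.78] (rounded)
```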
104
 
105
  # ---------------------------------------------------------------------------
 
185
 
186
 
187
  def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
188
+ """Use LLM to pick an action (and emit belief), fall back to heuristic on failure."""
189
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
190
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
191
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
192
 
193
+ history_lines = []
194
+ for h in (getattr(obs, "step_history", None) or [])[-5:]:
195
+ history_lines.append(
196
+ f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
197
+ f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
198
+ f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
199
+ )
200
+ history_str = ""
201
+ if history_lines:
202
+ history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
203
+
204
  user_prompt = textwrap.dedent(f"""\
205
  Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
206
  Remaining steps: {obs.remaining_steps}
 
210
  Cognition: {obs.cognition:.2f}
211
  Progress: {obs.progress:.2f}
212
  Serenity: {obs.serenity:.2f}
213
+ Connection: {obs.connection:.2f}{event_str}{history_str}
214
 
215
+ Choose your action (format: ACTION S M W):""")
216
 
217
  try:
218
  completion = llm_client.chat.completions.create(
 
222
  {"role": "user", "content": user_prompt},
223
  ],
224
  temperature=0.3,
225
+ max_tokens=20,
226
  stream=False,
227
  )
228
  text = (completion.choices[0].message.content or "").strip()
 
232
 
233
 
234
  def parse_llm_action(text: str) -> RhythmAction:
235
+ """Parse LLM response (action+belief format) into a RhythmAction.
 
236
 
237
+ Belief digits are ignored at inference time — they only demonstrate the agent's inferred belief.
238
+ """
239
+ # Reuse the training parser for consistency
240
+ sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "training"))
241
+ try:
242
+ from reward_functions import extract_action_and_belief
243
+ action, _belief, _provided = extract_action_and_belief(text)
244
+ if action is not None:
245
+ return RhythmAction(action_type=action)
246
+ except ImportError:
247
+ pass
248
+
249
+ # Fallback: legacy parsing
250
+ text = text.strip().upper().replace(" ", "_")
251
  for action_type in ActionType:
252
  if action_type.value.upper() == text:
253
  return RhythmAction(action_type=action_type)
254
  for action_type in ActionType:
255
  if action_type.value.upper() in text:
256
  return RhythmAction(action_type=action_type)
257
  return RhythmAction(action_type=ActionType.SLEEP)
258
 
259
 
scripts/diagnostic_replay.py ADDED
@@ -0,0 +1,90 @@
1
+ """
2
+ Diagnostic: verify env_reward replay matches live env.
3
+
4
+ For 10 seeds, plays a random episode while recording each (action, reward).
5
+ Then for each step independently, replays the prefix and asserts the reward
6
+ from the replay matches the recorded reward within 1e-6.
7
+
8
+ Run from rhythm_env root:
9
+ python scripts/diagnostic_replay.py
10
+ """
11
+
12
+ import os
13
+ import random
14
+ import sys
15
+
16
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
17
+
18
+ from models import ActionType, RhythmAction
19
+ from server.rhythm_environment import RhythmEnvironment, MAX_STEPS
20
+ from training.reward_functions import env_reward
21
+
22
+
23
+ def play_and_record(seed: int):
24
+ """Play one random episode, return list of (action_history_at_step, action_taken, reward_received)."""
25
+ env = RhythmEnvironment()
26
+ env.reset(seed=seed)
27
+ rng = random.Random(seed + 7777)
28
+ actions_so_far = []
29
+ records = []
30
+ for _ in range(MAX_STEPS):
31
+ action_type = rng.choice(list(ActionType))
32
+ history_snapshot = list(actions_so_far)
33
+ obs = env.step(RhythmAction(action_type=action_type))
34
+ records.append((history_snapshot, action_type, obs.reward))
35
+ actions_so_far.append(action_type.value)
36
+ if obs.done:
37
+ break
38
+ return records
39
+
40
+
41
+ def main():
42
+ print("=" * 70)
43
+ print("env_reward replay diagnostic")
44
+ print("=" * 70)
45
+
46
+ total_steps = 0
47
+ mismatches = 0
48
+ max_diff = 0.0
49
+
50
+ for seed in range(10):
51
+ records = play_and_record(seed)
52
+ for history, action_type, recorded_reward in records:
53
+ # Build a fake completion that yields this exact action.
54
+ completions = [[{"content": action_type.value.upper()}]]
55
+
56
+ scores = env_reward(
57
+ completions,
58
+ seed=[seed],
59
+ step_index=[len(history)],
60
+ action_history=[history],
61
+ )
62
+ replay_reward = scores[0]
63
+ diff = abs(replay_reward - recorded_reward)
64
+ max_diff = max(max_diff, diff)
65
+ if diff > 1e-6:
66
+ mismatches += 1
67
+ if mismatches <= 3:
68
+ print(
69
+ f" MISMATCH seed={seed} step={len(history)} "
70
+ f"action={action_type.value} recorded={recorded_reward:.6f} "
71
+ f"replay={replay_reward:.6f} diff={diff:.6f}"
72
+ )
73
+ total_steps += 1
74
+
75
+ print()
76
+ print(f"Total steps checked: {total_steps}")
77
+ print(f"Mismatches (>1e-6): {mismatches}")
78
+ print(f"Max diff: {max_diff:.6e}")
79
+ print()
80
+ if mismatches == 0:
81
+ print("PASS: env_reward replay is deterministic and matches live env.")
82
+ return 0
83
+ else:
84
+ print("FAIL: env_reward replay diverges from live env.")
85
+ print("Likely cause: non-determinism in env (RNG state, profile selection, etc.)")
86
+ return 1
87
+
88
+
89
+ if __name__ == "__main__":
90
+ sys.exit(main())
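Condensed to a single step, the contract this diagnostic verifies looks like the sketch below (it follows the call signature used in the script above; the printed number depends on the environment and is not a known constant).

```python
# env_reward reconstructs the episode from (seed, action_history), replays the
# prefix, then steps the completion's action; the score it returns must equal
# the reward a live environment would have produced at that step.
from training.reward_functions import env_reward

score = env_reward(
    [[{"content": "DEEP_WORK"}]],            # one completion, GRPO message format
    seed=[0],                                # which episode to rebuild
    step_index=[2],                          # the step this prompt corresponds to
    action_history=[["sleep", "exercise"]],  # actions taken before this step
)
print(score[0])  # must match the live reward for DEEP_WORK at step 2 of seed 0
```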
scripts/pipeline_dryrun.py ADDED
@@ -0,0 +1,121 @@
1
+ """
2
+ Pipeline dry-run: validate the full meta-RL reward stack end-to-end without GPU.
3
+
4
+ Generates a small dataset, synthesizes completions of varying quality
5
+ (random valid, perfect, garbage, action-only, action+belief, etc.), and
6
+ runs all 4 reward functions. Reports score distributions and prompt sizes.
7
+
8
+ This is the local Gate 2 smoke check — proves the dataset, parser, and
9
+ reward stack are internally consistent before kicking off real training.
10
+
11
+ Run from rhythm_env root:
12
+ python scripts/pipeline_dryrun.py
13
+ """
14
+
15
+ import os
16
+ import random
17
+ import sys
18
+
19
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
20
+
21
+ from training.dataset import generate_dataset
22
+ from training.reward_functions import (
23
+ extract_action_and_belief,
24
+ format_valid,
25
+ action_legal,
26
+ env_reward,
27
+ belief_accuracy,
28
+ )
29
+ from models import ActionType
30
+ from server.rhythm_environment import sample_profile, profile_to_belief_vector
31
+
32
+
33
+ def synth_completion(prompt_seed: int, kind: str) -> str:
34
+ """Synthesize a completion of a given quality."""
35
+ rng = random.Random(prompt_seed)
36
+ actions = list(ActionType)
37
+ action_str = rng.choice(actions).value.upper()
38
+ s, m, w = rng.randint(0, 9), rng.randint(0, 9), rng.randint(0, 9)
39
+ if kind == "perfect":
40
+ # Perfect belief means matching the profile
41
+ true = profile_to_belief_vector(sample_profile(prompt_seed))
42
+ s = round(true[0] * 9)
43
+ m = round(true[1] * 9)
44
+ w = round(true[2] * 9)
45
+ return f"{action_str} {s} {m} {w}"
46
+ if kind == "good":
47
+ return f"{action_str} {s} {m} {w}"
48
+ if kind == "action_only":
49
+ return action_str
50
+ if kind == "garbage":
51
+ return "I don't know what to do here"
52
+ if kind == "verbose":
53
+ return f"My choice is {action_str} with belief {s} {m} {w} based on the rewards I see."
54
+ if kind == "wrong_belief":
55
+ # Output opposite of true belief
56
+ true = profile_to_belief_vector(sample_profile(prompt_seed))
57
+ s = round((1 - true[0]) * 9)
58
+ m = round((1 - true[1]) * 9)
59
+ w = round((1 - true[2]) * 9)
60
+ return f"{action_str} {s} {m} {w}"
61
+ return action_str
62
+
63
+
64
+ def main():
65
+ print("=" * 70)
66
+ print("Pipeline Dry-Run")
67
+ print("=" * 70)
68
+
69
+ # Generate a small dataset (continuous profiles, 10% hint)
70
+ samples = generate_dataset(
71
+ num_episodes=20,
72
+ strategy="mixed",
73
+ max_samples=80,
74
+ profile_mode="continuous",
75
+ hint_fraction=0.1,
76
+ )
77
+
78
+ # Prompt size analysis
79
+ sizes = [len(s["prompt"][0]["content"]) + len(s["prompt"][1]["content"]) for s in samples]
80
+ print(f"\nPrompt sizes (chars): min={min(sizes)}, max={max(sizes)}, mean={sum(sizes)/len(sizes):.0f}")
81
+ # Rough token estimate: ~4 chars per token
82
+ print(f"Estimated tokens: min={min(sizes)//4}, max={max(sizes)//4}, mean={sum(sizes)//len(sizes)//4}")
83
+
84
+ print("\n" + "=" * 70)
85
+ print("Reward distributions across completion kinds")
86
+ print("=" * 70)
87
+
88
+ # For each completion kind, generate completions for first N samples
89
+ kinds = ["perfect", "good", "action_only", "garbage", "verbose", "wrong_belief"]
90
+ n = 30
91
+
92
+ sub = samples[:n]
93
+ seeds_col = [s["seed"] for s in sub]
94
+ history_col = [s["action_history"] for s in sub]
95
+ mode_col = [s["profile_mode"] for s in sub]
96
+
97
+ print(f"\n{'kind':<14} | {'fmt':>6} {'leg':>6} {'env':>6} {'bel':>6} | {'TOTAL':>6}")
98
+ print("-" * 60)
99
+ for kind in kinds:
100
+ completions = [[{"content": synth_completion(s["seed"], kind)}] for s in sub]
101
+ f_scores = format_valid(completions)
102
+ l_scores = action_legal(completions)
103
+ e_scores = env_reward(completions, seed=seeds_col, action_history=history_col, profile_mode=mode_col)
104
+ b_scores = belief_accuracy(completions, seed=seeds_col, action_history=history_col, profile_mode=mode_col)
105
+ f_avg = sum(f_scores) / len(f_scores)
106
+ l_avg = sum(l_scores) / len(l_scores)
107
+ e_avg = sum(e_scores) / len(e_scores)
108
+ b_avg = sum(b_scores) / len(b_scores)
109
+ total = f_avg + l_avg + e_avg + b_avg
110
+ print(f"{kind:<14} | {f_avg:+6.2f} {l_avg:+6.2f} {e_avg:+6.2f} {b_avg:+6.2f} | {total:+6.2f}")
111
+
112
+ print()
113
+ print("Expected ordering (best -> worst total):")
114
+ print(" perfect > good > wrong_belief, action_only > verbose > garbage")
115
+ print("If `perfect > wrong_belief`, the belief signal is gradient-providing.")
116
+ print("If `good > action_only`, format_valid pushes toward emitting belief.")
117
+ print("If `garbage` is most negative, format penalty is doing its job.")
118
+
119
+
120
+ if __name__ == "__main__":
121
+ main()
server/rhythm_environment.py CHANGED
@@ -187,6 +187,104 @@ SOCIAL_ACTIONS = {"family_time", "socialize"}
187
  IDLE_ACTIONS = {"me_time", "binge_watch", "sleep"}
188
  WORK_ACTIONS = {"deep_work", "learn", "admin_work"}
189
 
190
 
191
  class RhythmEnvironment(Environment):
192
  """
@@ -226,6 +324,7 @@ class RhythmEnvironment(Environment):
226
  self._total_reward: float = 0.0
227
  self._recent_actions: list = []
228
  self._step_history: list = []
 
229
 
230
  def get_metadata(self) -> EnvironmentMetadata:
231
  return EnvironmentMetadata(
@@ -256,16 +355,20 @@ class RhythmEnvironment(Environment):
256
 
257
  self._rng = random.Random(effective_seed)
258
 
259
- # Profile selection: explicit kwarg overrides; otherwise use scrambled seed.
260
- # Scrambling decouples profile from episode dynamics (events, etc.) so the
261
- # model cannot memorize seed profile patterns during training.
 
262
  profile_name = kwargs.get("profile")
 
263
  if profile_name and profile_name in PROFILE_MAP:
264
  self._profile = deepcopy(PROFILE_MAP[profile_name])
265
- else:
266
  profile_rng = random.Random(effective_seed ^ 0xA3C5F729)
267
  profile_index = profile_rng.randint(0, len(PROFILES) - 1)
268
  self._profile = deepcopy(PROFILES[profile_index])
269
 
270
  # Initialize meters from profile defaults
271
  initial = self._profile["initial_meters"]
@@ -281,6 +384,7 @@ class RhythmEnvironment(Environment):
281
  self._total_reward = 0.0
282
  self._recent_actions = []
283
  self._step_history = []
 
284
 
285
  self._state = RhythmState(
286
  episode_id=episode_id or str(uuid4()),
@@ -376,6 +480,7 @@ class RhythmEnvironment(Environment):
376
  # Clamp reward
377
  reward = max(-3.0, min(3.0, round(reward, 4)))
378
  self._total_reward += reward
 
379
 
380
  # --- 10. Advance timestep ---
381
  self._timestep += 1
@@ -446,6 +551,24 @@ class RhythmEnvironment(Environment):
446
  def state(self) -> RhythmState:
447
  return self._state
448
 
449
  # ------------------------------------------------------------------
450
  # Internal helpers
451
  # ------------------------------------------------------------------
@@ -581,66 +704,60 @@ class RhythmEnvironment(Environment):
581
  """
582
  Compute final episode score in [0, 1].
583
 
584
- Scoring breakdown:
585
- 0.25 — meter balance (high mean, low variance)
586
- 0.20 — crash-free ratio (no critical meter drops)
587
- 0.20 — progress made
588
- 0.15 — connection maintained
589
- 0.05 — efficiency (average reward)
590
- 0.15 — discovery bonus (profile-adapted strategy in second half)
591
- """
592
- meters = {m: getattr(self, f"_{m}") for m in METERS}
593
 
594
- # 1. Meter balance (0.25): high mean, low variance
595
- values = list(meters.values())
596
- mean_meter = sum(values) / len(values)
597
- variance = sum((v - mean_meter) ** 2 for v in values) / len(values)
598
- balance_score = max(0.0, mean_meter - variance)
599
 
600
- # 2. No crashes (0.20): fraction of steps without critical meters
601
  steps = max(self._timestep, 1)
602
  crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))
603
 
604
- # 3. Progress made (0.20)
605
  progress_score = self._progress
606
 
607
- # 4. Connection maintained (0.15)
608
  connection_score = self._connection
609
 
610
- # 5. Efficiency (0.05): normalized average reward
611
  avg_reward = self._total_reward / steps
612
  efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))
613
 
614
- # 6. Discovery bonus (0.15): did the agent adapt its strategy to the
615
- # hidden profile in the second half of the week (steps 14–27)?
616
- # This is the only component that directly rewards personality discovery.
617
- second_half = self._recent_actions[14:]
618
- if len(second_half) > 0:
619
- profile_name = self._profile["name"]
620
- if profile_name == "introvert_morning":
621
- # Introvert should minimise social actions
622
- social_frac = sum(1 for a in second_half if a in SOCIAL_ACTIONS) / len(second_half)
623
- discovery_score = max(0.0, 1.0 - social_frac * 2.5)
624
- elif profile_name == "extrovert_night_owl":
625
- # Extrovert should embrace social actions
626
- social_frac = sum(1 for a in second_half if a in SOCIAL_ACTIONS) / len(second_half)
627
- discovery_score = min(1.0, social_frac * 2.5)
628
- elif profile_name == "workaholic_stoic":
629
- # Workaholic should front-load work actions
630
- work_frac = sum(1 for a in second_half if a in WORK_ACTIONS) / len(second_half)
631
- discovery_score = min(1.0, work_frac * 1.5)
632
- else:
633
- discovery_score = 0.5
634
- else:
635
- discovery_score = 0.5
636
-
637
  score = (
638
- 0.25 * balance_score
639
- + 0.20 * crash_free_ratio
640
- + 0.20 * progress_score
641
  + 0.15 * connection_score
642
- + 0.05 * efficiency_score
643
- + 0.15 * discovery_score
644
  )
645
  return max(0.0, min(1.0, score))
646
 
 
187
  IDLE_ACTIONS = {"me_time", "binge_watch", "sleep"}
188
  WORK_ACTIONS = {"deep_work", "learn", "admin_work"}
189
 
190
+ # ---------------------------------------------------------------------------
191
+ # Continuous profile sampling (meta-RL training distribution)
192
+ # ---------------------------------------------------------------------------
193
+ #
194
+ # Hardcoded profiles are 3 fixed personalities — memorizable, classification-like.
195
+ # Continuous sampling draws profile parameters from distributions per episode,
196
+ # making memorization impossible and forcing the agent to learn the *skill* of
197
+ # profile inference. This is the core meta-learning move.
198
+ #
199
+ # Belief vector dimensions (for Phase 3 cosine-similarity reward):
200
+ # social_pref: 0 = hates social (introvert), 1 = loves social (extrovert)
201
+ # morning_pref: 0 = night owl, 1 = morning person
202
+ # work_pref: 0 = work-averse, 1 = workaholic
203
+
204
+ def sample_profile(seed: int) -> Dict[str, Any]:
205
+ """Sample a continuous profile deterministically from a seed.
206
+
207
+ Reward weights are drawn from a Dirichlet biased toward non-infrastructure
208
+ meters (progress, serenity, connection). Per-action modifiers are drawn from
209
+ bounded uniforms so any sampled profile is playable.
210
+ """
211
+ rng = random.Random(seed ^ 0xA3C5F729)
212
+
213
+ # Reward weights via Dirichlet (α<1 produces sparse weights → "personalities")
214
+ alphas = [0.5, 0.5, 1.5, 1.5, 1.5] # [vit, cog, prog, ser, conn]
215
+ raw = [rng.gammavariate(a, 1.0) for a in alphas]
216
+ total = sum(raw)
217
+ weights = [w / total for w in raw]
218
+ # Clamp and re-normalize to avoid degenerate weights
219
+ weights = [max(0.02, min(0.80, w)) for w in weights]
220
+ total = sum(weights)
221
+ weights = [w / total for w in weights]
222
+
223
+ return {
224
+ "name": f"sampled_{seed}",
225
+ "social_vitality_multiplier": rng.uniform(0.2, 3.0),
226
+ "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
227
+ "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
228
+ "morning_penalty": rng.uniform(0.4, 0.9) if rng.random() < 0.3 else None,
229
+ "binge_shame": rng.random() < 0.5,
230
+ "progress_serenity_bonus": rng.uniform(0.0, 0.10),
231
+ "idle_serenity_decay": rng.uniform(0.0, 0.10),
232
+ "vitality_decay_rate": rng.uniform(0.0, 0.04),
233
+ "stress_tolerance": rng.uniform(0.15, 0.30),
234
+ "event_impact_multiplier": rng.uniform(0.5, 1.0),
235
+ "connection_decay_rate": rng.uniform(0.005, 0.02),
236
+ "solo_serenity_bonus": rng.uniform(0.0, 0.10),
237
+ "social_connection_multiplier": rng.uniform(1.0, 2.0),
238
+ "social_serenity_bonus": rng.uniform(0.0, 0.06),
239
+ "work_vitality_recovery": rng.uniform(0.0, 0.06),
240
+ "initial_meters": {
241
+ "vitality": 0.7, "cognition": 0.7, "progress": 0.0,
242
+ "serenity": 0.7, "connection": 0.5,
243
+ },
244
+ "reward_weights": dict(zip(METERS, weights)),
245
+ }
246
+
247
+
248
+ def profile_to_belief_vector(profile: Dict[str, Any]) -> List[float]:
249
+ """Reduce a profile to a 3-dim trait vector [social, morning, work] in [0, 1].
250
+
251
+ Used as the ground-truth target for the agent's belief output (Phase 3).
252
+ Aggregates the most-diagnostic modifiers per trait.
253
+ """
254
+ # social_pref: low if social drains vitality a lot, high if social bonuses are big
255
+ # social_vitality_multiplier in [0.2, 3.0]: lower = more extroverted
256
+ sm = profile.get("social_vitality_multiplier", 1.0)
257
+ social_drain_norm = 1.0 - max(0.0, min(1.0, (sm - 0.2) / 2.8)) # invert
258
+ scm = profile.get("social_connection_multiplier", 1.0)
259
+ social_conn_norm = max(0.0, min(1.0, (scm - 1.0) / 1.0))
260
+ ssb = profile.get("social_serenity_bonus", 0.0)
261
+ social_ser_norm = max(0.0, min(1.0, ssb / 0.06))
262
+ social_pref = 0.5 * social_drain_norm + 0.3 * social_conn_norm + 0.2 * social_ser_norm
263
+
264
+ # morning_pref: high if morning bonus exists, low if morning penalty exists
265
+ mcb = profile.get("morning_cognition_bonus")
266
+ mp = profile.get("morning_penalty")
267
+ morning_pref = 0.5
268
+ if mcb is not None:
269
+ morning_pref = 0.5 + 0.5 * max(0.0, min(1.0, (mcb - 0.4) / 1.6))
270
+ if mp is not None:
271
+ morning_pref = min(morning_pref, 0.5 - 0.5 * (1.0 - mp))
272
+
273
+ # work_pref: high if work recovers vitality + progress gives serenity + progress weight high
274
+ wvr = profile.get("work_vitality_recovery", 0.0)
275
+ wvr_norm = max(0.0, min(1.0, wvr / 0.06))
276
+ psb = profile.get("progress_serenity_bonus", 0.0)
277
+ psb_norm = max(0.0, min(1.0, psb / 0.10))
278
+ pw = profile.get("reward_weights", {}).get("progress", 0.2)
279
+ pw_norm = max(0.0, min(1.0, (pw - 0.05) / 0.65))
280
+ work_pref = 0.4 * wvr_norm + 0.3 * psb_norm + 0.3 * pw_norm
281
+
282
+ return [
283
+ max(0.0, min(1.0, social_pref)),
284
+ max(0.0, min(1.0, morning_pref)),
285
+ max(0.0, min(1.0, work_pref)),
286
+ ]
287
+
288
 
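A quick usage sketch of the two helpers above; determinism is the property everything downstream (reward replay, belief targets) relies on. The assert values are generic, not actual numbers from a run.

```python
from server.rhythm_environment import sample_profile, profile_to_belief_vector

# Same seed -> same profile: sampling is deterministic, which is what lets
# training-side reward functions reconstruct an episode's hidden profile.
assert sample_profile(42) == sample_profile(42)

# Reduce a sampled profile to the 3-dim ground-truth belief target.
social, morning, work = profile_to_belief_vector(sample_profile(42))
assert all(0.0 <= v <= 1.0 for v in (social, morning, work))
```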
289
  class RhythmEnvironment(Environment):
290
  """
 
324
  self._total_reward: float = 0.0
325
  self._recent_actions: list = []
326
  self._step_history: list = []
327
+ self._step_rewards: list = [] # per-step rewards (for adaptation_score in grader)
328
 
329
  def get_metadata(self) -> EnvironmentMetadata:
330
  return EnvironmentMetadata(
 
355
 
356
  self._rng = random.Random(effective_seed)
357
 
358
+ # Profile selection (three modes):
359
+ # 1. Explicit hardcoded profile name → legacy 3-profile mode (eval comparison)
360
+ # 2. profile_mode='discrete' → pick from 3 hardcoded profiles by scrambled seed
361
+ # 3. Default → sampled continuous profile (meta-RL training distribution)
362
  profile_name = kwargs.get("profile")
363
+ profile_mode = kwargs.get("profile_mode", "continuous")
364
  if profile_name and profile_name in PROFILE_MAP:
365
  self._profile = deepcopy(PROFILE_MAP[profile_name])
366
+ elif profile_mode == "discrete":
367
  profile_rng = random.Random(effective_seed ^ 0xA3C5F729)
368
  profile_index = profile_rng.randint(0, len(PROFILES) - 1)
369
  self._profile = deepcopy(PROFILES[profile_index])
370
+ else:
371
+ self._profile = sample_profile(effective_seed)
372
 
373
  # Initialize meters from profile defaults
374
  initial = self._profile["initial_meters"]
 
384
  self._total_reward = 0.0
385
  self._recent_actions = []
386
  self._step_history = []
387
+ self._step_rewards = []
388
 
389
  self._state = RhythmState(
390
  episode_id=episode_id or str(uuid4()),
 
480
  # Clamp reward
481
  reward = max(-3.0, min(3.0, round(reward, 4)))
482
  self._total_reward += reward
483
+ self._step_rewards.append(reward)
484
 
485
  # --- 10. Advance timestep ---
486
  self._timestep += 1
 
551
  def state(self) -> RhythmState:
552
  return self._state
553
 
554
+ def get_belief_target(self) -> List[float]:
555
+ """Return the 3-dim ground-truth belief vector for the active profile.
556
+
557
+ Used during training to compute belief-accuracy reward (Phase 3).
558
+ Privileged information — not exposed via observation.
559
+ """
560
+ return profile_to_belief_vector(self._profile)
561
+
562
+ def get_profile_hint(self) -> Dict[str, float]:
563
+ """Return a coarse profile hint usable in observation during curriculum.
564
+
565
+ Returns the 3-dim belief vector with descriptive keys. The hint is
566
+ included in observations only when the env is started with
567
+ profile_visible=True — this is the curriculum's "visible" phase.
568
+ """
569
+ b = profile_to_belief_vector(self._profile)
570
+ return {"social_pref": round(b[0], 3), "morning_pref": round(b[1], 3), "work_pref": round(b[2], 3)}
571
+
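To illustrate how the training side might consume these accessors, here is an MAE-style belief-scoring sketch. The exact mapping inside belief_accuracy is not shown in this diff, so the scoring line below is an assumption, not the shipped formula.

```python
from server.rhythm_environment import RhythmEnvironment

env = RhythmEnvironment()
env.reset(seed=7)  # default profile_mode='continuous'

target = env.get_belief_target()              # privileged [social, morning, work]
agent_belief = [d / 9.0 for d in (5, 4, 8)]   # digits parsed from e.g. "LEARN 5 4 8"

# Hypothetical scoring: mean absolute error, rescaled so zero error -> 1.0.
mae = sum(abs(a - t) for a, t in zip(agent_belief, target)) / 3.0
print(f"belief_mae={mae:.3f}, illustrative score={1.0 - 2.0 * mae:.2f}")
```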
572
  # ------------------------------------------------------------------
573
  # Internal helpers
574
  # ------------------------------------------------------------------
 
704
  """
705
  Compute final episode score in [0, 1].
706
 
707
+ Components (meta-learning aligned):
708
+ 0.20 — crash_free_ratio: no critical meter drops
709
+ 0.25 — progress: career/skill growth
710
+ 0.15 — connection: relationship maintained
711
+ 0.30 — adaptation_score: agent got better as it learned the user
712
+ 0.10 — efficiency: bounded normalized average reward
713
 
714
+ adaptation_score is the meta-learning signal: late-half mean per-step
715
+ reward minus early-half mean, gated by absolute late-half quality so
716
+ a "terrible-then-mediocre" exploit cannot win.
717
 
718
+ Per-step reward is already profile-weighted via _compute_reward(), so
719
+ a high late-half mean directly means the agent figured out and
720
+ optimized for THIS profile's preferences.
721
+ """
722
  steps = max(self._timestep, 1)
723
+
724
+ # 1. Crash-free ratio (0.20)
725
  crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))
726
 
727
+ # 2. Progress (0.25)
728
  progress_score = self._progress
729
 
730
+ # 3. Connection (0.15)
731
  connection_score = self._connection
732
 
733
+ # 4. Adaptation score (0.30): the KEY meta-learning signal.
734
+ # Split rewards in halves; positive only if late half is non-negative
735
+ # AND late > early. Normalized to [0, 1].
736
+ half = max(steps // 2, 1)
737
+ early = self._step_rewards[:half]
738
+ late = self._step_rewards[half:]
739
+ if early and late:
740
+ mean_early = sum(early) / len(early)
741
+ mean_late = sum(late) / len(late)
742
+ # late_quality: rewards typically in [-1, 1] per step, normalize
743
+ late_quality = max(0.0, min(1.0, (mean_late + 1.0) / 2.0))
744
+ gain = mean_late - mean_early
745
+ # gain typically in [-1, 1]; clip to [0, 1] (only positive counts)
746
+ gain_norm = max(0.0, min(1.0, gain))
747
+ adaptation_score = gain_norm * late_quality
748
+ else:
749
+ adaptation_score = 0.0
750
+
751
+ # 5. Efficiency (0.10): bounded normalized average reward
752
  avg_reward = self._total_reward / steps
753
  efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))
754
 
755
  score = (
756
+ 0.20 * crash_free_ratio
757
+ + 0.25 * progress_score
 
758
  + 0.15 * connection_score
759
+ + 0.30 * adaptation_score
760
+ + 0.10 * efficiency_score
761
  )
762
  return max(0.0, min(1.0, score))
763
 
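To make the gate concrete, here is a toy re-computation of adaptation_score using the same arithmetic as the grader above (self-contained, no env needed; the inputs are synthetic reward traces, not real episodes).

```python
# Mirrors the grader's adaptation computation on synthetic per-step rewards.
def toy_adaptation(rewards):
    half = max(len(rewards) // 2, 1)
    early, late = rewards[:half], rewards[half:]
    mean_early = sum(early) / len(early)
    mean_late = sum(late) / len(late)
    late_quality = max(0.0, min(1.0, (mean_late + 1.0) / 2.0))  # absolute gate
    gain_norm = max(0.0, min(1.0, mean_late - mean_early))      # only improvement counts
    return gain_norm * late_quality

print(toy_adaptation([0.0] * 14 + [0.6] * 14))    # 0.48: probed early, exploited late
print(toy_adaptation([-2.0] * 14 + [-1.0] * 14))  # 0.0: improved, but late half still bad
print(toy_adaptation([0.6] * 28))                 # 0.0: good but flat, no adaptation credit
```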
tests/test_rhythm_env.py CHANGED
@@ -356,10 +356,19 @@ class TestEdgeCases:
356
 
357
  def test_state_exposes_profile(self, env):
358
  """State should include profile_name for debugging."""
 
359
  env.reset(seed=0)
360
  assert env.state.profile_name != ""
361
  assert env.state.profile_name in [p["name"] for p in PROFILES]
362
 
363
  def test_all_action_types_valid(self, env):
364
  """Every ActionType should be processable without error."""
365
  env.reset(seed=0)
 
356
 
357
  def test_state_exposes_profile(self, env):
358
  """State should include profile_name for debugging."""
359
+ # Default mode: continuous profile (name like 'sampled_0')
360
  env.reset(seed=0)
361
  assert env.state.profile_name != ""
362
+ assert env.state.profile_name.startswith("sampled_")
363
+
364
+ # Discrete mode: name from hardcoded PROFILES list
365
+ env.reset(seed=0, profile_mode="discrete")
366
  assert env.state.profile_name in [p["name"] for p in PROFILES]
367
 
368
+ # Explicit profile: name matches the requested profile
369
+ env.reset(seed=0, profile="workaholic_stoic")
370
+ assert env.state.profile_name == "workaholic_stoic"
371
+
372
  def test_all_action_types_valid(self, env):
373
  """Every ActionType should be processable without error."""
374
  env.reset(seed=0)
training/RhythmEnv_GRPO_Training.ipynb CHANGED
@@ -4,15 +4,17 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# RhythmEnv GRPO Training \u2014 Life Simulator\n",
8
  "\n",
9
- "Train an LLM agent to balance 5 life meters across a 7-day week using **Group Relative Policy Optimization (GRPO)**.\n",
 
 
 
 
10
  "\n",
11
- "The agent must discover hidden personality profiles (introvert, extrovert, workaholic) through reward signals alone.\n",
12
  "\n",
13
- "**Requirements**: Free Colab GPU (T4) is sufficient. Training takes ~30-60 minutes for 500 steps.\n",
14
- "\n",
15
- "**Environment**: https://huggingface.co/spaces/InosLihka/rhythm_env"
16
  ]
17
  },
18
  {
@@ -31,9 +33,9 @@
31
  "%%capture\n",
32
  "!pip install --upgrade uv\n",
33
  "!uv pip install --system torch>=2.8.0 triton>=3.4.0 torchvision bitsandbytes\n",
34
- "!uv pip install --system transformers==4.56.2 trl==0.22.2 datasets wandb\n",
35
  "!pip install unsloth\n",
36
- "!pip install fastapi uvicorn pydantic matplotlib"
37
  ]
38
  },
39
  {
@@ -50,8 +52,9 @@
50
  "outputs": [],
51
  "source": [
52
  "import os\n",
 
53
  "\n",
54
- "# Clone the repo (update URL to your actual repo)\n",
55
  "REPO_URL = \"https://huggingface.co/spaces/InosLihka/rhythm_env\"\n",
56
  "ENV_DIR = \"/content/rhythm_env\"\n",
57
  "\n",
@@ -59,13 +62,10 @@
59
  " !git clone {REPO_URL} {ENV_DIR}\n",
60
  "\n",
61
  "os.chdir(ENV_DIR)\n",
62
- "\n",
63
- "import sys\n",
64
  "sys.path.insert(0, ENV_DIR)\n",
65
  "sys.path.insert(0, os.path.join(ENV_DIR, \"training\"))\n",
66
- "\n",
67
  "print(f\"Working directory: {os.getcwd()}\")\n",
68
- "print(f\"Files: {os.listdir('.')}\")"
69
  ]
70
  },
71
  {
@@ -86,12 +86,12 @@
86
  "\n",
87
  "env = RhythmEnvironment()\n",
88
  "obs = env.reset(seed=0)\n",
89
- "print(f\"Initial state \u2014 Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
90
  " f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
91
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
92
  "\n",
93
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
94
- "print(f\"\\nAfter DEEP_WORK \u2014 Reward: {obs.reward:.3f}\")\n",
95
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
96
  "print(\"\\nEnvironment is working!\")"
97
  ]
@@ -109,26 +109,9 @@
109
  "metadata": {},
110
  "outputs": [],
111
  "source": [
112
- "from training.inference_eval import run_episode, heuristic_action\n",
113
- "import json\n",
114
- "\n",
115
- "PROFILES = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
116
- "NUM_EVAL_EPISODES = 5\n",
117
- "\n",
118
- "baseline_results = {}\n",
119
- "\n",
120
- "for strategy in [\"heuristic\", \"random\"]:\n",
121
- " baseline_results[strategy] = {}\n",
122
- " for profile in PROFILES:\n",
123
- " scores = []\n",
124
- " for ep in range(NUM_EVAL_EPISODES):\n",
125
- " result = run_episode(profile=profile, seed=ep, strategy=strategy)\n",
126
- " scores.append(result[\"final_score\"])\n",
127
- " avg = sum(scores) / len(scores)\n",
128
- " baseline_results[strategy][profile] = avg\n",
129
- " print(f\"{strategy:10s} | {profile:25s} | avg_score={avg:.3f}\")\n",
130
- "\n",
131
- "print(\"\\nBaseline scores saved. These are our 'before training' numbers.\")"
132
  ]
133
  },
134
  {
@@ -143,7 +126,33 @@
143
  "execution_count": null,
144
  "metadata": {},
145
  "outputs": [],
146
- "source": "from dataset import generate_dataset\nfrom datasets import Dataset\n\nNUM_EPISODES = 200\nMAX_SAMPLES = 2000\n\nraw_samples = generate_dataset(\n num_episodes=NUM_EPISODES,\n strategy=\"mixed\",\n max_samples=MAX_SAMPLES,\n)\n\n# Include replay metadata so env_reward can reconstruct the exact state\ndataset = Dataset.from_list([\n {\n \"prompt\": sample[\"prompt\"],\n \"seed\": sample[\"seed\"],\n \"step_index\": sample[\"step_index\"],\n \"action_history\": sample[\"action_history\"],\n }\n for sample in raw_samples\n])\n\nprint(f\"\\nDataset size: {len(dataset)}\")\nprint(f\"\\nSample prompt:\")\nprint(dataset[0][\"prompt\"][1][\"content\"])"
147
  },
148
  {
149
  "cell_type": "markdown",
@@ -160,9 +169,9 @@
160
  "source": [
161
  "from unsloth import FastLanguageModel\n",
162
  "\n",
163
- "MODEL_NAME = \"unsloth/Qwen2.5-3B-Instruct\" # Good balance of capability vs VRAM\n",
164
- "MAX_SEQ_LENGTH = 768\n",
165
- "LORA_RANK = 4\n",
166
  "\n",
167
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
168
  " model_name=MODEL_NAME,\n",
@@ -173,15 +182,15 @@
173
  "model = FastLanguageModel.get_peft_model(\n",
174
  " model,\n",
175
  " r=LORA_RANK,\n",
176
- " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
177
- " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
178
  " lora_alpha=LORA_RANK * 2,\n",
179
  " use_gradient_checkpointing=\"unsloth\",\n",
180
  " random_state=3407,\n",
181
  ")\n",
182
- "\n",
183
- "print(f\"Model loaded: {MODEL_NAME}\")\n",
184
- "print(f\"LoRA rank: {LORA_RANK}, seq_length: {MAX_SEQ_LENGTH}\")"
185
  ]
186
  },
187
  {
@@ -196,7 +205,17 @@
196
  "execution_count": null,
197
  "metadata": {},
198
  "outputs": [],
199
- "source": "from reward_functions import format_valid, action_legal, env_reward\n\n# Three-layer reward stack:\n# 1. format_valid — did the model output a parseable action name? (+1.0 / -2.0)\n# 2. action_legal — is it one of the 10 valid actions? (+0.5 / -1.0)\n# 3. env_reward — actual reward from stepping the environment (varies)\n#\n# GRPOTrainer passes seed/step_index/action_history as dataset columns,\n# so env_reward replays the exact episode state for each training prompt.\nreward_funcs = [format_valid, action_legal, env_reward]\n\n# Quick sanity check\ntest_completions = [\n [{\"content\": \"DEEP_WORK\"}],\n [{\"content\": \"garbage\"}],\n [{\"content\": \"SLEEP\"}],\n]\nprint(\"format_valid:\", format_valid(test_completions))\nprint(\"action_legal:\", action_legal(test_completions))\nprint(\"env_reward (no history, seed=0):\", env_reward(test_completions))"
200
  },
201
  {
202
  "cell_type": "markdown",
@@ -211,12 +230,9 @@
211
  "metadata": {},
212
  "outputs": [],
213
  "source": [
214
- "import wandb\n",
215
- "\n",
216
- "# Optional: login to W&B for tracking (or set report_to=\"none\" below)\n",
217
- "# wandb.login(key=\"YOUR_KEY\")\n",
218
- "\n",
219
- "REPORT_TO = \"none\" # Change to \"wandb\" if you have a key"
220
  ]
221
  },
222
  {
@@ -227,18 +243,18 @@
227
  "source": [
228
  "from trl import GRPOConfig, GRPOTrainer\n",
229
  "\n",
230
- "MAX_STEPS = 500 # Increase to 1000 if time allows\n",
231
  "NUM_GENERATIONS = 4\n",
232
- "LEARNING_RATE = 5e-5 # Reduced from default — lower lr prevents destabilizing early gradient steps\n",
233
- "\n",
234
- "max_prompt_length = 400\n",
235
- "max_completion_length = 16 # Action names are 3-15 chars\n",
236
  "\n",
237
  "training_args = GRPOConfig(\n",
238
  " temperature=1.0,\n",
239
  " learning_rate=LEARNING_RATE,\n",
240
- " beta=0.1, # KL penalty — higher = more conservative, prevents policy drift\n",
241
- " max_grad_norm=0.5, # Gradient clipping prevents large destabilizing updates\n",
242
  " weight_decay=0.001,\n",
243
  " warmup_ratio=0.1,\n",
244
  " lr_scheduler_type=\"linear\",\n",
@@ -247,12 +263,12 @@
247
  " per_device_train_batch_size=1,\n",
248
  " gradient_accumulation_steps=4,\n",
249
  " num_generations=NUM_GENERATIONS,\n",
250
- " max_prompt_length=max_prompt_length,\n",
251
- " max_completion_length=max_completion_length,\n",
252
  " max_steps=MAX_STEPS,\n",
253
- " save_steps=100,\n",
254
- " report_to=REPORT_TO,\n",
255
- " output_dir=\"outputs/rhythmenv_trained\",\n",
256
  ")\n",
257
  "\n",
258
  "trainer = GRPOTrainer(\n",
@@ -262,12 +278,7 @@
262
  " args=training_args,\n",
263
  " train_dataset=dataset,\n",
264
  ")\n",
265
- "\n",
266
- "print(f\"Training config: {MAX_STEPS} steps, {NUM_GENERATIONS} generations, lr={LEARNING_RATE}\")\n",
267
- "print(f\" beta=0.1 (higher KL penalty = more conservative = less policy drift)\")\n",
268
- "print(f\" max_grad_norm=0.5 (gradient clipping for stability)\")\n",
269
- "print(f\" max_completion_length=16 (action names only, no verbose outputs)\")\n",
270
- "print(\"Starting training...\")"
271
  ]
272
  },
273
  {
@@ -292,14 +303,14 @@
292
  "metadata": {},
293
  "outputs": [],
294
  "source": [
295
- "OUTPUT_DIR = \"outputs/rhythmenv_trained\"\n",
296
  "\n",
297
  "model.save_pretrained_merged(\n",
298
  " OUTPUT_DIR,\n",
299
  " tokenizer,\n",
300
  " save_method=\"merged_16bit\",\n",
301
  ")\n",
302
- "print(f\"Model saved to: {OUTPUT_DIR}\")"
303
  ]
304
  },
305
  {
@@ -353,7 +364,7 @@
353
  "ax.plot(steps[:len(losses)], losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
354
  "ax.set_xlabel(\"Training Step\", fontsize=12)\n",
355
  "ax.set_ylabel(\"Loss\", fontsize=12)\n",
356
- "ax.set_title(\"GRPO Training Loss \u2014 RhythmEnv Life Simulator\", fontsize=14)\n",
357
  "ax.grid(True, alpha=0.3)\n",
358
  "plt.tight_layout()\n",
359
  "plt.savefig(\"plots/training_loss.png\", dpi=150)\n",
@@ -369,10 +380,10 @@
369
  " import numpy as np\n",
370
  " r = np.array(rewards)\n",
371
  " s = np.array(reward_stds)\n",
372
- " ax.fill_between(reward_steps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"±1 Std Dev\")\n",
373
  " ax.set_xlabel(\"Training Step\", fontsize=12)\n",
374
  " ax.set_ylabel(\"Mean Reward\", fontsize=12)\n",
375
- " ax.set_title(\"GRPO Mean Reward \u2014 RhythmEnv Life Simulator\", fontsize=14)\n",
376
  " ax.legend()\n",
377
  " ax.grid(True, alpha=0.3)\n",
378
  " plt.tight_layout()\n",
@@ -403,39 +414,10 @@
403
  "metadata": {},
404
  "outputs": [],
405
  "source": [
406
- "from training.inference_eval import run_episode, model_action\n",
407
- "from unsloth import FastLanguageModel\n",
408
  "\n",
409
- "# Load trained model for inference\n",
410
- "trained_model, trained_tokenizer = FastLanguageModel.from_pretrained(\n",
411
- " model_name=\"outputs/rhythmenv_trained\",\n",
412
- " load_in_4bit=True,\n",
413
- " max_seq_length=768,\n",
414
- ")\n",
415
- "FastLanguageModel.for_inference(trained_model)\n",
416
- "\n",
417
- "PROFILES = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
418
- "NUM_EVAL = 5\n",
419
- "\n",
420
- "trained_results = {}\n",
421
- "all_eval_results = []\n",
422
- "\n",
423
- "for profile in PROFILES:\n",
424
- " scores = []\n",
425
- " for ep in range(NUM_EVAL):\n",
426
- " result = run_episode(\n",
427
- " profile=profile, seed=ep, strategy=\"model\",\n",
428
- " model=trained_model, tokenizer=trained_tokenizer,\n",
429
- " )\n",
430
- " scores.append(result[\"final_score\"])\n",
431
- " all_eval_results.append(result)\n",
432
- " avg = sum(scores) / len(scores)\n",
433
- " trained_results[profile] = avg\n",
434
- " print(f\"Trained | {profile:25s} | avg_score={avg:.3f} | scores={[round(s,3) for s in scores]}\")\n",
435
- "\n",
436
- "# Save eval results\n",
437
- "with open(\"plots/eval_results.json\", \"w\") as f:\n",
438
- " json.dump(all_eval_results, f, indent=2)"
439
  ]
440
  },
441
  {
@@ -451,52 +433,71 @@
451
  "metadata": {},
452
  "outputs": [],
453
  "source": [
454
- "import matplotlib.pyplot as plt\n",
455
  "import numpy as np\n",
 
456
  "\n",
457
- "profiles_short = [\"Introvert\\nMorning\", \"Extrovert\\nNight Owl\", \"Workaholic\\nStoic\"]\n",
458
- "profile_keys = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
459
- "\n",
460
- "random_scores = [baseline_results[\"random\"][p] for p in profile_keys]\n",
461
- "heuristic_scores = [baseline_results[\"heuristic\"][p] for p in profile_keys]\n",
462
- "trained_scores = [trained_results[p] for p in profile_keys]\n",
463
- "\n",
464
- "x = np.arange(len(profiles_short))\n",
465
- "width = 0.25\n",
466
- "\n",
467
- "fig, ax = plt.subplots(figsize=(10, 6))\n",
468
- "bars1 = ax.bar(x - width, random_scores, width, label=\"Random Baseline\", color=\"#94a3b8\")\n",
469
- "bars2 = ax.bar(x, heuristic_scores, width, label=\"Heuristic Baseline\", color=\"#60a5fa\")\n",
470
- "bars3 = ax.bar(x + width, trained_scores, width, label=\"GRPO Trained\", color=\"#22c55e\")\n",
471
- "\n",
472
- "ax.set_ylabel(\"Final Episode Score (0-1)\", fontsize=12)\n",
473
- "ax.set_title(\"RhythmEnv: Baseline vs Trained Agent Across Hidden Profiles\", fontsize=14)\n",
474
- "ax.set_xticks(x)\n",
475
- "ax.set_xticklabels(profiles_short, fontsize=11)\n",
476
- "ax.legend(fontsize=11)\n",
477
- "ax.set_ylim(0, 1.0)\n",
478
- "ax.grid(axis=\"y\", alpha=0.3)\n",
479
- "\n",
480
- "# Add value labels on bars\n",
481
- "for bars in [bars1, bars2, bars3]:\n",
482
- " for bar in bars:\n",
483
- " height = bar.get_height()\n",
484
- " ax.annotate(f\"{height:.3f}\",\n",
485
- " xy=(bar.get_x() + bar.get_width() / 2, height),\n",
486
- " xytext=(0, 3), textcoords=\"offset points\",\n",
487
- " ha=\"center\", va=\"bottom\", fontsize=9)\n",
488
  "\n",
489
  "plt.tight_layout()\n",
 
490
  "plt.savefig(\"plots/baseline_vs_trained.png\", dpi=150)\n",
491
  "plt.show()\n",
492
  "print(\"Saved: plots/baseline_vs_trained.png\")\n",
493
  "\n",
494
  "# Print summary table\n",
495
- "print(f\"\\n{'Profile':<25} {'Random':>10} {'Heuristic':>10} {'Trained':>10} {'Improvement':>12}\")\n",
496
- "print(\"-\" * 70)\n",
497
- "for i, p in enumerate(profile_keys):\n",
498
- " improvement = trained_scores[i] - heuristic_scores[i]\n",
499
- " print(f\"{p:<25} {random_scores[i]:>10.3f} {heuristic_scores[i]:>10.3f} {trained_scores[i]:>10.3f} {improvement:>+12.3f}\")"
500
  ]
501
  },
502
  {
@@ -519,15 +520,15 @@
519
  "source": [
520
  "# List all generated artifacts\n",
521
  "print(\"Generated files:\")\n",
522
- "for f in os.listdir(\"plots\"):\n",
523
  " size = os.path.getsize(f\"plots/{f}\")\n",
524
  " print(f\" plots/{f} ({size:,} bytes)\")\n",
525
  "\n",
526
- "print(f\"\\nModel saved at: outputs/rhythmenv_trained/\")\n",
527
  "print(\"\\nNext steps:\")\n",
528
  "print(\"1. Download plots/*.png and commit to your repo\")\n",
529
  "print(\"2. Embed plots in README.md\")\n",
530
- "print(\"3. Create mini-blog or video\")"
531
  ]
532
  },
533
  {
@@ -537,8 +538,8 @@
537
  "outputs": [],
538
  "source": [
539
  "# Optional: zip everything for easy download\n",
540
- "!zip -r /content/rhythmenv_training_artifacts.zip plots/ outputs/rhythmenv_trained/training_config.json\n",
541
- "print(\"Download: /content/rhythmenv_training_artifacts.zip\")"
542
  ]
543
  }
544
  ],
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# RhythmEnv GRPO Training \u2014 Meta-RL Life Simulator\n",
8
  "\n",
9
+ "Train an LLM agent to *learn how to learn a person*. The agent receives a 4-layer reward:\n",
10
+ "- format_valid (parseable ACTION + 3 belief digits)\n",
11
+ "- action_legal (one of 10 valid actions)\n",
12
+ "- env_reward (actual env reward via seed-replay)\n",
13
+ "- belief_accuracy (cosine-MAE vs hidden true profile vector)\n",
14
  "\n",
15
+ "Continuous profile space + curriculum warmup + adaptation-aware grader \u2192 trains a meta-policy that generalizes to unseen personalities.\n",
16
  "\n",
17
+ "Default config: 1500 steps on Qwen 2.5-3B + LoRA rank 8, ~6h on Colab T4.\n"
18
  ]
19
  },
20
  {
 
33
  "%%capture\n",
34
  "!pip install --upgrade uv\n",
35
  "!uv pip install --system torch>=2.8.0 triton>=3.4.0 torchvision bitsandbytes\n",
36
+ "!uv pip install --system transformers==4.56.2 trl==0.22.2 datasets wandb peft accelerate\n",
37
  "!pip install unsloth\n",
38
+ "!pip install fastapi uvicorn pydantic matplotlib openenv-core\n"
39
  ]
40
  },
41
  {
 
52
  "outputs": [],
53
  "source": [
54
  "import os\n",
55
+ "import sys\n",
56
  "\n",
57
+ "# HF Space contains the meta-RL code (server/, models.py, training/, inference.py).\n",
58
  "REPO_URL = \"https://huggingface.co/spaces/InosLihka/rhythm_env\"\n",
59
  "ENV_DIR = \"/content/rhythm_env\"\n",
60
  "\n",
 
62
  " !git clone {REPO_URL} {ENV_DIR}\n",
63
  "\n",
64
  "os.chdir(ENV_DIR)\n",
65
  "sys.path.insert(0, ENV_DIR)\n",
66
  "sys.path.insert(0, os.path.join(ENV_DIR, \"training\"))\n",
 
67
  "print(f\"Working directory: {os.getcwd()}\")\n",
68
+ "print(f\"Files: {sorted(os.listdir('.'))}\")\n"
69
  ]
70
  },
71
  {
 
86
  "\n",
87
  "env = RhythmEnvironment()\n",
88
  "obs = env.reset(seed=0)\n",
89
+ "print(f\"Initial state \u2014 Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
90
  " f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
91
  "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
92
  "\n",
93
  "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
94
+ "print(f\"\\nAfter DEEP_WORK \u2014 Reward: {obs.reward:.3f}\")\n",
95
  "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
96
  "print(\"\\nEnvironment is working!\")"
97
  ]
 
109
  "metadata": {},
110
  "outputs": [],
111
  "source": [
112
+ "# Baselines (random + heuristic) under the new meta-RL grader\n",
113
+ "# Runs 3 conditions: discrete-3-profiles (legacy), continuous-in-dist, continuous-OOD\n",
114
+ "!python training/inference_eval.py --num_episodes 5 --output_file eval_baselines_meta.json\n"
115
  ]
116
  },
117
  {
 
126
  "execution_count": null,
127
  "metadata": {},
128
  "outputs": [],
129
+ "source": [
130
+ "from dataset import generate_dataset\n",
131
+ "from datasets import Dataset\n",
132
+ "\n",
133
+ "NUM_EPISODES = 300\n",
134
+ "MAX_SAMPLES = 3000\n",
135
+ "HINT_FRACTION = 0.15 # curriculum warmup\n",
136
+ "\n",
137
+ "raw_samples = generate_dataset(\n",
138
+ " num_episodes=NUM_EPISODES,\n",
139
+ " strategy=\"mixed\",\n",
140
+ " max_samples=MAX_SAMPLES,\n",
141
+ " profile_mode=\"continuous\", # meta-RL: sampled per episode\n",
142
+ " hint_fraction=HINT_FRACTION,\n",
143
+ ")\n",
144
+ "\n",
145
+ "dataset = Dataset.from_list([\n",
146
+ " {\n",
147
+ " \"prompt\": s[\"prompt\"],\n",
148
+ " \"seed\": s[\"seed\"],\n",
149
+ " \"step_index\": s[\"step_index\"],\n",
150
+ " \"action_history\": s[\"action_history\"],\n",
151
+ " \"profile_mode\": s[\"profile_mode\"],\n",
152
+ " } for s in raw_samples\n",
153
+ "])\n",
154
+ "print(f\"Dataset size: {len(dataset)}\")\n"
155
+ ]
156
  },
157
  {
158
  "cell_type": "markdown",
 
169
  "source": [
170
  "from unsloth import FastLanguageModel\n",
171
  "\n",
172
+ "MODEL_NAME = \"unsloth/Qwen2.5-3B-Instruct\"\n",
173
+ "MAX_SEQ_LENGTH = 1024 # bumped from 768 for longer prompts\n",
174
+ "LORA_RANK = 8 # bumped from 4 for meta-RL capacity\n",
175
  "\n",
176
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
177
  " model_name=MODEL_NAME,\n",
 
182
  "model = FastLanguageModel.get_peft_model(\n",
183
  " model,\n",
184
  " r=LORA_RANK,\n",
185
+ " target_modules=[\n",
186
+ " \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
187
+ " \"gate_proj\", \"up_proj\", \"down_proj\",\n",
188
+ " ],\n",
189
  " lora_alpha=LORA_RANK * 2,\n",
190
  " use_gradient_checkpointing=\"unsloth\",\n",
191
  " random_state=3407,\n",
192
  ")\n",
193
+ "print(f\"Loaded {MODEL_NAME} with LoRA rank {LORA_RANK}\")\n"
 
 
194
  ]
195
  },
196
  {
 
205
  "execution_count": null,
206
  "metadata": {},
207
  "outputs": [],
208
+ "source": [
209
+ "from reward_functions import format_valid, action_legal, env_reward, belief_accuracy\n",
210
+ "\n",
211
+ "# Four-layer reward stack:\n",
212
+ "# 1. format_valid -- output parseable as ACTION + 3 belief digits\n",
213
+ "# 2. action_legal -- action is one of 10 valid types\n",
214
+ "# 3. env_reward -- actual env reward via seed-replay (cannot be fabricated)\n",
215
+ "# 4. belief_accuracy -- how close belief is to hidden profile (META-RL signal)\n",
216
+ "reward_funcs = [format_valid, action_legal, env_reward, belief_accuracy]\n",
217
+ "print(\"Reward functions: format_valid, action_legal, env_reward, belief_accuracy\")\n"
218
+ ]
219
  },
220
  {
221
  "cell_type": "markdown",
 
230
  "metadata": {},
231
  "outputs": [],
232
  "source": [
233
+ "# Optional: enable W&B for live training-loss tracking.\n",
234
+ "# (Skip this cell if you don't have a W&B account.)\n",
235
+ "# import wandb; wandb.login(key=\"YOUR_KEY\")\n"
 
 
 
236
  ]
237
  },
238
  {
 
243
  "source": [
244
  "from trl import GRPOConfig, GRPOTrainer\n",
245
  "\n",
246
+ "MAX_STEPS = 1500 # bumped from 500 for meta-RL\n",
247
  "NUM_GENERATIONS = 4\n",
248
+ "LEARNING_RATE = 5e-5\n",
249
+ "BETA = 0.1 # raise to 0.2 if training is unstable\n",
250
+ "MAX_PROMPT_LEN = 600 # bumped from 400 for history + hint\n",
251
+ "MAX_COMPLETION_LEN = 20 # bumped from 16 for ACTION_NAME D D D format\n",
252
  "\n",
253
  "training_args = GRPOConfig(\n",
254
  " temperature=1.0,\n",
255
  " learning_rate=LEARNING_RATE,\n",
256
+ " beta=BETA,\n",
257
+ " max_grad_norm=0.5,\n",
258
  " weight_decay=0.001,\n",
259
  " warmup_ratio=0.1,\n",
260
  " lr_scheduler_type=\"linear\",\n",
 
263
  " per_device_train_batch_size=1,\n",
264
  " gradient_accumulation_steps=4,\n",
265
  " num_generations=NUM_GENERATIONS,\n",
266
+ " max_prompt_length=MAX_PROMPT_LEN,\n",
267
+ " max_completion_length=MAX_COMPLETION_LEN,\n",
268
  " max_steps=MAX_STEPS,\n",
269
+ " save_steps=250, # checkpoint every 250 (Colab can disconnect)\n",
270
+ " report_to=\"none\",\n",
271
+ " output_dir=\"outputs/rhythmenv_meta_trained\",\n",
272
  ")\n",
273
  "\n",
274
  "trainer = GRPOTrainer(\n",
 
278
  " args=training_args,\n",
279
  " train_dataset=dataset,\n",
280
  ")\n",
281
+ "print(f\"Training {MAX_STEPS} steps, {NUM_GENERATIONS} generations per prompt\")\n"
 
 
 
 
 
282
  ]
283
  },
284
  {
 
303
  "metadata": {},
304
  "outputs": [],
305
  "source": [
306
+ "OUTPUT_DIR = \"outputs/rhythmenv_meta_trained\"\n",
307
  "\n",
308
  "model.save_pretrained_merged(\n",
309
  " OUTPUT_DIR,\n",
310
  " tokenizer,\n",
311
  " save_method=\"merged_16bit\",\n",
312
  ")\n",
313
+ "print(f\"Model saved to: {OUTPUT_DIR}\")\n"
314
  ]
315
  },
316
  {
 
364
  "ax.plot(steps[:len(losses)], losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
365
  "ax.set_xlabel(\"Training Step\", fontsize=12)\n",
366
  "ax.set_ylabel(\"Loss\", fontsize=12)\n",
367
+ "ax.set_title(\"GRPO Training Loss \u00e2\u20ac\u201d RhythmEnv Life Simulator\", fontsize=14)\n",
368
  "ax.grid(True, alpha=0.3)\n",
369
  "plt.tight_layout()\n",
370
  "plt.savefig(\"plots/training_loss.png\", dpi=150)\n",
 
380
  " import numpy as np\n",
381
  " r = np.array(rewards)\n",
382
  " s = np.array(reward_stds)\n",
383
+ " ax.fill_between(reward_steps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"\u00c2\u00b11 Std Dev\")\n",
384
  " ax.set_xlabel(\"Training Step\", fontsize=12)\n",
385
  " ax.set_ylabel(\"Mean Reward\", fontsize=12)\n",
386
+ " ax.set_title(\"GRPO Mean Reward \u00e2\u20ac\u201d RhythmEnv Life Simulator\", fontsize=14)\n",
387
  " ax.legend()\n",
388
  " ax.grid(True, alpha=0.3)\n",
389
  " plt.tight_layout()\n",
 
414
  "metadata": {},
415
  "outputs": [],
416
  "source": [
417
+ "OUTPUT_DIR = \"outputs/rhythmenv_meta_trained\"\n",
 
418
  "\n",
419
+ "# Full eval suite: discrete + continuous in-dist + continuous OOD\n",
420
+ "!python training/inference_eval.py --model_path {OUTPUT_DIR} --num_episodes 5 --output_file eval_results.json\n"
421
  ]
422
  },
423
  {
 
433
  "metadata": {},
434
  "outputs": [],
435
  "source": [
436
+ "# Comparison plot: 3 conditions x {random, heuristic, trained}\n",
437
+ "# Reads eval_results.json (written by inference_eval.py in cell 24)\n",
438
+ "import json, os\n",
439
  "import numpy as np\n",
440
+ "import matplotlib.pyplot as plt\n",
441
  "\n",
442
+ "with open(\"eval_results.json\") as f:\n",
443
+ " results = json.load(f)\n",
444
+ "\n",
445
+ "conditions = [\"discrete-3-profiles (legacy)\", \"continuous-in-distribution\", \"continuous-OOD (generalization)\"]\n",
446
+ "strategies = [\"random\", \"heuristic\", \"model\"]\n",
447
+ "\n",
448
+ "def avg(cond, strat, key=\"final_score\"):\n",
449
+ " rs = [r[key] for r in results if r[\"condition\"] == cond and r[\"strategy\"] == strat]\n",
450
+ " return float(np.mean(rs)) if rs else 0.0\n",
451
+ "\n",
452
+ "x = np.arange(len(conditions))\n",
453
+ "width = 0.27\n",
454
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
455
+ "\n",
456
+ "# Plot 1: final_score by condition\n",
457
+ "rand = [avg(c, \"random\") for c in conditions]\n",
458
+ "heur = [avg(c, \"heuristic\") for c in conditions]\n",
459
+ "trnd = [avg(c, \"model\") for c in conditions]\n",
460
+ "axes[0].bar(x - width, rand, width, label=\"Random\", color=\"#94a3b8\")\n",
461
+ "axes[0].bar(x, heur, width, label=\"Heuristic\", color=\"#60a5fa\")\n",
462
+ "axes[0].bar(x + width, trnd, width, label=\"Trained Qwen\", color=\"#22c55e\")\n",
463
+ "axes[0].set_ylabel(\"Final score (0-1)\")\n",
464
+ "axes[0].set_title(\"Final score by condition\")\n",
465
+ "axes[0].set_xticks(x)\n",
466
+ "axes[0].set_xticklabels([c.split(\" \")[0] for c in conditions], fontsize=10)\n",
467
+ "axes[0].legend()\n",
468
+ "axes[0].grid(axis=\"y\", alpha=0.3)\n",
469
+ "for xi, vals in enumerate(zip(rand, heur, trnd)):\n",
470
+ " for di, v in enumerate(vals):\n",
471
+ " axes[0].annotate(f\"{v:.3f}\", xy=(xi + (di-1)*width, v), xytext=(0, 3),\n",
472
+ " textcoords=\"offset points\", ha=\"center\", fontsize=8)\n",
473
+ "\n",
474
+ "# Plot 2: adaptation by condition (THE meta-RL signal)\n",
475
+ "rand_a = [avg(c, \"random\", \"adaptation\") for c in conditions]\n",
476
+ "heur_a = [avg(c, \"heuristic\", \"adaptation\") for c in conditions]\n",
477
+ "trnd_a = [avg(c, \"model\", \"adaptation\") for c in conditions]\n",
478
+ "axes[1].bar(x - width, rand_a, width, label=\"Random\", color=\"#94a3b8\")\n",
479
+ "axes[1].bar(x, heur_a, width, label=\"Heuristic\", color=\"#60a5fa\")\n",
480
+ "axes[1].bar(x + width, trnd_a, width, label=\"Trained Qwen\", color=\"#22c55e\")\n",
481
+ "axes[1].set_ylabel(\"Adaptation (late-half mean - early-half mean)\")\n",
482
+ "axes[1].set_title(\"Adaptation: did agent get better mid-episode?\")\n",
483
+ "axes[1].set_xticks(x)\n",
484
+ "axes[1].set_xticklabels([c.split(\" \")[0] for c in conditions], fontsize=10)\n",
485
+ "axes[1].axhline(0, color=\"k\", linewidth=0.5)\n",
486
+ "axes[1].legend()\n",
487
+ "axes[1].grid(axis=\"y\", alpha=0.3)\n",
488
  "\n",
489
  "plt.tight_layout()\n",
490
+ "os.makedirs(\"plots\", exist_ok=True)\n",
491
  "plt.savefig(\"plots/baseline_vs_trained.png\", dpi=150)\n",
492
  "plt.show()\n",
493
  "print(\"Saved: plots/baseline_vs_trained.png\")\n",
494
  "\n",
495
  "# Print summary table\n",
496
+ "print()\n",
497
+ "print(f\"{'Condition':<40} {'Random':>10} {'Heuristic':>10} {'Trained':>10} {'vs Heuristic':>14}\")\n",
498
+ "print(\"-\" * 90)\n",
499
+ "for c, r, h, t in zip(conditions, rand, heur, trnd):\n",
500
+ " print(f\"{c:<40} {r:>10.3f} {h:>10.3f} {t:>10.3f} {(t - h):>+14.3f}\")\n"
501
  ]
502
  },
503
  {
 
520
  "source": [
521
  "# List all generated artifacts\n",
522
  "print(\"Generated files:\")\n",
523
+ "for f in sorted(os.listdir(\"plots\")):\n",
524
  " size = os.path.getsize(f\"plots/{f}\")\n",
525
  " print(f\" plots/{f} ({size:,} bytes)\")\n",
526
  "\n",
527
+ "print(f\"\\nModel saved at: outputs/rhythmenv_meta_trained/\")\n",
528
  "print(\"\\nNext steps:\")\n",
529
  "print(\"1. Download plots/*.png and commit to your repo\")\n",
530
  "print(\"2. Embed plots in README.md\")\n",
531
+ "print(\"3. Push trained model to HF Space if you want it live\")\n"
532
  ]
533
  },
534
  {
 
538
  "outputs": [],
539
  "source": [
540
  "# Optional: zip everything for easy download\n",
541
+ "!zip -r /content/rhythmenv_training_artifacts.zip plots/ outputs/rhythmenv_meta_trained/training_config.json eval_results.json eval_baselines_meta.json\n",
542
+ "print(\"Download: /content/rhythmenv_training_artifacts.zip\")\n"
543
  ]
544
  }
545
  ],
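
Both env_reward and belief_accuracy (defined in training/reward_functions.py below) depend on seed-replay being exactly deterministic. A minimal self-check, assuming only the RhythmEnvironment API used in these diffs (reset(seed=..., profile_mode=...) and step(...).reward):

    from server.rhythm_environment import RhythmEnvironment
    from models import ActionType, RhythmAction

    def replay_rewards(seed: int, actions: list[str], profile_mode: str = "continuous") -> list[float]:
        # Rebuild the episode from scratch and record each step's reward.
        env = RhythmEnvironment()
        env.reset(seed=seed, profile_mode=profile_mode)
        return [env.step(RhythmAction(action_type=ActionType(a))).reward for a in actions]

    actions = [ActionType.DEEP_WORK.value, ActionType.SLEEP.value, ActionType.SOCIALIZE.value]
    assert replay_rewards(0, actions) == replay_rewards(0, actions)  # replays must be identical
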
training/dataset.py CHANGED
@@ -1,8 +1,17 @@
1
  """
2
- Dataset generator for RhythmEnv GRPO training.
3
 
4
  Generates observation prompts by playing episodes with random/heuristic
5
- strategies, capturing diverse states across all profiles and time slots.
 
 
 
 
 
 
 
 
 
6
  """
7
 
8
  import sys
@@ -18,32 +27,63 @@ SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
18
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
19
 
20
  SYSTEM_PROMPT = (
21
- "You are a life management agent. You balance 5 life meters over a 7-day week.\n"
22
- "Each day has 4 slots: Morning, Afternoon, Evening, Night. Total: 28 steps.\n\n"
23
- "Meters (0.0 to 1.0):\n"
24
- "- Vitality: physical energy, sleep quality\n"
25
- "- Cognition: mental clarity, focus\n"
26
- "- Progress: career/skill advancement\n"
27
- "- Serenity: inner peace, stress management\n"
28
- "- Connection: relationship health\n\n"
29
- "Available actions: DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, "
30
- "FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
31
- "Key rules:\n"
32
- "- Low Vitality reduces effectiveness of all positive actions.\n"
33
- "- If ANY meter drops below 0.1, you get a heavy penalty.\n"
34
- "- Connection decays passively if you don't socialize.\n"
35
- "- Different episodes may have different hidden dynamics - adapt!\n"
36
- "- Time of day matters for cognitive tasks.\n\n"
37
- "Respond with ONLY the action name, nothing else."
 
 
 
 
 
 
38
  )
39
 
40
 
41
- def format_observation_prompt(obs) -> str:
42
- """Format an observation into a user prompt for the LLM."""
 
 
 
 
 
43
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
44
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
45
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
46
 
 
 
 
47
  return (
48
  f"Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})\n"
49
  f"Remaining steps: {obs.remaining_steps}\n\n"
@@ -53,20 +93,38 @@ def format_observation_prompt(obs) -> str:
53
  f" Progress: {obs.progress:.2f}\n"
54
  f" Serenity: {obs.serenity:.2f}\n"
55
  f" Connection: {obs.connection:.2f}"
56
- f"{event_str}\n\n"
57
- f"Choose your action:"
 
 
58
  )
59
 
60
 
61
- def generate_episode_samples(seed: int, strategy: str = "random") -> list:
62
- """
63
- Play one episode and return a list of (prompt, seed, step_index, action_sequence) tuples.
 
 
 
 
 
 
 
 
64
 
65
- Each sample captures the state at a specific step, along with enough info
66
- to replay to that point for reward evaluation.
 
 
 
 
 
 
 
67
  """
68
  env = RhythmEnvironment()
69
- obs = env.reset(seed=seed)
 
70
  rng = random.Random(seed + 1000)
71
  actions_taken = []
72
  samples = []
@@ -76,10 +134,8 @@ def generate_episode_samples(seed: int, strategy: str = "random") -> list:
76
  if obs.done:
77
  break
78
 
79
- # Generate prompt from current observation
80
- prompt = format_observation_prompt(obs)
81
 
82
- # Record sample with replay info
83
  samples.append({
84
  "prompt": [
85
  {"role": "system", "content": SYSTEM_PROMPT},
@@ -87,10 +143,11 @@ def generate_episode_samples(seed: int, strategy: str = "random") -> list:
87
  ],
88
  "seed": seed,
89
  "step_index": step,
90
- "action_history": list(actions_taken), # copy for replay
 
 
91
  })
92
 
93
- # Choose action based on strategy
94
  if strategy == "random":
95
  action_type = rng.choice(all_actions)
96
  elif strategy == "heuristic":
@@ -126,7 +183,6 @@ def _heuristic_action(obs) -> ActionType:
126
  if p < 0.3 and v > 0.3:
127
  return ActionType.LEARN
128
  return ActionType.ADMIN_WORK
129
- # evening
130
  if cn < 0.4:
131
  return ActionType.SOCIALIZE
132
  if s < 0.5:
@@ -138,19 +194,24 @@ def generate_dataset(
138
  num_episodes: int = 200,
139
  strategy: str = "mixed",
140
  max_samples: int = 2000,
 
 
141
  ) -> list:
142
- """
143
- Generate a training dataset by playing multiple episodes.
 
 
 
144
 
145
  Args:
146
  num_episodes: Number of episodes to play.
147
  strategy: "random", "heuristic", or "mixed" (alternating).
148
  max_samples: Maximum samples to return.
149
-
150
- Returns:
151
- List of prompt dicts suitable for HF Dataset.from_list().
152
  """
153
  all_samples = []
 
154
 
155
  for i in range(num_episodes):
156
  seed = i
@@ -158,23 +219,40 @@ def generate_dataset(
158
  s = "heuristic" if i % 2 == 0 else "random"
159
  else:
160
  s = strategy
161
-
162
- episode_samples = generate_episode_samples(seed=seed, strategy=s)
 
 
 
 
 
 
163
  all_samples.extend(episode_samples)
164
 
165
  if len(all_samples) >= max_samples:
166
  break
167
 
168
- # Shuffle and truncate
169
  random.shuffle(all_samples)
170
  all_samples = all_samples[:max_samples]
171
 
172
- print(f"Generated {len(all_samples)} training samples from {min(i+1, num_episodes)} episodes")
 
 
 
 
173
  return all_samples
174
 
175
 
176
  if __name__ == "__main__":
177
- samples = generate_dataset(num_episodes=50, strategy="mixed", max_samples=500)
178
- print(f"\nFirst sample prompt:")
179
- print(samples[0]["prompt"][1]["content"])
180
- print(f"\nSeed: {samples[0]['seed']}, Step: {samples[0]['step_index']}")
 
 
 
 
 
 
 
 
1
  """
2
+ Dataset generator for RhythmEnv GRPO training (meta-RL version).
3
 
4
  Generates observation prompts by playing episodes with random/heuristic
5
+ strategies, capturing diverse states across continuously-sampled profiles.
6
+
7
+ Key changes vs single-task version:
8
+ - Default profile_mode is "continuous" (sampled profile per episode)
9
+ - System prompt asks for ACTION_NAME + 3 belief digits
10
+ - Curriculum support: a fraction of samples include the profile hint in the obs
11
+ (the training setup shows the hint early and masks it later, teaching the
12
+ *skill* of using profile signals before forcing inference)
13
+ - Dataset rows include seed, step_index, action_history, profile_mode
14
+ so env_reward and belief_accuracy can replay deterministically
15
  """
16
 
17
  import sys
 
27
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
28
 
29
  SYSTEM_PROMPT = (
30
+ "You are a life-management agent helping a person whose preferences are HIDDEN.\n"
31
+ "You see 5 life meters and a rolling 7-step history. The same action affects\n"
32
+ "different people differently — you must INFER who you're helping from the\n"
33
+ "rewards and meter changes you observe.\n\n"
34
+ "Each step, output ONE LINE in this exact format:\n"
35
+ " ACTION_NAME S M W\n\n"
36
+ "where ACTION_NAME is one of:\n"
37
+ " DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,\n"
38
+ " FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
39
+ "and S, M, W are single digits (0-9) representing your current belief:\n"
40
+ " S = social preference (0=hates being social, 9=loves being social)\n"
41
+ " M = morning preference (0=night owl, 9=morning person)\n"
42
+ " W = work preference (0=avoids work, 9=workaholic)\n\n"
43
+ "Example: DEEP_WORK 3 8 7\n\n"
44
+ "Tips:\n"
45
+ "- Update your belief based on rewards: if SOCIALIZE gave a big positive reward,\n"
46
+ " raise S; if it tanked vitality without a reward, lower S.\n"
47
+ "- Early in the week, PROBE different actions to learn the person.\n"
48
+ "- Late in the week, EXPLOIT what you've learned — pick actions matching\n"
49
+ " the person's preferences (use your belief to guide).\n"
50
+ "- Watch for crashes: any meter under 0.1 → big penalty.\n"
51
+ "- Connection decays passively — actively maintain it.\n"
52
+ "Respond with ONLY the format line, no other text."
53
  )
54
 
55
 
56
+ def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
57
+ """Format an observation into a user prompt for the LLM.
58
+
59
+ If profile_hint is provided (curriculum's "visible" phase), include it in
60
+ the prompt so the agent learns the *skill* of using profile signals
61
+ before having to infer them from scratch.
62
+ """
63
  day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
64
  slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
65
  event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
66
 
67
+ history_lines = []
68
+ for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
69
+ history_lines.append(
70
+ f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
71
+ f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
72
+ f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
73
+ )
74
+ history_str = ""
75
+ if history_lines:
76
+ history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
77
+
78
+ hint_str = ""
79
+ if profile_hint is not None:
80
+ hint_str = (
81
+ f"\n\nKnown about this person (training hint):\n"
82
+ f" social_pref={profile_hint['social_pref']:.2f}, "
83
+ f"morning_pref={profile_hint['morning_pref']:.2f}, "
84
+ f"work_pref={profile_hint['work_pref']:.2f}"
85
+ )
86
+
87
  return (
88
  f"Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})\n"
89
  f"Remaining steps: {obs.remaining_steps}\n\n"
 
93
  f" Progress: {obs.progress:.2f}\n"
94
  f" Serenity: {obs.serenity:.2f}\n"
95
  f" Connection: {obs.connection:.2f}"
96
+ f"{event_str}"
97
+ f"{history_str}"
98
+ f"{hint_str}\n\n"
99
+ f"Choose your action (format: ACTION S M W):"
100
  )
101
 
102
 
103
+ def generate_episode_samples(
104
+ seed: int,
105
+ strategy: str = "random",
106
+ profile_mode: str = "continuous",
107
+ show_profile_hint: bool = False,
108
+ ) -> list:
109
+ """Play one episode and return a list of training samples.
110
+
111
+ Each sample includes the prompt + replay metadata (seed, step_index,
112
+ action_history, profile_mode) so reward functions can deterministically
113
+ reconstruct the env state.
114
 
115
+ Args:
116
+ seed: Episode seed (also determines profile when profile_mode=continuous).
117
+ strategy: "random" or "heuristic" — used to roll out the episode for
118
+ state diversity. The agent's training generations replace these
119
+ actions; we only need the prefix history for replay.
120
+ profile_mode: "continuous" (sampled per seed) or "discrete" (1 of 3
121
+ hardcoded profiles).
122
+ show_profile_hint: If True, include the true belief vector in the prompt.
123
+ Use during the curriculum's "visible" warmup phase.
124
  """
125
  env = RhythmEnvironment()
126
+ obs = env.reset(seed=seed, profile_mode=profile_mode)
127
+ profile_hint = env.get_profile_hint() if show_profile_hint else None
128
  rng = random.Random(seed + 1000)
129
  actions_taken = []
130
  samples = []
 
134
  if obs.done:
135
  break
136
 
137
+ prompt = format_observation_prompt(obs, profile_hint=profile_hint)
 
138
 
 
139
  samples.append({
140
  "prompt": [
141
  {"role": "system", "content": SYSTEM_PROMPT},
 
143
  ],
144
  "seed": seed,
145
  "step_index": step,
146
+ "action_history": list(actions_taken),
147
+ "profile_mode": profile_mode,
148
+ "show_profile_hint": show_profile_hint,
149
  })
150
 
 
151
  if strategy == "random":
152
  action_type = rng.choice(all_actions)
153
  elif strategy == "heuristic":
 
183
  if p < 0.3 and v > 0.3:
184
  return ActionType.LEARN
185
  return ActionType.ADMIN_WORK
 
186
  if cn < 0.4:
187
  return ActionType.SOCIALIZE
188
  if s < 0.5:
 
194
  num_episodes: int = 200,
195
  strategy: str = "mixed",
196
  max_samples: int = 2000,
197
+ profile_mode: str = "continuous",
198
+ hint_fraction: float = 0.2,
199
  ) -> list:
200
+ """Generate a training dataset by playing multiple episodes.
201
+
202
+ Curriculum is baked into the dataset: hint_fraction of samples have the
203
+ true profile visible (visible-phase warmup). After shuffle, GRPOTrainer
204
+ sees a mix early on; we can sort to put hint samples first if needed.
205
 
206
  Args:
207
  num_episodes: Number of episodes to play.
208
  strategy: "random", "heuristic", or "mixed" (alternating).
209
  max_samples: Maximum samples to return.
210
+ profile_mode: "continuous" (default, meta-RL) or "discrete" (3 profiles).
211
+ hint_fraction: Fraction of episodes to play with profile hint visible.
 
212
  """
213
  all_samples = []
214
+ n_hint_episodes = int(num_episodes * hint_fraction)
215
 
216
  for i in range(num_episodes):
217
  seed = i
 
219
  s = "heuristic" if i % 2 == 0 else "random"
220
  else:
221
  s = strategy
222
+ show_hint = i < n_hint_episodes
223
+
224
+ episode_samples = generate_episode_samples(
225
+ seed=seed,
226
+ strategy=s,
227
+ profile_mode=profile_mode,
228
+ show_profile_hint=show_hint,
229
+ )
230
  all_samples.extend(episode_samples)
231
 
232
  if len(all_samples) >= max_samples:
233
  break
234
 
235
+ # Shuffle (curriculum is per-sample via show_profile_hint flag, not order)
236
  random.shuffle(all_samples)
237
  all_samples = all_samples[:max_samples]
238
 
239
+ n_hint = sum(1 for s in all_samples if s["show_profile_hint"])
240
+ print(
241
+ f"Generated {len(all_samples)} samples from {min(i+1, num_episodes)} episodes "
242
+ f"({n_hint} with profile hint, {len(all_samples) - n_hint} without)"
243
+ )
244
  return all_samples
245
 
246
 
247
  if __name__ == "__main__":
248
+ samples = generate_dataset(num_episodes=20, strategy="mixed", max_samples=80, hint_fraction=0.5)
249
+ print(f"\nFirst sample (with hint):")
250
+ hinted = next((s for s in samples if s["show_profile_hint"]), None)
251
+ if hinted:
252
+ print(hinted["prompt"][1]["content"])
253
+ print(f"\nseed={hinted['seed']}, step={hinted['step_index']}, mode={hinted['profile_mode']}")
254
+
255
+ print(f"\nFirst sample (without hint):")
256
+ plain = next((s for s in samples if not s["show_profile_hint"]), None)
257
+ if plain:
258
+ print(plain["prompt"][1]["content"])
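
The generate_dataset docstring above notes that hinted samples can be sorted first instead of shuffled in. A sketch of that ordering (hypothetical usage; sample dicts as produced above), relying on Python's stable sort:

    from dataset import generate_dataset

    samples = generate_dataset(num_episodes=100, strategy="mixed", max_samples=1000, hint_fraction=0.15)
    samples.sort(key=lambda s: not s["show_profile_hint"])  # False < True, so hinted samples come first
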
training/inference_eval.py CHANGED
@@ -1,38 +1,46 @@
1
  """
2
- RhythmEnv Inference Evaluation — Baseline vs Trained comparison.
3
 
4
- Runs both a heuristic baseline and a trained model on all 3 profiles,
5
- producing a comparison table and score summary.
 
 
 
 
 
6
 
7
  Usage:
8
- # Heuristic baseline only:
9
  python training/inference_eval.py
10
 
11
  # With trained model:
12
  python training/inference_eval.py --model_path outputs/rhythmenv_trained
13
-
14
- # With remote API:
15
- python training/inference_eval.py --api_url https://api.example.com/v1 --api_key xxx
16
  """
17
 
18
  import argparse
19
  import json
20
  import os
 
21
  import sys
22
  from typing import Optional
23
 
24
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
25
 
26
  from models import ActionType, RhythmAction
27
- from server.rhythm_environment import RhythmEnvironment, MAX_STEPS
28
 
29
- PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
30
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
31
  DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
32
 
 
 
 
 
 
33
 
34
  def heuristic_action(obs) -> ActionType:
35
- """Priority-based heuristic baseline."""
36
  slot = obs.slot
37
  v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
38
 
@@ -60,21 +68,55 @@ def heuristic_action(obs) -> ActionType:
60
 
61
 
62
  def random_action(rng) -> ActionType:
63
- """Random baseline."""
64
- import random
65
  return rng.choice(list(ActionType))
66
 
67
 
68
- def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=None) -> dict:
69
- """Run a single episode and return results."""
70
- import random
 
 
 
 
71
  rng = random.Random(seed + 500)
72
 
73
  env = RhythmEnvironment()
74
- obs = env.reset(seed=seed, profile=profile)
 
 
 
 
 
 
75
 
76
  total_reward = 0.0
 
77
  actions_taken = []
 
78
 
79
  for step in range(MAX_STEPS):
80
  if obs.done:
@@ -85,7 +127,8 @@ def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=No
85
  elif strategy == "random":
86
  action_type = random_action(rng)
87
  elif strategy == "model" and model is not None:
88
- action_type = model_action(obs, model, tokenizer)
 
89
  else:
90
  action_type = random_action(rng)
91
 
@@ -93,64 +136,99 @@ def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=No
93
  actions_taken.append(action_type.value)
94
  obs = env.step(action)
95
  total_reward += obs.reward
 
96
 
97
  final_score = obs.reward_breakdown.get("final_score", 0.0)
98
 
 
 
 
 
 
 
 
 
 
 
 
 
99
  return {
100
- "profile": profile,
 
101
  "strategy": strategy,
102
  "seed": seed,
103
  "final_score": round(final_score, 4),
104
  "total_reward": round(total_reward, 2),
 
105
  "vitality": round(obs.vitality, 2),
106
  "cognition": round(obs.cognition, 2),
107
  "progress": round(obs.progress, 2),
108
  "serenity": round(obs.serenity, 2),
109
  "connection": round(obs.connection, 2),
110
  "actions": actions_taken,
 
 
 
111
  }
112
 
113
 
114
- def model_action(obs, model, tokenizer) -> ActionType:
115
- """Get action from trained model."""
116
- from training.dataset import format_observation_prompt, SYSTEM_PROMPT
117
- from training.reward_functions import extract_action
118
-
119
- prompt = format_observation_prompt(obs)
120
- messages = [
121
- {"role": "system", "content": SYSTEM_PROMPT},
122
- {"role": "user", "content": prompt},
123
- ]
124
-
125
- text = tokenizer.apply_chat_template(
126
- messages,
127
- tokenize=False,
128
- add_generation_prompt=True,
129
- )
130
-
131
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
132
- outputs = model.generate(
133
- **inputs,
134
- max_new_tokens=32,
135
- temperature=0.7,
136
- do_sample=True,
137
- )
138
- response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
139
 
140
- action_type = extract_action(response)
141
- return action_type if action_type is not None else ActionType.SLEEP
 
 
 
 
142
 
143
 
144
  def main():
145
- parser = argparse.ArgumentParser(description="Evaluate RhythmEnv agent")
146
  parser.add_argument("--model_path", type=str, default=None,
147
  help="Path to trained model (skip for baseline only)")
148
  parser.add_argument("--num_episodes", type=int, default=5,
149
- help="Episodes per profile per strategy")
150
- parser.add_argument("--output_file", type=str, default="eval_results.json",
151
- help="Output file for results")
 
 
 
152
  args = parser.parse_args()
153
 
 
 
 
 
 
 
 
 
 
154
  model, tokenizer = None, None
155
  strategies = ["heuristic", "random"]
156
 
@@ -169,54 +247,54 @@ def main():
169
  print(f"Warning: Could not load model: {e}")
170
  print("Running baseline-only evaluation.")
171
 
172
- # ---------------------------------------------------------------
173
- # Run evaluations
174
- # ---------------------------------------------------------------
175
  all_results = []
176
 
177
- for strategy in strategies:
178
- print(f"\n{'=' * 50}")
179
- print(f"Strategy: {strategy.upper()}")
180
- print(f"{'=' * 50}")
181
-
182
- for profile in PROFILES:
183
- scores = []
184
- for ep in range(args.num_episodes):
185
- seed = ep
186
- result = run_episode(
187
- profile=profile,
188
- seed=seed,
189
- strategy=strategy,
190
- model=model,
191
- tokenizer=tokenizer,
192
- )
193
- all_results.append(result)
194
- scores.append(result["final_score"])
195
-
196
- avg_score = sum(scores) / len(scores)
197
- print(f" {profile:25s} avg_score={avg_score:.3f} scores={[round(s,3) for s in scores]}")
198
-
199
- # ---------------------------------------------------------------
200
- # Summary table
201
- # ---------------------------------------------------------------
 
 
 
202
  print(f"\n{'=' * 70}")
203
- print("SUMMARY")
204
  print(f"{'=' * 70}")
205
- print(f"{'Profile':<25} {'Heuristic':>10} {'Random':>10}", end="")
206
- if "model" in strategies:
207
- print(f" {'Trained':>10}", end="")
208
  print()
209
  print("-" * 70)
210
-
211
- for profile in PROFILES:
212
- row = f"{profile:<25}"
213
- for strategy in strategies:
214
- profile_results = [r for r in all_results if r["profile"] == profile and r["strategy"] == strategy]
215
- avg = sum(r["final_score"] for r in profile_results) / len(profile_results)
216
- row += f" {avg:>10.3f}"
217
  print(row)
218
 
219
- # Save results
220
  with open(args.output_file, "w") as f:
221
  json.dump(all_results, f, indent=2)
222
  print(f"\nResults saved to: {args.output_file}")
 
1
  """
2
+ RhythmEnv Inference Evaluation — Baseline vs Trained, with meta-RL eval suite.
3
 
4
+ Three evaluation conditions:
5
+ 1. discrete-3-profiles: Original 3 hardcoded profiles for legacy comparison
6
+ (compare against eval_results_v1.json from the single-task version)
7
+ 2. continuous-in-distribution: Sampled profiles from the training distribution
8
+ (was the agent able to learn the meta-policy?)
9
+ 3. continuous-OOD: Profiles from a held-out region of the parameter space
10
+ (does the meta-policy generalize, or did the agent memorize?)
11
 
12
  Usage:
13
+ # Baselines only (no trained model):
14
  python training/inference_eval.py
15
 
16
  # With trained model:
17
  python training/inference_eval.py --model_path outputs/rhythmenv_trained
 
 
 
18
  """
19
 
20
  import argparse
21
  import json
22
  import os
23
+ import random
24
  import sys
25
  from typing import Optional
26
 
27
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
28
 
29
  from models import ActionType, RhythmAction
30
+ from server.rhythm_environment import RhythmEnvironment, MAX_STEPS, sample_profile, profile_to_belief_vector
31
 
32
+ DISCRETE_PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
33
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
34
  DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
35
 
36
+ # Seed ranges: training samples profiles from seeds [0, 200); the OOD region is [10000, 10030)
37
+ # (the 10000 offset pushes sample_profile into a held-out, statistically distinct parameter region)
38
+ IN_DIST_SEEDS_DEFAULT = list(range(100, 110)) # 10 in-distribution seeds (same profile distribution as training)
39
+ OOD_SEEDS_DEFAULT = list(range(10000, 10010)) # first 10 seeds of the held-out OOD region
40
+
41
 
42
  def heuristic_action(obs) -> ActionType:
43
+ """Priority-based heuristic baseline (profile-blind)."""
44
  slot = obs.slot
45
  v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
46
 
 
68
 
69
 
70
  def random_action(rng) -> ActionType:
 
 
71
  return rng.choice(list(ActionType))
72
 
73
 
74
+ def model_action(obs, model, tokenizer, return_belief: bool = False):
75
+ """Get action (and optionally belief) from trained model."""
76
+ from training.dataset import format_observation_prompt, SYSTEM_PROMPT
77
+ from training.reward_functions import extract_action_and_belief
78
+
79
+ prompt = format_observation_prompt(obs)
80
+ messages = [
81
+ {"role": "system", "content": SYSTEM_PROMPT},
82
+ {"role": "user", "content": prompt},
83
+ ]
84
+
85
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
86
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
87
+ outputs = model.generate(**inputs, max_new_tokens=20, temperature=0.7, do_sample=True)
88
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
89
+
90
+ action_type, belief, _ = extract_action_and_belief(response)
91
+ if action_type is None:
92
+ action_type = ActionType.SLEEP
93
+ return (action_type, belief) if return_belief else action_type
94
+
95
+
96
+ def run_episode(
97
+ seed: int,
98
+ strategy: str,
99
+ profile_mode: str = "continuous",
100
+ profile: Optional[str] = None,
101
+ model=None,
102
+ tokenizer=None,
103
+ ) -> dict:
104
+ """Run a single episode and return per-episode metrics."""
105
  rng = random.Random(seed + 500)
106
 
107
  env = RhythmEnvironment()
108
+ if profile is not None:
109
+ obs = env.reset(seed=seed, profile=profile)
110
+ else:
111
+ obs = env.reset(seed=seed, profile_mode=profile_mode)
112
+
113
+ true_belief = env.get_belief_target()
114
+ profile_name = env.state.profile_name
115
 
116
  total_reward = 0.0
117
+ step_rewards = []
118
  actions_taken = []
119
+ beliefs_seen = [] # for trained model
120
 
121
  for step in range(MAX_STEPS):
122
  if obs.done:
 
127
  elif strategy == "random":
128
  action_type = random_action(rng)
129
  elif strategy == "model" and model is not None:
130
+ action_type, belief = model_action(obs, model, tokenizer, return_belief=True)
131
+ beliefs_seen.append(belief)
132
  else:
133
  action_type = random_action(rng)
134
 
 
136
  actions_taken.append(action_type.value)
137
  obs = env.step(action)
138
  total_reward += obs.reward
139
+ step_rewards.append(obs.reward)
140
 
141
  final_score = obs.reward_breakdown.get("final_score", 0.0)
142
 
143
+ # Adaptation: late-half mean minus early-half mean
144
+ half = max(len(step_rewards) // 2, 1)
145
+ early = step_rewards[:half]
146
+ late = step_rewards[half:]
147
+ adaptation = (sum(late) / len(late) - sum(early) / len(early)) if (early and late) else 0.0
148
+
149
+ # Belief tracking (only for trained model)
150
+ final_belief = beliefs_seen[-1] if beliefs_seen else None
151
+ belief_mae = None
152
+ if final_belief is not None:
153
+ belief_mae = sum(abs(b - t) for b, t in zip(final_belief, true_belief)) / 3.0
154
+
155
  return {
156
+ "profile_name": profile_name,
157
+ "profile_mode": profile_mode if profile is None else "discrete",
158
  "strategy": strategy,
159
  "seed": seed,
160
  "final_score": round(final_score, 4),
161
  "total_reward": round(total_reward, 2),
162
+ "adaptation": round(adaptation, 3),
163
  "vitality": round(obs.vitality, 2),
164
  "cognition": round(obs.cognition, 2),
165
  "progress": round(obs.progress, 2),
166
  "serenity": round(obs.serenity, 2),
167
  "connection": round(obs.connection, 2),
168
  "actions": actions_taken,
169
+ "true_belief": [round(x, 3) for x in true_belief],
170
+ "final_belief": [round(x, 3) for x in final_belief] if final_belief is not None else None,
171
+ "belief_mae": round(belief_mae, 3) if belief_mae is not None else None,
172
  }
173
 
174
 
175
+ def eval_condition(
176
+ name: str,
177
+ strategies: list[str],
178
+ runs: list[dict],
179
+ model=None,
180
+ tokenizer=None,
181
+ ) -> list[dict]:
182
+ """Run an eval condition and print summary."""
183
+ print(f"\n{'=' * 60}")
184
+ print(f"Condition: {name}")
185
+ print(f"{'=' * 60}")
 
 
 
 
 
186
 
187
+ results = []
188
+ for strategy in strategies:
189
+ print(f"\n Strategy: {strategy.upper()}")
190
+ scores = []
191
+ adaptations = []
192
+ belief_maes = []
193
+ for run in runs:
194
+ r = run_episode(strategy=strategy, model=model, tokenizer=tokenizer, **run)
195
+ results.append({"condition": name, **r})
196
+ scores.append(r["final_score"])
197
+ adaptations.append(r["adaptation"])
198
+ if r["belief_mae"] is not None:
199
+ belief_maes.append(r["belief_mae"])
200
+ avg_score = sum(scores) / len(scores) if scores else 0.0
201
+ avg_adapt = sum(adaptations) / len(adaptations) if adaptations else 0.0
202
+ avg_mae = sum(belief_maes) / len(belief_maes) if belief_maes else None
203
+ line = f" avg_score={avg_score:.3f} avg_adaptation={avg_adapt:+.3f}"
204
+ if avg_mae is not None:
205
+ line += f" avg_belief_mae={avg_mae:.3f}"
206
+ print(line)
207
+ return results
208
 
209
 
210
  def main():
211
+ parser = argparse.ArgumentParser(description="Evaluate RhythmEnv agent (meta-RL eval suite)")
212
  parser.add_argument("--model_path", type=str, default=None,
213
  help="Path to trained model (skip for baseline only)")
214
  parser.add_argument("--num_episodes", type=int, default=5,
215
+ help="Episodes per condition per strategy (for discrete: per-profile)")
216
+ parser.add_argument("--output_file", type=str, default="eval_results.json")
217
+ parser.add_argument("--in_dist_seeds", type=str, default=None,
218
+ help="Comma-separated seeds for in-distribution eval")
219
+ parser.add_argument("--ood_seeds", type=str, default=None,
220
+ help="Comma-separated seeds for OOD eval")
221
  args = parser.parse_args()
222
 
223
+ in_dist_seeds = (
224
+ [int(s) for s in args.in_dist_seeds.split(",")] if args.in_dist_seeds
225
+ else IN_DIST_SEEDS_DEFAULT[:args.num_episodes * 2]
226
+ )
227
+ ood_seeds = (
228
+ [int(s) for s in args.ood_seeds.split(",")] if args.ood_seeds
229
+ else OOD_SEEDS_DEFAULT[:args.num_episodes * 2]
230
+ )
231
+
232
  model, tokenizer = None, None
233
  strategies = ["heuristic", "random"]
234
 
 
247
  print(f"Warning: Could not load model: {e}")
248
  print("Running baseline-only evaluation.")
249
 
 
 
 
250
  all_results = []
251
 
252
+ # Condition 1: Discrete 3 hardcoded profiles (legacy comparison)
253
+ discrete_runs = [
254
+ {"seed": ep, "profile": p, "profile_mode": "discrete"}
255
+ for p in DISCRETE_PROFILES for ep in range(args.num_episodes)
256
+ ]
257
+ all_results += eval_condition(
258
+ "discrete-3-profiles (legacy)",
259
+ strategies, discrete_runs,
260
+ model=model, tokenizer=tokenizer,
261
+ )
262
+
263
+ # Condition 2: In-distribution sampled profiles
264
+ in_dist_runs = [{"seed": s, "profile_mode": "continuous"} for s in in_dist_seeds]
265
+ all_results += eval_condition(
266
+ "continuous-in-distribution",
267
+ strategies, in_dist_runs,
268
+ model=model, tokenizer=tokenizer,
269
+ )
270
+
271
+ # Condition 3: OOD sampled profiles (the meta-learning generalization test)
272
+ ood_runs = [{"seed": s, "profile_mode": "continuous"} for s in ood_seeds]
273
+ all_results += eval_condition(
274
+ "continuous-OOD (generalization)",
275
+ strategies, ood_runs,
276
+ model=model, tokenizer=tokenizer,
277
+ )
278
+
279
+ # Per-profile breakdown for discrete (preserves legacy table format)
280
  print(f"\n{'=' * 70}")
281
+ print("DISCRETE-3-PROFILE BREAKDOWN")
282
  print(f"{'=' * 70}")
283
+ print(f"{'Profile':<25} ", end="")
284
+ for s in strategies:
285
+ print(f"{s:>10}", end="")
286
  print()
287
  print("-" * 70)
288
+ discrete = [r for r in all_results if r["condition"] == "discrete-3-profiles (legacy)"]
289
+ for profile in DISCRETE_PROFILES:
290
+ row = f"{profile:<25} "
291
+ for s in strategies:
292
+ rs = [r for r in discrete if r["profile_name"] == profile and r["strategy"] == s]
293
+ avg = sum(r["final_score"] for r in rs) / len(rs) if rs else 0.0
294
+ row += f"{avg:>10.3f}"
295
  print(row)
296
 
297
+ # Save
298
  with open(args.output_file, "w") as f:
299
  json.dump(all_results, f, indent=2)
300
  print(f"\nResults saved to: {args.output_file}")
training/reward_functions.py CHANGED
@@ -1,17 +1,29 @@
1
  """
2
- Reward functions for RhythmEnv GRPO training.
3
 
4
- Three-layer reward stack:
5
- 1. format_valid is the LLM output a parseable action name?
6
- 2. action_legal — is it one of the 10 valid actions?
7
- 3. env_reward — actual reward from stepping the environment
 
 
 
 
 
 
 
 
 
 
 
8
 
9
  Each function returns a list of floats (one per completion).
10
  """
11
 
12
- import sys
13
  import os
14
  import re
 
15
 
16
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
17
 
@@ -21,39 +33,107 @@ from server.rhythm_environment import RhythmEnvironment
21
  # All valid action names (uppercase for matching)
22
  VALID_ACTIONS = {at.value.upper(): at for at in ActionType}
23
 
 
 
24
 
25
- def extract_action(text: str) -> ActionType | None:
26
- """Parse LLM output into an ActionType. Returns None if unparseable."""
 
 
 
 
 
 
 
 
27
  if not text:
28
- return None
 
 
 
 
29
 
30
- # Clean up: take first line, strip whitespace, uppercase
31
- text = text.strip().split("\n")[0].strip().upper().replace(" ", "_")
32
 
33
- # Direct match
34
- if text in VALID_ACTIONS:
35
- return VALID_ACTIONS[text]
36
 
37
- # Partial match (e.g., "I choose DEEP_WORK" → DEEP_WORK)
38
- for name, at in VALID_ACTIONS.items():
39
- if name in text:
40
- return at
41
 
42
- return None
 
 
 
43
 
44
 
45
  def format_valid(completions, **kwargs) -> list[float]:
46
  """
47
- Layer 1: Does the LLM output contain a parseable action?
48
 
49
- +1.0 if valid action found, -2.0 if not.
50
- This teaches the model to output clean action names.
 
51
  """
52
  scores = []
53
  for completion in completions:
54
  response = completion[0]["content"] if isinstance(completion, list) else completion
55
- action = extract_action(response)
56
- scores.append(1.0 if action is not None else -2.0)
 
 
 
 
 
57
  return scores
58
 
59
 
@@ -61,9 +141,7 @@ def action_legal(completions, **kwargs) -> list[float]:
61
  """
62
  Layer 2: Is the parsed action one of the 10 valid actions?
63
 
64
- For our env, all 10 actions are always legal (no state-dependent validity).
65
- This is simpler than the workday scheduler where START_TASK required idle state.
66
-
67
  +0.5 if legal, -1.0 if not parseable.
68
  """
69
  scores = []
@@ -74,24 +152,32 @@ def action_legal(completions, **kwargs) -> list[float]:
74
  return scores
75
 
76
 
 
 
 
 
 
 
 
 
 
77
  def env_reward(
78
  completions,
79
  prompts=None,
80
  seed=None,
81
  step_index=None,
82
  action_history=None,
 
83
  **kwargs,
84
  ) -> list[float]:
85
  """
86
  Layer 3: Actual environment reward from stepping with the chosen action.
87
 
88
  Replays the episode to the correct state, then steps with the model's action.
89
- Returns the raw env reward (scaled).
90
-
91
- This is the core training signal — the model learns which actions
92
- produce higher rewards for different states and hidden profiles.
93
 
94
  seed/step_index/action_history are passed as dataset columns by GRPOTrainer.
 
95
  """
96
  scores = []
97
 
@@ -103,48 +189,87 @@ def env_reward(
103
  scores.append(-3.0)
104
  continue
105
 
106
- # Prefer direct dataset column kwargs (passed by GRPOTrainer)
107
  if seed is not None and i < len(seed):
108
  ep_seed = seed[i]
109
  ep_history = action_history[i] if action_history is not None else []
 
110
  else:
111
- # Legacy: try to read from prompt dict (fallback for old-style prompts)
112
  prompt_data = prompts[i] if prompts and i < len(prompts) else None
113
  if prompt_data and isinstance(prompt_data, dict) and "seed" in prompt_data:
114
  ep_seed = prompt_data["seed"]
115
  ep_history = prompt_data.get("action_history", [])
 
116
  else:
117
- ep_seed = i % 50 # vary seed so training sees diverse states
118
  ep_history = []
 
119
 
120
  try:
121
- # Replay to the correct state
122
- env = RhythmEnvironment()
123
- env.reset(seed=ep_seed)
 
 
124
 
125
- for past_action_name in ep_history:
126
- past_action = RhythmAction(action_type=ActionType(past_action_name))
127
- env.step(past_action)
128
 
129
- # Step with the model's chosen action
130
- action = RhythmAction(action_type=action_type)
131
- obs = env.step(action)
132
 
133
- # Return the env reward (already in reasonable range from REWARD_SCALE)
134
- scores.append(obs.reward)
 
 
 
 
 
 
135
  except Exception:
136
- scores.append(-3.0)
137
 
138
  return scores
139
 
140
 
141
  def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
142
  """
143
- State-aware heuristic reward (no env replay).
144
-
145
- Parses meter values and time-of-day from the prompt text,
146
- then scores whether the chosen action addresses the most pressing need.
147
- Faster than env_reward but less precise.
148
  """
149
  scores = []
150
  for i, completion in enumerate(completions):
@@ -155,7 +280,6 @@ def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
155
  scores.append(-2.0)
156
  continue
157
 
158
- # Parse prompt for state info
159
  prompt_text = ""
160
  if prompts and i < len(prompts):
161
  p = prompts[i]
@@ -166,24 +290,20 @@ def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
166
  else:
167
  prompt_text = str(p)
168
 
169
- # Extract meter values with regex
170
  meters = {}
171
  for meter_name in ["Vitality", "Cognition", "Progress", "Serenity", "Connection"]:
172
  match = re.search(rf"{meter_name}:\s*([\d.]+)", prompt_text)
173
  if match:
174
  meters[meter_name.lower()] = float(match.group(1))
175
 
176
- # Extract time slot
177
  is_morning = "Morning" in prompt_text
178
  is_night = "Night" in prompt_text
179
 
180
  score = 0.0
181
-
182
  v = meters.get("vitality", 0.5)
183
  s = meters.get("serenity", 0.5)
184
  cn = meters.get("connection", 0.5)
185
 
186
- # Reward actions that address critically low meters
187
  if v < 0.2 and action_type in (ActionType.SLEEP, ActionType.EXERCISE):
188
  score += 0.5
189
  elif v < 0.2:
@@ -199,13 +319,11 @@ def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
199
  elif cn < 0.2:
200
  score -= 0.1
201
 
202
- # Time-appropriate bonuses
203
  if is_morning and action_type in (ActionType.DEEP_WORK, ActionType.LEARN):
204
  score += 0.2
205
  if is_night and action_type == ActionType.SLEEP:
206
  score += 0.3
207
 
208
- # Penalize binge
209
  if action_type == ActionType.BINGE_WATCH:
210
  score -= 0.3
211
 
 
1
  """
2
+ Reward functions for RhythmEnv GRPO training (meta-RL version).
3
 
4
+ Four-layer reward stack:
5
+ 1. format_valid does the LLM output have a parseable action + belief format?
6
+ 2. action_legal — is the action one of the 10 valid actions?
7
+ 3. env_reward — actual environment reward (seed-replay) for the chosen action
8
+ 4. belief_accuracy — how close is the belief vector to the hidden profile's true vector?
9
+
10
+ Action output format: "ACTION_NAME S M W"
11
+ - ACTION_NAME: one of 10 valid actions
12
+ - S, M, W: single digits 0-9 representing the agent's belief about the user
13
+ S = social preference (0=hates social, 9=loves social)
14
+ M = morning preference (0=night owl, 9=morning person)
15
+ W = work preference (0=avoids work, 9=workaholic)
16
+
17
+ Example: "DEEP_WORK 3 8 7"
18
+ → action=DEEP_WORK, belief=[0.33, 0.89, 0.78]
19
 
20
  Each function returns a list of floats (one per completion).
21
  """
22
 
23
+ import math
24
  import os
25
  import re
26
+ import sys
27
 
28
  sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
29
 
 
33
  # All valid action names (uppercase for matching)
34
  VALID_ACTIONS = {at.value.upper(): at for at in ActionType}
35
 
36
+ # Default belief returned when the LLM doesn't provide one (neutral)
37
+ DEFAULT_BELIEF = [0.5, 0.5, 0.5]
38
 
39
+
40
+ def extract_action_and_belief(text: str) -> tuple[ActionType | None, list[float], bool]:
41
+ """Parse 'ACTION_NAME S M W' format.
42
+
43
+ Returns:
44
+ (action, belief, belief_provided):
45
+ action: parsed ActionType or None
46
+ belief: 3-dim vector in [0, 1], DEFAULT_BELIEF if not parseable
47
+ belief_provided: True iff at least one belief digit was parsed from output
48
+ """
49
  if not text:
50
+ return None, list(DEFAULT_BELIEF), False
51
+
52
+ line = text.strip().split("\n")[0].strip()
53
+ if not line:
54
+ return None, list(DEFAULT_BELIEF), False
55
+
56
+ # Normalize: uppercase, comma → space
57
+ parts = line.upper().replace(",", " ").split()
58
+ if not parts:
59
+ return None, list(DEFAULT_BELIEF), False
60
+
61
+ # Find action and its index in parts (try first token, then any token)
62
+ action: ActionType | None = None
63
+ action_idx = -1
64
+ for idx, p in enumerate(parts):
65
+ if p in VALID_ACTIONS:
66
+ action = VALID_ACTIONS[p]
67
+ action_idx = idx
68
+ break
69
+ if action is None:
70
+ for idx, p in enumerate(parts):
71
+ for name, at in VALID_ACTIONS.items():
72
+ if name in p:
73
+ action = at
74
+ action_idx = idx
75
+ break
76
+ if action is not None:
77
+ break
78
+
79
+ # Parse next 3 tokens AFTER the action as belief digits/floats
80
+ belief: list[float] = []
81
+ belief_provided = False
82
+ if action_idx >= 0:
83
+ for i in range(3):
84
+ j = action_idx + 1 + i
85
+ if j < len(parts):
86
+ p = parts[j].strip().rstrip(".")
87
+ if not p:
88
+ belief.append(0.5)
89
+ continue
90
+ try:
91
+ if len(p) == 1 and p.isdigit():
92
+ belief.append(int(p) / 9.0)
93
+ belief_provided = True
94
+ else:
95
+ val = float(p)
96
+ if val > 1.0:
97
+ val = val / 9.0 # interpret as 0-9 scale
98
+ belief.append(max(0.0, min(1.0, val)))
99
+ belief_provided = True
100
+ except (ValueError, IndexError):
101
+ belief.append(0.5)
102
+ else:
103
+ belief.append(0.5)
104
+ else:
105
+ belief = list(DEFAULT_BELIEF)
106
 
107
+ if not belief:
108
+ belief = list(DEFAULT_BELIEF)
109
 
110
+ return action, belief, belief_provided
 
 
111
 
 
 
 
 
112
 
113
+ def extract_action(text: str) -> ActionType | None:
114
+ """Backward-compatible action-only extractor (used by env_reward replay)."""
115
+ action, _, _ = extract_action_and_belief(text)
116
+ return action
117
 
118
 
119
  def format_valid(completions, **kwargs) -> list[float]:
120
  """
121
+ Layer 1: Is the output a parseable action + (optionally) belief digits?
122
 
123
+ +1.0 if action + 3 belief digits, +0.5 if action only, -2.0 if no action.
124
+ Graduated reward pushes the model toward emitting belief without harshly
125
+ punishing action-only output during early training.
126
  """
127
  scores = []
128
  for completion in completions:
129
  response = completion[0]["content"] if isinstance(completion, list) else completion
130
+ action, _, belief_provided = extract_action_and_belief(response)
131
+ if action is None:
132
+ scores.append(-2.0)
133
+ elif belief_provided:
134
+ scores.append(1.0)
135
+ else:
136
+ scores.append(0.5)
137
  return scores
138
 
139
 
 
141
  """
142
  Layer 2: Is the parsed action one of the 10 valid actions?
143
 
144
+ All 10 actions are always legal in this env (no state-dependent validity).
 
 
145
  +0.5 if legal, -1.0 if not parseable.
146
  """
147
  scores = []
 
152
  return scores
153
 
154
 
155
+ def _replay_env(seed: int, history: list, profile_mode: str = "continuous") -> RhythmEnvironment:
156
+ """Build a replayed env at the given seed/history/mode."""
157
+ env = RhythmEnvironment()
158
+ env.reset(seed=seed, profile_mode=profile_mode)
159
+ for past_action_name in history:
160
+ env.step(RhythmAction(action_type=ActionType(past_action_name)))
161
+ return env
162
+
163
+
164
  def env_reward(
165
  completions,
166
  prompts=None,
167
  seed=None,
168
  step_index=None,
169
  action_history=None,
170
+ profile_mode=None,
171
  **kwargs,
172
  ) -> list[float]:
173
  """
174
  Layer 3: Actual environment reward from stepping with the chosen action.
175
 
176
  Replays the episode to the correct state, then steps with the model's action.
177
+ Returns the raw env reward (already scaled by REWARD_SCALE).
 
 
 
178
 
179
  seed/step_index/action_history are passed as dataset columns by GRPOTrainer.
180
+ profile_mode column controls whether replay uses continuous or discrete profile.
181
  """
182
  scores = []
183
 
 
189
  scores.append(-3.0)
190
  continue
191
 
192
+ # Per-row dataset values (preferred path)
193
  if seed is not None and i < len(seed):
194
  ep_seed = seed[i]
195
  ep_history = action_history[i] if action_history is not None else []
196
+ ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
197
  else:
 
198
  prompt_data = prompts[i] if prompts and i < len(prompts) else None
199
  if prompt_data and isinstance(prompt_data, dict) and "seed" in prompt_data:
200
  ep_seed = prompt_data["seed"]
201
  ep_history = prompt_data.get("action_history", [])
202
+ ep_mode = prompt_data.get("profile_mode", "continuous")
203
  else:
204
+ ep_seed = i % 50
205
  ep_history = []
206
+ ep_mode = "continuous"
207
 
208
  try:
209
+ env = _replay_env(ep_seed, ep_history, ep_mode)
210
+ obs = env.step(RhythmAction(action_type=action_type))
211
+ scores.append(obs.reward)
212
+ except Exception:
213
+ scores.append(-3.0)
214
 
215
+ return scores
 
 
216
 
 
 
 
217
 
218
+ def belief_accuracy(
219
+ completions,
220
+ prompts=None,
221
+ seed=None,
222
+ step_index=None,
223
+ action_history=None,
224
+ profile_mode=None,
225
+ **kwargs,
226
+ ) -> list[float]:
227
+ """
228
+ Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
229
+
230
+ Compares the agent's emitted [social, morning, work] belief vector to the
231
+ hidden profile's true belief vector. Reward in [-0.5, +0.5]:
232
+ perfect match → +0.5
233
+ neutral [0.5,0.5,0.5] → 0.0 (zero-effort baseline)
234
+ max wrong → -0.5
235
+
236
+ Mean-absolute-error based (cleaner than cosine for [0,1] vectors).
237
+ Skipped (returns 0) if no seed available — keeps reward conservative.
238
+ """
239
+ scores = []
240
+ for i, completion in enumerate(completions):
241
+ response = completion[0]["content"] if isinstance(completion, list) else completion
242
+ _, belief, belief_provided = extract_action_and_belief(response)
243
+
244
+ if not belief_provided:
245
+ scores.append(-0.2) # weak push toward emitting belief
246
+ continue
247
+
248
+ # Resolve seed/mode for replay
249
+ if seed is not None and i < len(seed):
250
+ ep_seed = seed[i]
251
+ ep_history = action_history[i] if action_history is not None else []
252
+ ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
253
+ else:
254
+ scores.append(0.0)
255
+ continue
256
+
257
+ try:
258
+ env = _replay_env(ep_seed, ep_history, ep_mode)
259
+ true_belief = env.get_belief_target()
260
+ mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
261
+ similarity = 1.0 - mae # in [0, 1]
262
+ scores.append(similarity - 0.5) # in [-0.5, +0.5]
263
  except Exception:
264
+ scores.append(0.0)
265
 
266
  return scores


 def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
     """
+    State-aware heuristic reward (no env replay), used for fast smoke-training.
+    Identical to the original simple reward — the format change is handled by extract_action.
     """
     scores = []
     for i, completion in enumerate(completions):

             scores.append(-2.0)
             continue

         prompt_text = ""
         if prompts and i < len(prompts):
             p = prompts[i]

             else:
                 prompt_text = str(p)

         meters = {}
         for meter_name in ["Vitality", "Cognition", "Progress", "Serenity", "Connection"]:
             match = re.search(rf"{meter_name}:\s*([\d.]+)", prompt_text)
             if match:
                 meters[meter_name.lower()] = float(match.group(1))

         is_morning = "Morning" in prompt_text
         is_night = "Night" in prompt_text

         score = 0.0
         v = meters.get("vitality", 0.5)
         s = meters.get("serenity", 0.5)
         cn = meters.get("connection", 0.5)

         if v < 0.2 and action_type in (ActionType.SLEEP, ActionType.EXERCISE):
             score += 0.5
         elif v < 0.2:

         elif cn < 0.2:
             score -= 0.1

         if is_morning and action_type in (ActionType.DEEP_WORK, ActionType.LEARN):
             score += 0.2
         if is_night and action_type == ActionType.SLEEP:
             score += 0.3

         if action_type == ActionType.BINGE_WATCH:
             score -= 0.3
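Both replay-based layers call extract_action_and_belief, which is defined near the top of reward_functions.py and outside this hunk. A hypothetical sketch of the parsing contract for the one-line "ACTION_NAME D D D" completion format (the regex and the digit-to-[0,1] rescaling are illustrative assumptions, not the repo's exact rules):

    import re

    # Hypothetical sketch; the real parser lives earlier in this file.
    def extract_action_and_belief(text):
        m = re.match(r"([A-Z_]+)(?:\s+(\d)\s+(\d)\s+(\d))?", text.strip().upper())
        if not m:
            return None, None, False            # unparseable completion
        action = m.group(1)
        if m.group(2) is None:
            return action, None, False          # action present, belief digits missing
        belief = [int(d) / 9.0 for d in m.groups()[1:]]  # digit 0-9 -> [0, 1]
        return action, belief, True

The third element of the triple is what belief_accuracy keys on: it only scores completions whose belief digits were actually present.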
training/train.py CHANGED
@@ -1,15 +1,20 @@
 """
-RhythmEnv GRPO Training Script
+RhythmEnv GRPO Training Script (Meta-RL version).

-Trains an LLM agent to balance life meters using Group Relative Policy
-Optimization (GRPO) with Unsloth for efficient fine-tuning.
+Trains an LLM agent to BOTH (a) balance life meters AND (b) infer the hidden
+personality of the person it's helping. Four-layer reward stack:

-Usage (Colab with GPU):
+    format_valid    — output parseable as ACTION + 3 belief digits
+    action_legal    — action is one of 10 valid types
+    env_reward      — actual env reward for the chosen action (seed replay)
+    belief_accuracy — how close the belief vector is to the hidden profile
+
+Usage (Colab T4):
     !pip install unsloth transformers trl datasets
-    !python training/train.py
+    !python training/train.py --max_steps 1500

-Usage (local):
-    python training/train.py --max_steps 100 --model_name "unsloth/Qwen2.5-3B-Instruct"
+Usage (local — for setup checks; needs CUDA for actual training):
+    python training/train.py --max_steps 50 --use_simple_reward
 """

 import argparse
@@ -21,32 +26,37 @@ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


 def main():
-    parser = argparse.ArgumentParser(description="Train RhythmEnv agent with GRPO")
-    parser.add_argument("--model_name", type=str, default="unsloth/Qwen2.5-3B-Instruct",
-                        help="Model to fine-tune (HF model ID)")
-    parser.add_argument("--max_steps", type=int, default=500,
-                        help="Number of GRPO training steps")
-    parser.add_argument("--num_episodes", type=int, default=200,
-                        help="Number of episodes for dataset generation")
-    parser.add_argument("--max_samples", type=int, default=2000,
+    parser = argparse.ArgumentParser(description="Train RhythmEnv agent with GRPO (meta-RL)")
+    parser.add_argument("--model_name", type=str, default="unsloth/Qwen2.5-3B-Instruct")
+    parser.add_argument("--max_steps", type=int, default=1500,
+                        help="Number of GRPO training steps (1500 recommended for meta-RL)")
+    parser.add_argument("--num_episodes", type=int, default=300,
+                        help="Number of episodes for dataset generation (more diversity = better meta-RL)")
+    parser.add_argument("--max_samples", type=int, default=3000,
                         help="Maximum training samples")
     parser.add_argument("--num_generations", type=int, default=4,
-                        help="Number of completions per prompt for GRPO")
-    parser.add_argument("--learning_rate", type=float, default=5e-5,
-                        help="Learning rate")
-    parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_trained",
-                        help="Output directory for model and logs")
+                        help="Completions per prompt for GRPO (higher = lower variance, more compute)")
+    parser.add_argument("--learning_rate", type=float, default=5e-5)
+    parser.add_argument("--beta", type=float, default=0.1,
+                        help="KL penalty (raise to 0.2 if training is unstable)")
+    parser.add_argument("--lora_rank", type=int, default=8,
+                        help="LoRA rank (8 = more capacity than original 4 for meta-RL)")
+    parser.add_argument("--hint_fraction", type=float, default=0.15,
+                        help="Fraction of dataset with profile hint visible (curriculum warmup)")
+    parser.add_argument("--profile_mode", type=str, default="continuous",
+                        choices=["continuous", "discrete"],
+                        help="continuous = sampled per-episode (meta-RL); discrete = 3 hardcoded profiles")
+    parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_meta_trained")
     parser.add_argument("--use_simple_reward", action="store_true",
-                        help="Use simplified reward (no env replay, faster)")
-    parser.add_argument("--report_to", type=str, default="none",
-                        help="Logging: 'wandb', 'trackio', or 'none'")
+                        help="Use heuristic reward instead of env-replay (smoke testing)")
+    parser.add_argument("--report_to", type=str, default="none")
     args = parser.parse_args()

     # ---------------------------------------------------------------
     # 1. Generate dataset
     # ---------------------------------------------------------------
     print("=" * 60)
-    print("Step 1: Generating training dataset...")
+    print("Step 1: Generating training dataset (continuous profiles)")
     print("=" * 60)

     from dataset import generate_dataset
@@ -56,15 +66,18 @@ def main():
         num_episodes=args.num_episodes,
         strategy="mixed",
         max_samples=args.max_samples,
+        profile_mode=args.profile_mode,
+        hint_fraction=args.hint_fraction,
     )

-    # Include replay metadata so env_reward can reconstruct the exact state
+    # Replay metadata so env_reward + belief_accuracy can reconstruct state
     dataset = Dataset.from_list([
         {
             "prompt": sample["prompt"],
             "seed": sample["seed"],
             "step_index": sample["step_index"],
             "action_history": sample["action_history"],
+            "profile_mode": sample["profile_mode"],
         }
         for sample in raw_samples
     ])
@@ -74,13 +87,12 @@ def main():
     # 2. Load model with Unsloth
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 2: Loading model...")
+    print(f"Step 2: Loading model {args.model_name}")
     print("=" * 60)

     from unsloth import FastLanguageModel

-    max_seq_length = 768
-    lora_rank = 4
+    max_seq_length = 1024  # bumped from 768 to fit longer prompts with history

     model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=args.model_name,
@@ -90,51 +102,53 @@ def main():

     model = FastLanguageModel.get_peft_model(
         model,
-        r=lora_rank,
+        r=args.lora_rank,
         target_modules=[
             "q_proj", "k_proj", "v_proj", "o_proj",
             "gate_proj", "up_proj", "down_proj",
         ],
-        lora_alpha=lora_rank * 2,
+        lora_alpha=args.lora_rank * 2,
         use_gradient_checkpointing="unsloth",
         random_state=3407,
     )

-    print(f"Model loaded: {args.model_name}")
+    print(f"LoRA rank: {args.lora_rank}, alpha: {args.lora_rank * 2}")

     # ---------------------------------------------------------------
-    # 3. Set up reward functions
+    # 3. Reward functions (4-layer stack including belief_accuracy)
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 3: Setting up reward functions...")
+    print("Step 3: Setting up reward functions")
     print("=" * 60)

-    from reward_functions import format_valid, action_legal, env_reward, env_reward_simple
+    from reward_functions import (
+        format_valid, action_legal, env_reward, env_reward_simple, belief_accuracy
+    )

     if args.use_simple_reward:
-        reward_funcs = [format_valid, action_legal, env_reward_simple]
-        print("Using: format_valid + action_legal + env_reward_simple (fast mode)")
+        reward_funcs = [format_valid, action_legal, env_reward_simple, belief_accuracy]
+        print("Using: format_valid + action_legal + env_reward_simple + belief_accuracy")
     else:
-        reward_funcs = [format_valid, action_legal, env_reward]
-        print("Using: format_valid + action_legal + env_reward (full replay)")
+        reward_funcs = [format_valid, action_legal, env_reward, belief_accuracy]
+        print("Using: format_valid + action_legal + env_reward + belief_accuracy")

     # ---------------------------------------------------------------
-    # 4. Configure GRPO trainer
+    # 4. GRPO trainer config
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 4: Configuring GRPO trainer...")
+    print("Step 4: Configuring GRPO trainer")
     print("=" * 60)

     from trl import GRPOConfig, GRPOTrainer

-    max_prompt_length = 400
-    max_completion_length = 16  # Action names are 3-15 chars; cap prevents verbose drift
+    max_prompt_length = 600     # bumped from 400 for longer prompts (history + hint)
+    max_completion_length = 20  # bumped from 16 for "ACTION_NAME D D D" format

     training_args = GRPOConfig(
         temperature=1.0,
         learning_rate=args.learning_rate,
-        beta=0.1,           # KL penalty — higher = more conservative, prevents policy drift
-        max_grad_norm=0.5,  # Gradient clipping prevents large destabilizing updates
+        beta=args.beta,
+        max_grad_norm=0.5,
         weight_decay=0.001,
         warmup_ratio=0.1,
         lr_scheduler_type="linear",
@@ -146,20 +160,21 @@ def main():
         max_prompt_length=max_prompt_length,
         max_completion_length=max_completion_length,
         max_steps=args.max_steps,
-        save_steps=100,
+        save_steps=250,  # checkpoint every 250 (was 100)
         report_to=args.report_to,
         output_dir=args.output_dir,
     )

-    print(f"Max steps: {args.max_steps}")
-    print(f"Num generations: {args.num_generations}")
-    print(f"Learning rate: {args.learning_rate}")
+    print(f"max_steps={args.max_steps}, num_generations={args.num_generations}, "
+          f"lr={args.learning_rate}, beta={args.beta}")
+    print(f"max_prompt_length={max_prompt_length}, max_completion_length={max_completion_length}")
+    print(f"hint_fraction={args.hint_fraction} (curriculum warmup)")

     # ---------------------------------------------------------------
     # 5. Train
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 5: Starting GRPO training...")
+    print("Step 5: Starting GRPO training")
     print("=" * 60)

     trainer = GRPOTrainer(
@@ -173,31 +188,26 @@ def main():
     trainer.train()

     # ---------------------------------------------------------------
-    # 6. Save model
+    # 6. Save merged model
     # ---------------------------------------------------------------
     print("\n" + "=" * 60)
-    print("Step 6: Saving model...")
+    print("Step 6: Saving model")
     print("=" * 60)

-    # Save merged model (correct way — don't upcast 4-bit naively)
     model.save_pretrained_merged(
         args.output_dir,
         tokenizer,
         save_method="merged_16bit",
     )
-    print(f"Model saved to: {args.output_dir}")

-    # Also save training config for reference
     config_path = os.path.join(args.output_dir, "training_config.json")
     with open(config_path, "w") as f:
         json.dump(vars(args), f, indent=2)
-    print(f"Training config saved to: {config_path}")

-    print("\n" + "=" * 60)
-    print("Training complete!")
-    print(f"Model: {args.output_dir}")
-    print("Next: run inference_eval.py to compare baseline vs trained")
-    print("=" * 60)
+    print(f"Model saved to: {args.output_dir}")
+    print(f"Training config saved to: {config_path}")
+    print("\nNext: run inference_eval.py to compare baseline vs trained")
+    print("      python training/inference_eval.py --model_path " + args.output_dir)


 if __name__ == "__main__":
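Typical invocations under the new defaults (flags exactly as declared in the argparse block above; the output path in the second command is illustrative):

    # Full meta-RL run: continuous profiles, env replay, belief reward
    python training/train.py --max_steps 1500

    # Legacy comparison against the 3 hardcoded profiles
    python training/train.py --profile_mode discrete --output_dir outputs/rhythmenv_discrete

    # Fast smoke run: heuristic reward, no env replay
    python training/train.py --max_steps 50 --use_simple_reward

On how the four layers combine: TRL's GRPOTrainer evaluates every function in reward_funcs on each completion and sums the results (equally weighted unless reward_weights is set in GRPOConfig), so belief_accuracy's bounded ±0.5 band nudges the policy without being able to dominate the env reward.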