# RhythmEnv — Entity Definitions

## Episode Structure

```
1 episode = 1 week (7 days)
1 step    = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots   = 1 day
28 steps  = 1 full week
```

---

## Observable State

What the agent sees in every observation. No hidden information here.

| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0=Morning, 1=Afternoon, 2=Evening, 3=Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | — | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | — | Reward received this step |
| `done` | bool | — | True on the final step |
| `reward_breakdown` | dict | — | Per-meter deltas; `final_score` when done |

---

## Actions

10 actions, always legal regardless of state.

| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), −Vitality, −Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), −Cognition |

### Action Effect Matrix

Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers.

| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |

---

## Hidden Personality Profiles

The person's identity. **Hidden from the agent.** Controls both reward weights and how actions affect meters. Agent must infer the active profile from reward patterns across episodes.

### Profile 1 — `introvert_morning`

**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

- Social vitality drain ×3.0 — socialising is exhausting, not neutral
- Morning (slot 0): cognition and progress gains ×2.0 — peak productivity window
- Solo time (me_time, meditate): serenity +0.10 bonus — recharges alone
- Binge watch triggers shame spiral: serenity −0.15, cognition −0.06
- Connection passive decay: −0.01/step

**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.
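
To make the effect of the Profile 1 reward weights concrete, the sketch below combines them with two rows of the Action Effect Matrix, using the per-step formula from the Reward Computation section (weighted sum of deltas × 15.0). It is only an illustration: it deliberately ignores time-of-day multipliers, hidden modifiers, the vitality factor, passive decay, and random events, so its numbers will not match rewards returned by the environment. The names `BASE_DELTAS`, `INTROVERT_MORNING_WEIGHTS`, and `base_reward` are hypothetical and not taken from the RhythmEnv code.

```python
# Illustrative only: base deltas x Profile 1 weights, with no modifiers applied.
REWARD_SCALE = 15.0  # scale factor stated in the Reward Computation section

# Reward weights listed for introvert_morning.
INTROVERT_MORNING_WEIGHTS = {
    "vitality": 0.05,
    "cognition": 0.05,
    "progress": 0.20,
    "serenity": 0.60,
    "connection": 0.10,
}

# Two rows copied from the Action Effect Matrix.
BASE_DELTAS = {
    "deep_work": {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
                  "serenity": -0.05, "connection": 0.00},
    "meditate": {"vitality": 0.03, "cognition": 0.08, "progress": 0.00,
                 "serenity": 0.15, "connection": 0.00},
}


def base_reward(action: str, weights: dict) -> float:
    """Weighted sum of an action's base deltas, scaled by REWARD_SCALE."""
    deltas = BASE_DELTAS[action]
    return REWARD_SCALE * sum(deltas[meter] * weights[meter] for meter in weights)


if __name__ == "__main__":
    for action in BASE_DELTAS:
        print(action, round(base_reward(action, INTROVERT_MORNING_WEIGHTS), 4))
```

Under these simplifying assumptions, `meditate` scores well above `deep_work` for this profile, because serenity carries 60% of the weight and deep work drains it slightly.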

---

### Profile 2 — `extrovert_night_owl`

**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%

Hidden modifiers:

- Social vitality drain ×0.2 — socialising energises, barely drains
- Morning (slot 0): cognition and progress gains ×0.4 penalty — groggy zone
- Evening/Night (slots 2–3): cognition and progress gains ×1.8 — peak zone
- Social actions: connection ×2.0 (double connection gain)
- Social actions: serenity +0.06 bonus — people lift mood
- Connection passive decay: −0.01/step

**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; deep work in evening.

---

### Profile 3 — `workaholic_stoic`

**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery — energised by output
- Productive work: serenity +0.10 bonus — meaning comes from progress
- Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 — idle guilt
- Extra vitality passive decay: −0.04/step — burnout risk
- Random event negative impact ×0.5 — stoic resilience
- Connection passive decay: −0.02/step — faster relational drift

**Agent discovers:** Keep working; rest only when vitality is critical; neglect at cost of connection.

---

## Time-of-Day Multipliers

Applied to all non-sleep actions based on current slot.

| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 — Morning | ×1.2 | ×0.8 |
| 1 — Afternoon | ×1.0 | ×1.0 |
| 2 — Evening | ×0.8 | ×1.1 |
| 3 — Night | ×0.6 | ×1.3 |

These are global. Profile-specific time bonuses (HV1) layer on top.

---

## Passive Decays (every step, before action effects)

| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces above) |
| workaholic_stoic | Vitality | −0.04/step |

---

## Random Events

Roll probability: 8% per step.

| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |

Negative effects are reduced by `event_impact_multiplier` per profile (workaholic_stoic = 0.5; others = 1.0 or 0.8).

---

## Reward Computation

### Per-step reward

```
reward = sum(meter_delta × profile_weight for each meter) × 15.0
```

Profile reward weights are **hidden**. Same action, different profile → very different reward.

Example — DEEP_WORK, step 1, same initial state:

```
workaholic_stoic:     +1.57  (progress weight = 70%)
introvert_morning:    +0.32  (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl:  −0.39  (connection weight = 75%; deep work gives 0 connection)
```

### Modifiers applied during step (in order)

1. Roll and apply random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply global vitality factor (`0.5 + 0.5 × vitality`) — low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 → −0.30

### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)

Score in [0.0, 1.0]:

```
score = 0.15 × crash_free_ratio   (1 − crash_count / total_possible_crashes)
      + 0.20 × progress           (final progress meter value)
      + 0.10 × connection         (final connection meter value)
      + 0.25 × adaptation_score   (late-half mean per-step reward minus early-half mean —
                                   gated by absolute late-half quality so a
                                   "terrible-then-mediocre" exploit cannot win)
      + 0.10 × efficiency_score   (avg step reward normalised to [0, 1])
      + 0.20 × belief_accuracy    (1 − MAE between agent's last-emitted belief vector
                                   and the true profile vector; 0 if the agent never
                                   emitted a belief — heuristic / random baselines)
```

Two meta-RL signals: `adaptation_score` is implicit (rewards getting better over time, since per-step rewards are profile-weighted), and `belief_accuracy` is explicit (rewards INFERRING the profile correctly). Without the explicit term, agents that play heuristic-style "keep meters healthy" score the same as agents that actually do inference, since the other components don't differentiate inference from reflex.

To emit a belief, the agent calls `env.record_belief([s, m, w])` once per step (typically right after parsing its own completion). The grader uses the LAST recorded belief.

---

## Internal Tracking Variables

Not in the observation. Used by the environment to compute rewards and grade.

| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Steps where any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| `_step_rewards` | Per-step reward list for adaptation_score in the grader |
| `_timestep` | Current step index (0–27) |
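
As a worked illustration of the final-grade formula in the Reward Computation section, the sketch below combines the six components with their stated weights and computes `belief_accuracy` as 1 − MAE, returning 0 when no belief was recorded. This is a minimal sketch rather than the grader itself: the function names, the `GRADE_WEIGHTS` dict, and the assumption that the belief and true profile vectors are equal-length sequences of floats are all hypothetical. The gating of `adaptation_score` and the normalisation of `efficiency_score` are not reproduced here; both are passed in as precomputed values.

```python
from typing import Mapping, Optional, Sequence

# Component weights copied from the final-grade formula; they sum to 1.0.
GRADE_WEIGHTS = {
    "crash_free_ratio": 0.15,
    "progress": 0.20,
    "connection": 0.10,
    "adaptation_score": 0.25,
    "efficiency_score": 0.10,
    "belief_accuracy": 0.20,
}


def belief_accuracy(belief: Optional[Sequence[float]],
                    truth: Sequence[float]) -> float:
    """1 - mean absolute error; 0.0 if the agent never emitted a belief."""
    if belief is None:
        return 0.0
    if len(belief) != len(truth):
        raise ValueError("belief and truth must have the same length")
    mae = sum(abs(b - t) for b, t in zip(belief, truth)) / len(truth)
    return 1.0 - mae


def final_score(components: Mapping[str, float]) -> float:
    """Weighted sum of the six grade components (each assumed to be in [0, 1])."""
    return sum(weight * components[name] for name, weight in GRADE_WEIGHTS.items())


if __name__ == "__main__":
    # Hypothetical end-of-episode values, chosen only for illustration.
    components = {
        "crash_free_ratio": 1.0,
        "progress": 0.6,
        "connection": 0.4,
        "adaptation_score": 0.5,
        "efficiency_score": 0.5,
        "belief_accuracy": belief_accuracy([0.2, 0.1, 0.7], [0.0, 0.0, 1.0]),
    }
    print(round(final_score(components), 3))  # prints 0.645
```

With these example inputs, an agent that never records a belief would lose the full 0.20 × `belief_accuracy` term, which is the gap the explicit signal is meant to create between inference and reflex play.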