# RhythmEnv: Entity Definitions

## Episode Structure

```
1 episode = 1 week (7 days)
1 step    = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots   = 1 day
28 steps  = 1 full week
```

---

## Observable State

What the agent sees in every observation. No hidden information here.

| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0 = Morning, 1 = Afternoon, 2 = Evening, 3 = Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | n/a | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | n/a | Reward received this step |
| `done` | bool | n/a | True on the final step |
| `reward_breakdown` | dict | n/a | Per-meter deltas; `final_score` when done |
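
The three index fields are related by integer arithmetic. The following is a minimal sketch, assuming `day` and `slot` are the quotient and remainder of `timestep` divided by 4 (consistent with step 0 being Monday Morning and 4 slots per day). The helper names `decode_timestep` and `describe` are hypothetical and not part of the environment.

```python
from typing import Tuple

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
SLOTS = ["Morning", "Afternoon", "Evening", "Night"]
SLOTS_PER_DAY = 4
TOTAL_STEPS = 28  # 7 days x 4 slots


def decode_timestep(timestep: int) -> Tuple[int, int]:
    """Return (day, slot) for a timestep in 0..27 (assumed mapping)."""
    if not 0 <= timestep < TOTAL_STEPS:
        raise ValueError(f"timestep must be in 0..{TOTAL_STEPS - 1}, got {timestep}")
    return divmod(timestep, SLOTS_PER_DAY)


def describe(timestep: int) -> str:
    """Human-readable label such as 'Tuesday Afternoon'."""
    day, slot = decode_timestep(timestep)
    return f"{DAYS[day]} {SLOTS[slot]}"


print(describe(0))   # Monday Morning
print(describe(27))  # Sunday Night
```
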

---

## Actions

10 actions, always legal regardless of state.

| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), −Vitality, −Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), −Cognition |

### Action Effect Matrix

Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers.

| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |
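
For reference, the matrix can be written as a lookup table. This is only a sketch of the base step: the values are copied from the matrix above, but `apply_base_effects` is a hypothetical helper, and it deliberately ignores every modifier (profile, time of day, repetition, vitality factor, decay) that the environment applies on top.

```python
from typing import Dict, Tuple

METERS = ("vitality", "cognition", "progress", "serenity", "connection")

# Base deltas in METERS order, copied from the Action Effect Matrix.
ACTION_EFFECTS: Dict[str, Tuple[float, ...]] = {
    "deep_work":   (-0.12, -0.10, 0.18, -0.05, 0.00),
    "admin_work":  (-0.06, -0.05, 0.08, -0.03, 0.00),
    "learn":       (-0.08, -0.08, 0.12, 0.02, 0.00),
    "sleep":       (0.20, 0.10, 0.00, 0.05, 0.00),
    "exercise":    (0.12, 0.05, 0.00, 0.08, 0.00),
    "meditate":    (0.03, 0.08, 0.00, 0.15, 0.00),
    "family_time": (-0.04, -0.02, 0.00, 0.06, 0.15),
    "socialize":   (-0.06, -0.03, 0.00, 0.04, 0.12),
    "me_time":     (0.05, 0.03, 0.00, 0.10, -0.02),
    "binge_watch": (0.02, -0.05, -0.02, 0.06, -0.03),
}


def apply_base_effects(state: Dict[str, float], action: str) -> Dict[str, float]:
    """Add the base deltas for `action` and clamp every meter to [0, 1]."""
    deltas = ACTION_EFFECTS[action]
    return {m: min(1.0, max(0.0, state[m] + d)) for m, d in zip(METERS, deltas)}


# Illustrative starting state; the environment's real initial values are not given here.
state = {m: 0.5 for m in METERS}
after = apply_base_effects(state, "deep_work")
print({m: round(v, 2) for m, v in after.items()})
```
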

---

## Hidden Personality Profiles

The person's identity. **Hidden from the agent.** The profile controls both the reward weights and how
actions affect meters. The agent must infer the active profile from reward patterns across episodes.

### Profile 1: `introvert_morning`

**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

- Social vitality drain ×3.0 (socialising is exhausting, not neutral)
- Morning (slot 0): cognition and progress gains ×2.0 (peak productivity window)
- Solo time (me_time, meditate): serenity +0.10 bonus (recharges alone)
- Binge watch triggers shame spiral: serenity −0.15, cognition −0.06
- Connection passive decay: −0.01/step

**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.

---

### Profile 2: `extrovert_night_owl`

**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%

Hidden modifiers:

- Social vitality drain ×0.2 (socialising energises, barely drains)
- Morning (slot 0): cognition and progress gains ×0.4 (penalty; groggy zone)
- Evening/Night (slots 2–3): cognition and progress gains ×1.8 (peak zone)
- Social actions: connection ×2.0 (double connection gain)
- Social actions: serenity +0.06 bonus (people lift mood)
- Connection passive decay: −0.01/step

**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; do deep work in the evening.

---

### Profile 3: `workaholic_stoic`

**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery (energised by output)
- Productive work: serenity +0.10 bonus (meaning comes from progress)
- Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 (idle guilt)
- Extra vitality passive decay: −0.04/step (burnout risk)
- Random event negative impact ×0.5 (stoic resilience)
- Connection passive decay: −0.02/step (faster relational drift)

**Agent discovers:** Keep working; rest only when vitality is critical; neglecting relationships costs connection.
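
To make the profile differences concrete, here is a small worked example for the social multipliers only. It is a sketch under stated assumptions: the multipliers come from the profile lists above, a profile that lists no multiplier is assumed to use 1.0, and `social_deltas` is a hypothetical helper rather than the environment's own function.

```python
from typing import Dict, Tuple

# Multipliers listed in the profile descriptions.
SOCIAL_VITALITY_DRAIN: Dict[str, float] = {
    "introvert_morning": 3.0,
    "extrovert_night_owl": 0.2,
}
SOCIAL_CONNECTION_GAIN: Dict[str, float] = {
    "extrovert_night_owl": 2.0,
}


def social_deltas(profile: str, vitality_delta: float, connection_delta: float) -> Tuple[float, float]:
    """Scale the vitality drain and connection gain of a social action.

    Profiles without a listed multiplier fall back to 1.0 (assumption).
    """
    if vitality_delta < 0:
        vitality_delta *= SOCIAL_VITALITY_DRAIN.get(profile, 1.0)
    if connection_delta > 0:
        connection_delta *= SOCIAL_CONNECTION_GAIN.get(profile, 1.0)
    return vitality_delta, connection_delta


# Base socialize deltas from the Action Effect Matrix: vitality -0.06, connection +0.12.
for profile in ("introvert_morning", "extrovert_night_owl", "workaholic_stoic"):
    v, c = social_deltas(profile, -0.06, 0.12)
    print(f"{profile}: vitality {v:+.3f}, connection {c:+.2f}")
```

Under these assumptions the same `socialize` action costs the introvert 0.18 vitality but the extrovert only 0.012, while the extrovert's connection gain doubles to 0.24.
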

---

## Time-of-Day Multipliers

Applied to all non-sleep actions based on current slot.

| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 (Morning) | ×1.2 | ×0.8 |
| 1 (Afternoon) | ×1.0 | ×1.0 |
| 2 (Evening) | ×0.8 | ×1.1 |
| 3 (Night) | ×0.6 | ×1.3 |

These are global. Profile-specific time bonuses (HV1) layer on top.
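
The global multipliers can be sketched as follows. Two points are assumptions rather than documented behaviour: the cognition multiplier is applied only to positive cognition deltas ("gain"), and the vitality multiplier only to negative vitality deltas ("drain"). `apply_time_of_day` is a hypothetical name.

```python
from typing import Tuple

COGNITION_GAIN_MULT = {0: 1.2, 1: 1.0, 2: 0.8, 3: 0.6}
VITALITY_DRAIN_MULT = {0: 0.8, 1: 1.0, 2: 1.1, 3: 1.3}


def apply_time_of_day(action: str, slot: int,
                      vitality_delta: float, cognition_delta: float) -> Tuple[float, float]:
    """Scale cognition gains and vitality drains by the slot multipliers.

    Sleep is exempt, as stated above. Treating 'gain' as delta > 0 and
    'drain' as delta < 0 is an assumption of this sketch.
    """
    if action == "sleep":
        return vitality_delta, cognition_delta
    if cognition_delta > 0:
        cognition_delta *= COGNITION_GAIN_MULT[slot]
    if vitality_delta < 0:
        vitality_delta *= VITALITY_DRAIN_MULT[slot]
    return vitality_delta, cognition_delta


# deep_work at night: the vitality drain grows, the cognition loss is untouched.
print(apply_time_of_day("deep_work", 3, -0.12, -0.10))
# meditate in the morning: the cognition gain grows, the vitality gain is untouched.
print(apply_time_of_day("meditate", 0, 0.03, 0.08))
```
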

---

## Passive Decays (every step, before action effects)

| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces above) |
| workaholic_stoic | Vitality | −0.04/step |
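
A short sketch of the decay table, showing how much drift accumulates over a 28-step week if no action ever offsets it. The helper names and the starting value of 0.5 are illustrative assumptions; in the real environment action effects are interleaved with the decay.

```python
from typing import Dict


def passive_decay(profile: str) -> Dict[str, float]:
    """Per-step decay for each affected meter, taken from the table above."""
    decay = {"connection": -0.01}
    if profile == "workaholic_stoic":
        decay["connection"] = -0.02  # replaces the global value
        decay["vitality"] = -0.04
    return decay


def apply_decay(state: Dict[str, float], profile: str) -> Dict[str, float]:
    """Apply one step of decay and clamp to [0, 1]."""
    out = dict(state)
    for meter, delta in passive_decay(profile).items():
        out[meter] = min(1.0, max(0.0, out[meter] + delta))
    return out


def drift_over_week(profile: str, start: float = 0.5) -> Dict[str, float]:
    """Decay-only trajectory over 28 steps (no actions, illustrative only)."""
    state = {"vitality": start, "connection": start}
    for _ in range(28):
        state = apply_decay(state, profile)
    return state


print(drift_over_week("introvert_morning"))
print(drift_over_week("workaholic_stoic"))
```
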

---

## Random Events

Roll probability: 8% per step.

| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |

Negative effects are scaled by a per-profile `event_impact_multiplier`
(workaholic_stoic = 0.5; others = 1.0 or 0.8).
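
The event roll and the impact scaling might look like the sketch below. The event table and the 8% probability are from this section; how the environment picks which event fires is not stated, so a uniform choice is assumed here, and `roll_event` and `scaled_event_effects` are hypothetical names.

```python
import random
from typing import Dict, Optional

EVENT_PROBABILITY = 0.08

EVENTS: Dict[str, Dict[str, float]] = {
    "prod_crash": {"vitality": -0.08, "cognition": -0.10, "progress": -0.10,
                   "serenity": -0.15, "connection": 0.00},
    "family_emergency": {"vitality": -0.05, "cognition": -0.08, "progress": 0.00,
                         "serenity": -0.12, "connection": -0.10},
    "illness": {"vitality": -0.20, "cognition": -0.10, "progress": 0.00,
                "serenity": -0.05, "connection": 0.00},
    "good_news": {"vitality": 0.05, "cognition": 0.03, "progress": 0.00,
                  "serenity": 0.10, "connection": 0.05},
}


def roll_event(rng: random.Random) -> Optional[str]:
    """Return an event name with probability 0.08, otherwise None.

    Choosing uniformly among the four events is an assumption.
    """
    if rng.random() < EVENT_PROBABILITY:
        return rng.choice(sorted(EVENTS))
    return None


def scaled_event_effects(event: str, impact_multiplier: float) -> Dict[str, float]:
    """Scale only the negative deltas; positive deltas pass through unchanged."""
    return {m: d * impact_multiplier if d < 0 else d for m, d in EVENTS[event].items()}


rng = random.Random(42)  # seeded for reproducibility, like the environment's `_rng`
print([roll_event(rng) for _ in range(28)])
print(scaled_event_effects("illness", 0.5))
```
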

---

## Reward Computation

### Per-step reward

```
reward = sum(meter_delta × profile_weight for each meter) × 15.0
```

Profile reward weights are **hidden**. Same action, different profile → very different reward.

Example: DEEP_WORK, step 1, same initial state:

```
workaholic_stoic:     +1.57  (progress weight = 70%)
introvert_morning:    +0.32  (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl:  −0.39  (connection weight = 75%; deep work gives 0 connection)
```
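
The weighted sum itself is straightforward to sketch. The weights below are the ones listed for each profile; the deltas passed in are made-up illustrative values, not the environment's actual post-modifier deltas, so the printed numbers are not expected to match the example above. `step_reward` is a hypothetical name.

```python
from typing import Dict

REWARD_SCALE = 15.0

REWARD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "introvert_morning": {"serenity": 0.60, "progress": 0.20, "connection": 0.10,
                          "vitality": 0.05, "cognition": 0.05},
    "extrovert_night_owl": {"connection": 0.75, "progress": 0.10, "vitality": 0.05,
                            "cognition": 0.05, "serenity": 0.05},
    "workaholic_stoic": {"progress": 0.70, "serenity": 0.10, "connection": 0.10,
                         "vitality": 0.05, "cognition": 0.05},
}


def step_reward(profile: str, deltas: Dict[str, float]) -> float:
    """Weighted sum of meter deltas times REWARD_SCALE (before any floor penalty)."""
    weights = REWARD_WEIGHTS[profile]
    return REWARD_SCALE * sum(w * deltas.get(meter, 0.0) for meter, w in weights.items())


# Hypothetical deltas for one step: progress up, serenity slightly down.
deltas = {"progress": 0.10, "serenity": -0.05}
for profile in REWARD_WEIGHTS:
    print(f"{profile}: {step_reward(profile, deltas):+.4f}")
```

With these illustrative deltas the same step is clearly positive for `workaholic_stoic` and slightly negative for `introvert_morning`, which is the kind of contrast the agent has to use to infer the profile.
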

### Modifiers applied during step (in order)

1. Roll and apply random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply global vitality factor (`0.5 + 0.5 × vitality`): low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 → −0.30
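
Steps 6 and 10 of the list above can be sketched in isolation. This is not the environment's code: it assumes the vitality factor uses the vitality value from before the action, that it scales only positive deltas, and that the floor penalty is subtracted once per step no matter how many meters are below the floor. All function names are hypothetical.

```python
from typing import Dict

CRITICAL_FLOOR = 0.10
FLOOR_PENALTY = 0.30


def vitality_factor(vitality: float) -> float:
    """Global factor 0.5 + 0.5 * vitality, ranging from 0.5 to 1.0."""
    return 0.5 + 0.5 * vitality


def scale_positive_effects(deltas: Dict[str, float], vitality: float) -> Dict[str, float]:
    """Shrink positive deltas when vitality is low; leave negative deltas alone."""
    factor = vitality_factor(vitality)
    return {m: d * factor if d > 0 else d for m, d in deltas.items()}


def apply_floor_penalty(reward: float, meters: Dict[str, float]) -> float:
    """Subtract the penalty once if any meter is below the critical floor (assumption)."""
    if any(value < CRITICAL_FLOOR for value in meters.values()):
        return reward - FLOOR_PENALTY
    return reward


# deep_work base deltas at vitality 0.2: progress gain shrinks, drains do not.
print(scale_positive_effects({"vitality": -0.12, "progress": 0.18}, vitality=0.2))
print(apply_floor_penalty(1.0, {"vitality": 0.05, "serenity": 0.50}))
```
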

### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)

Score in [0.0, 1.0]:

```
score = 0.15 × crash_free_ratio   (1 − crash_count / total_possible_crashes)
      + 0.20 × progress           (final progress meter value)
      + 0.10 × connection         (final connection meter value)
      + 0.25 × adaptation_score   (late-half mean per-step reward minus
                                   early-half mean, gated by absolute
                                   late-half quality so a
                                   "terrible-then-mediocre" exploit cannot win)
      + 0.10 × efficiency_score   (avg step reward normalised to [0, 1])
      + 0.20 × belief_accuracy    (1 − MAE between agent's last-emitted
                                   belief vector and the true profile
                                   vector; 0 if the agent never emitted a
                                   belief, e.g. heuristic / random baselines)
```

There are two meta-RL signals. `adaptation_score` is implicit: it rewards getting better
over time, since per-step rewards are profile-weighted. `belief_accuracy`
is explicit: it rewards inferring the profile correctly. Without the explicit
term, an agent that follows a "keep meters healthy" heuristic would score the same
as an agent that actually performs inference, since the other components do not
distinguish inference from reflex.

To emit a belief, the agent calls `env.record_belief([s, m, w])` once per
step (typically right after parsing its own completion). The grader uses the
last recorded belief.
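
The weighted sum and the belief term can be sketched as below. The component weights and the `1 − MAE` definition come from the formula above. The layout of the belief and true-profile vectors is not specified here, so the sketch only assumes two equal-length sequences of floats. `adaptation_score` and `efficiency_score` are taken as precomputed inputs because their exact normalisation is not given. All function names are hypothetical.

```python
from typing import Dict, Optional, Sequence

SCORE_WEIGHTS: Dict[str, float] = {
    "crash_free_ratio": 0.15,
    "progress": 0.20,
    "connection": 0.10,
    "adaptation_score": 0.25,
    "efficiency_score": 0.10,
    "belief_accuracy": 0.20,
}


def crash_free_ratio(crash_count: int, total_possible_crashes: int) -> float:
    """1 - crash_count / total_possible_crashes."""
    return 1.0 - crash_count / total_possible_crashes


def belief_accuracy(belief: Optional[Sequence[float]], true_vector: Sequence[float]) -> float:
    """1 - mean absolute error; 0 when no belief was ever recorded."""
    if belief is None:
        return 0.0
    if len(belief) != len(true_vector):
        raise ValueError("belief and true profile vector must have the same length")
    mae = sum(abs(b - t) for b, t in zip(belief, true_vector)) / len(true_vector)
    return max(0.0, 1.0 - mae)


def final_score(components: Dict[str, float]) -> float:
    """Weighted sum of the six components, each expected in [0, 1]."""
    return sum(weight * components[name] for name, weight in SCORE_WEIGHTS.items())


# Hypothetical end-of-episode values, for illustration only.
components = {
    "crash_free_ratio": crash_free_ratio(crash_count=2, total_possible_crashes=28),
    "progress": 0.8,
    "connection": 0.4,
    "adaptation_score": 0.5,
    "efficiency_score": 0.6,
    "belief_accuracy": belief_accuracy([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]),
}
print(round(final_score(components), 4))
```

Because `belief_accuracy` carries 20% of the grade and is 0 when no belief is recorded, an agent that never calls `env.record_belief` is capped at 0.80 under this weighting.
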

---

## Internal Tracking Variables

Not in the observation. Used by the environment to compute rewards and grade.

| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Steps where any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| `_step_rewards` | Per-step reward list for adaptation_score in the grader |
| `_timestep` | Current step index (0–27) |