# RhythmEnv: Entity Definitions
## Episode Structure
```
1 episode = 1 week (7 days)
1 step = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots = 1 day
28 steps = 1 full week
```
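The mapping between a step index and its day and slot follows directly from the structure above. A minimal sketch, assuming steps are numbered from 0 as in the Observable State table; the helper name `decode_timestep` and the constant names are illustrative, not taken from the environment's code:

```python
SLOTS_PER_DAY = 4
STEPS_PER_EPISODE = 28  # 7 days x 4 slots

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
SLOT_NAMES = ("Morning", "Afternoon", "Evening", "Night")


def decode_timestep(timestep: int) -> tuple[int, int]:
    """Return (day, slot) for a step index in 0..27 (hypothetical helper)."""
    if not 0 <= timestep < STEPS_PER_EPISODE:
        raise ValueError(f"timestep must be in 0..27, got {timestep}")
    return divmod(timestep, SLOTS_PER_DAY)


day, slot = decode_timestep(13)
print(DAY_NAMES[day], SLOT_NAMES[slot])  # Thursday Afternoon
```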
---
## Observable State
What the agent sees in every observation. No hidden information here.
| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0=Morning, 1=Afternoon, 2=Evening, 3=Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | n/a | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | n/a | Reward received this step |
| `done` | bool | n/a | True on the final step |
| `reward_breakdown` | dict | n/a | Per-meter deltas; `final_score` when done |
---
## Actions
10 actions, always legal regardless of state.
| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), −Vitality, −Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), −Cognition |
### Action Effect Matrix
Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers.
| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |
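The matrix can be held as a plain lookup table. The sketch below assumes a tuple-per-action layout and a helper named `apply_base_effects`; both are illustrative choices, and only the name `ACTION_EFFECTS` and the numbers come from this document. It applies base deltas only, with none of the later modifiers, and clamps each meter to [0.0, 1.0]:

```python
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

# Base deltas per action, in METERS order (values copied from the table above).
ACTION_EFFECTS = {
    "deep_work":   (-0.12, -0.10, 0.18, -0.05, 0.00),
    "admin_work":  (-0.06, -0.05, 0.08, -0.03, 0.00),
    "learn":       (-0.08, -0.08, 0.12, 0.02, 0.00),
    "sleep":       (0.20, 0.10, 0.00, 0.05, 0.00),
    "exercise":    (0.12, 0.05, 0.00, 0.08, 0.00),
    "meditate":    (0.03, 0.08, 0.00, 0.15, 0.00),
    "family_time": (-0.04, -0.02, 0.00, 0.06, 0.15),
    "socialize":   (-0.06, -0.03, 0.00, 0.04, 0.12),
    "me_time":     (0.05, 0.03, 0.00, 0.10, -0.02),
    "binge_watch": (0.02, -0.05, -0.02, 0.06, -0.03),
}


def apply_base_effects(state: dict[str, float], action: str) -> dict[str, float]:
    """Add the base deltas for `action` and clamp every meter to [0, 1]."""
    deltas = ACTION_EFFECTS[action]
    return {m: min(1.0, max(0.0, state[m] + d)) for m, d in zip(METERS, deltas)}


start = {m: 0.5 for m in METERS}
print(apply_base_effects(start, "sleep"))
```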
---
## Hidden Personality Profiles
The person's identity. **Hidden from the agent.** Controls both reward weights and how
actions affect meters. Agent must infer the active profile from reward patterns across episodes.
### Profile 1: `introvert_morning`
**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%
Hidden modifiers:
- Social vitality drain ×3.0 (socialising is exhausting, not neutral)
- Morning (slot 0): cognition and progress gains ×2.0 (peak productivity window)
- Solo time (me_time, meditate): serenity +0.10 bonus (recharges alone)
- Binge watch triggers shame spiral: serenity −0.15, cognition −0.06
- Connection passive decay: −0.01/step
**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.
---
### Profile 2: `extrovert_night_owl`
**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%
Hidden modifiers:
- Social vitality drain ×0.2 (socialising energises, barely drains)
- Morning (slot 0): cognition and progress gains ×0.4 penalty (groggy zone)
- Evening/Night (slots 2–3): cognition and progress gains ×1.8 (peak zone)
- Social actions: connection ×2.0 (double connection gain)
- Social actions: serenity +0.06 bonus (people lift mood)
- Connection passive decay: −0.01/step
**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; deep work in evening.
---
### Profile 3: `workaholic_stoic`
**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%
Hidden modifiers:
- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery (energised by output)
- Productive work: serenity +0.10 bonus (meaning comes from progress)
- Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 (idle guilt)
- Extra vitality passive decay: −0.04/step (burnout risk)
- Random event negative impact ×0.5 (stoic resilience)
- Connection passive decay: −0.02/step (faster relational drift)
**Agent discovers:** Keep working; rest only when vitality is critical; neglecting relationships costs connection.
---
## Time-of-Day Multipliers
Applied to all non-sleep actions based on current slot.
| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 (Morning) | ×1.2 | ×0.8 |
| 1 (Afternoon) | ×1.0 | ×1.0 |
| 2 (Evening) | ×0.8 | ×1.1 |
| 3 (Night) | ×0.6 | ×1.3 |
These are global. Profile-specific time bonuses (HV1) layer on top.
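One way to read the table: the cognition multiplier scales only positive cognition deltas (gains), and the vitality multiplier scales only negative vitality deltas (drains). That sign-based reading is an assumption drawn from the column names, and `apply_time_of_day` is a hypothetical helper rather than the environment's own function:

```python
# slot -> (cognition gain multiplier, vitality drain multiplier)
TIME_MULTIPLIERS = {
    0: (1.2, 0.8),  # Morning
    1: (1.0, 1.0),  # Afternoon
    2: (0.8, 1.1),  # Evening
    3: (0.6, 1.3),  # Night
}


def apply_time_of_day(action: str, slot: int,
                      vitality_delta: float,
                      cognition_delta: float) -> tuple[float, float]:
    """Scale cognition gains and vitality drains by slot; sleep is exempt."""
    if action == "sleep":
        return vitality_delta, cognition_delta
    cognition_mult, vitality_mult = TIME_MULTIPLIERS[slot]
    if cognition_delta > 0:   # assumption: only gains are scaled
        cognition_delta *= cognition_mult
    if vitality_delta < 0:    # assumption: only drains are scaled
        vitality_delta *= vitality_mult
    return vitality_delta, cognition_delta


# deep_work at Night: the vitality drain grows, the cognition loss is untouched.
print(apply_time_of_day("deep_work", 3, -0.12, -0.10))
```

Under this reading, `meditate` at Night keeps its +0.03 vitality but its cognition gain shrinks from +0.08 to roughly +0.048.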
---
## Passive Decays (every step, before action effects)
| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces above) |
| workaholic_stoic | Vitality | −0.04/step |
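The decay table reduces to a small per-profile lookup. In this sketch the function name `passive_decay` and its return shape are assumptions made for illustration; the decay values are the ones listed above:

```python
STEPS_PER_EPISODE = 28


def passive_decay(profile_name: str) -> dict[str, float]:
    """Per-step passive deltas for a profile (hypothetical helper)."""
    decay = {"connection": -0.01}
    if profile_name == "workaholic_stoic":
        decay["connection"] = -0.02  # replaces the default rate
        decay["vitality"] = -0.04    # extra burnout decay
    return decay


for name in ("introvert_morning", "extrovert_night_owl", "workaholic_stoic"):
    per_step = passive_decay(name)
    weekly = {m: round(d * STEPS_PER_EPISODE, 2) for m, d in per_step.items()}
    print(name, weekly)
```

Summed over 28 steps, the default connection decay is 0.28 and the `workaholic_stoic` rate is 0.56, so a profile that never takes a social action loses a substantial part of the meter's range to decay alone.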
---
## Random Events
Roll probability: 8% per step.
| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |
Negative effects are reduced by `event_impact_multiplier` per profile
(workaholic_stoic = 0.5; others = 1.0 or 0.8).
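A sketch of the event roll and the per-profile scaling of negative effects. This document does not say how an event is chosen once the 8% roll succeeds, so the uniform choice below is an assumption, as are the helper names `roll_event` and `scaled_event_deltas`:

```python
import random
from typing import Optional

EVENT_PROBABILITY = 0.08

# Deltas in (vitality, cognition, progress, serenity, connection) order.
EVENTS = {
    "prod_crash":       (-0.08, -0.10, -0.10, -0.15, 0.00),
    "family_emergency": (-0.05, -0.08, 0.00, -0.12, -0.10),
    "illness":          (-0.20, -0.10, 0.00, -0.05, 0.00),
    "good_news":        (0.05, 0.03, 0.00, 0.10, 0.05),
}


def roll_event(rng: random.Random) -> Optional[str]:
    """Return an event name with probability 0.08, else None."""
    if rng.random() >= EVENT_PROBABILITY:
        return None
    return rng.choice(sorted(EVENTS))  # assumption: uniform over events


def scaled_event_deltas(event: str, multiplier: float) -> tuple[float, ...]:
    """Scale only the negative deltas by the profile's multiplier."""
    return tuple(d * multiplier if d < 0 else d for d in EVENTS[event])


rng = random.Random(0)
print([roll_event(rng) for _ in range(28)])
print(scaled_event_deltas("illness", 0.5))
```

At 8% per step, a 28-step episode sees 28 × 0.08 = 2.24 events on average.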
---
## Reward Computation
### Per-step reward
```
reward = sum(meter_delta × profile_weight for each meter) × 15.0
```
Profile reward weights are **hidden**. Same action, different profile → very different reward.
Example: DEEP_WORK, step 1, same initial state:
```
workaholic_stoic:    +1.57 (progress weight = 70%)
introvert_morning:   +0.32 (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: −0.39 (connection weight = 75%; deep work gives 0 connection)
```
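The weighted sum can be sketched as follows. The weights are the ones listed under each profile, and the dictionary layout and the name `step_reward` are illustrative assumptions. The sketch feeds in the raw base deltas for `deep_work`, so its outputs are not expected to match the figures in the example above, which presumably also reflect profile modifiers, time-of-day multipliers, the vitality factor and passive decay:

```python
REWARD_SCALE = 15.0
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

PROFILE_WEIGHTS = {
    "introvert_morning": {"serenity": 0.60, "progress": 0.20, "connection": 0.10,
                          "vitality": 0.05, "cognition": 0.05},
    "extrovert_night_owl": {"connection": 0.75, "progress": 0.10, "vitality": 0.05,
                            "cognition": 0.05, "serenity": 0.05},
    "workaholic_stoic": {"progress": 0.70, "serenity": 0.10, "connection": 0.10,
                         "vitality": 0.05, "cognition": 0.05},
}


def step_reward(deltas: dict[str, float], profile: str) -> float:
    """Weighted sum of meter deltas, scaled by REWARD_SCALE."""
    weights = PROFILE_WEIGHTS[profile]
    return REWARD_SCALE * sum(deltas[m] * weights[m] for m in METERS)


# Base deltas for deep_work, before any modifiers.
DEEP_WORK_BASE = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
                  "serenity": -0.05, "connection": 0.00}

for name in PROFILE_WEIGHTS:
    print(f"{name}: {step_reward(DEEP_WORK_BASE, name):+.4f}")
```

Even without modifiers the spread is wide: roughly +1.65 for `workaholic_stoic` against values near zero for the other two profiles.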
### Modifiers applied during step (in order)
1. Roll and apply random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply global vitality factor (`0.5 + 0.5 × vitality`): low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 → −0.30
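Steps 6, 8 and 10 can be sketched in isolation. Two points are assumptions here, since the list does not settle them: the vitality factor uses the vitality value from before the action, and the floor penalty is charged once per step however many meters are below 0.10. All three function names are hypothetical:

```python
def apply_vitality_factor(deltas: dict[str, float], vitality: float) -> dict[str, float]:
    """Step 6: scale positive deltas by 0.5 + 0.5 * vitality; leave losses as-is."""
    factor = 0.5 + 0.5 * vitality
    return {m: d * factor if d > 0 else d for m, d in deltas.items()}


def clamp_meters(state: dict[str, float]) -> dict[str, float]:
    """Step 8: keep every meter inside [0.0, 1.0]."""
    return {m: min(1.0, max(0.0, v)) for m, v in state.items()}


def floor_penalty(state: dict[str, float],
                  threshold: float = 0.10, penalty: float = 0.30) -> float:
    """Step 10: -0.30 if any meter is below 0.10 (assumed once per step)."""
    return -penalty if any(v < threshold for v in state.values()) else 0.0


# At vitality 0.2 the factor is 0.6, so a +0.18 progress gain becomes +0.108.
print(apply_vitality_factor({"progress": 0.18, "vitality": -0.12}, vitality=0.2))
print(clamp_meters({"vitality": -0.05, "serenity": 1.2}))
print(floor_penalty({"vitality": 0.05, "serenity": 0.6}))
```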
### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)
Score in [0.0, 1.0]:
```
score = 0.15 × crash_free_ratio   (1 − crash_count / total_possible_crashes)
      + 0.20 × progress           (final progress meter value)
      + 0.10 × connection         (final connection meter value)
      + 0.25 × adaptation_score   (late-half mean per-step reward minus
                                   early-half mean; gated by absolute
                                   late-half quality so a
                                   "terrible-then-mediocre" exploit cannot win)
      + 0.10 × efficiency_score   (avg step reward normalised to [0, 1])
      + 0.20 × belief_accuracy    (1 − MAE between agent's last-emitted
                                   belief vector and the true profile
                                   vector; 0 if the agent never emitted a
                                   belief, as with heuristic / random baselines)
```
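The final grade is a fixed-weight combination of six components. The sketch below assumes each component has already been computed and normalised to [0, 1]; the name `final_score` for the function and the `GRADE_WEIGHTS` table are illustrative, and how the environment derives `adaptation_score` and `efficiency_score` is not reproduced here:

```python
GRADE_WEIGHTS = {
    "crash_free_ratio": 0.15,
    "progress": 0.20,
    "connection": 0.10,
    "adaptation_score": 0.25,
    "efficiency_score": 0.10,
    "belief_accuracy": 0.20,
}


def final_score(components: dict[str, float]) -> float:
    """Weighted sum of the six grade components (each assumed to be in [0, 1])."""
    missing = GRADE_WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(w * components[name] for name, w in GRADE_WEIGHTS.items())


perfect_without_belief = {name: 1.0 for name in GRADE_WEIGHTS}
perfect_without_belief["belief_accuracy"] = 0.0
print(round(final_score(perfect_without_belief), 2))  # 0.8
```

Because the weights sum to 1.0 and `belief_accuracy` carries 0.20, an agent that never records a belief cannot exceed 0.80 under this formula.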
Two meta-RL signals: `adaptation_score` is implicit (rewards getting better
over time, since per-step rewards are profile-weighted), and `belief_accuracy`
is explicit (rewards INFERRING the profile correctly). Without the explicit
term, agents that play heuristic-style "keep meters healthy" score the same
as agents that actually do inference, since the other components don't
differentiate inference from reflex.
To emit a belief, the agent calls `env.record_belief([s, m, w])` once per
step (typically right after parsing its own completion). The grader uses the
LAST recorded belief.
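The `belief_accuracy` term is defined above as 1 − MAE between the last belief and the true profile vector. A minimal sketch, assuming both vectors have the same length with entries in [0, 1]; the clamp at 0 and the example vectors are assumptions, and the contents of the true profile vector are not specified in this document:

```python
from typing import Optional, Sequence


def belief_accuracy(belief: Optional[Sequence[float]],
                    true_vector: Sequence[float]) -> float:
    """Return 1 - mean absolute error, or 0.0 if no belief was recorded."""
    if belief is None:
        return 0.0
    if len(belief) != len(true_vector):
        raise ValueError("belief and true vector must have the same length")
    mae = sum(abs(b - t) for b, t in zip(belief, true_vector)) / len(true_vector)
    return max(0.0, 1.0 - mae)  # assumption: clamp so the term stays in [0, 1]


# Hypothetical three-element vectors, matching the [s, m, w] shape above.
print(belief_accuracy([0.5, 0.2, 0.3], [0.6, 0.1, 0.3]))
print(belief_accuracy(None, [0.6, 0.1, 0.3]))
```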
---
## Internal Tracking Variables
Not in the observation. Used by the environment to compute rewards and grade.
| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Steps where any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| `_step_rewards` | Per-step reward list for adaptation_score in the grader |
| `_timestep` | Current step index (0–27) |