# RhythmEnv — Entity Definitions

## Episode Structure

```
1 episode  = 1 week (7 days)
1 step     = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots    = 1 day
28 steps   = 1 full week
```
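
A minimal sketch of how a step index maps onto the week, assuming the `timestep` convention given in the observation table (0 = Monday Morning). The helper name `decode_timestep` and the constant names are illustrative and not part of the environment's API.

```python
from typing import Tuple

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")
SLOTS = ("Morning", "Afternoon", "Evening", "Night")
STEPS_PER_DAY = len(SLOTS)                    # 4 slots = 1 day
STEPS_PER_EPISODE = len(DAYS) * len(SLOTS)    # 28 steps = 1 full week


def decode_timestep(timestep: int) -> Tuple[int, int]:
    """Split a step index (0-27) into (day, slot) indices."""
    if not 0 <= timestep < STEPS_PER_EPISODE:
        raise ValueError(f"timestep must be in 0..{STEPS_PER_EPISODE - 1}")
    return divmod(timestep, STEPS_PER_DAY)


day, slot = decode_timestep(13)
print(DAYS[day], SLOTS[slot])  # Thursday Afternoon
```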

---

## Observable State

Everything the agent sees in each observation. None of these fields is hidden.

| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0=Morning, 1=Afternoon, 2=Evening, 3=Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | — | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | — | Reward received this step |
| `done` | bool | — | True on the final step |
| `reward_breakdown` | dict | — | Per-meter deltas; `final_score` when done |

---

## Actions

There are 10 actions. All of them are legal in every state.

| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), −Vitality, −Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), −Cognition |

### Action Effect Matrix

Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers. Action names here are the lowercase forms of those in the table above.

| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |
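
As a rough illustration, the matrix can be held as a plain mapping from action name to per-meter deltas. The name `ACTION_EFFECTS` appears in the step order later in this document, but the tuple layout, the `METERS` ordering, and the helper `apply_base_effect` are assumptions made for this sketch. It applies base deltas only, with no modifiers.

```python
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

# Base deltas copied from the matrix above; the container layout is assumed.
ACTION_EFFECTS = {
    "deep_work":   (-0.12, -0.10, +0.18, -0.05,  0.00),
    "admin_work":  (-0.06, -0.05, +0.08, -0.03,  0.00),
    "learn":       (-0.08, -0.08, +0.12, +0.02,  0.00),
    "sleep":       (+0.20, +0.10,  0.00, +0.05,  0.00),
    "exercise":    (+0.12, +0.05,  0.00, +0.08,  0.00),
    "meditate":    (+0.03, +0.08,  0.00, +0.15,  0.00),
    "family_time": (-0.04, -0.02,  0.00, +0.06, +0.15),
    "socialize":   (-0.06, -0.03,  0.00, +0.04, +0.12),
    "me_time":     (+0.05, +0.03,  0.00, +0.10, -0.02),
    "binge_watch": (+0.02, -0.05, -0.02, +0.06, -0.03),
}


def apply_base_effect(state, action):
    """Return a new state with the action's base deltas added, clamped to [0, 1]."""
    deltas = ACTION_EFFECTS[action]
    return {m: min(1.0, max(0.0, state[m] + d)) for m, d in zip(METERS, deltas)}


state = {m: 0.5 for m in METERS}
after = apply_base_effect(state, "deep_work")
print({m: round(v, 2) for m, v in after.items()})
```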

---

## Hidden Personality Profiles

The profile is the simulated person's identity and is **hidden from the agent**. It controls both the reward
weights and how actions affect meters. The agent must infer the active profile from reward patterns across episodes.

### Profile 1 — `introvert_morning`

**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:
- Social vitality drain ×3.0 — socialising is exhausting, not neutral
- Morning (slot 0): cognition and progress gains ×2.0 — peak productivity window
- Solo time (me_time, meditate): serenity +0.10 bonus — recharges alone
- Binge watch triggers shame spiral: serenity −0.15, cognition −0.06
- Connection passive decay: −0.01/step

**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.

---

### Profile 2 — `extrovert_night_owl`

**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%

Hidden modifiers:
- Social vitality drain ×0.2 — socialising energises, barely drains
- Morning (slot 0): cognition and progress gains ×0.4 penalty — groggy zone
- Evening/Night (slots 2–3): cognition and progress gains ×1.8 — peak zone
- Social actions: connection ×2.0 (double connection gain)
- Social actions: serenity +0.06 bonus — people lift mood
- Connection passive decay: −0.01/step

**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; deep work in evening.

---

### Profile 3 — `workaholic_stoic`

**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:
- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery — energised by output
- Productive work: serenity +0.10 bonus — meaning comes from progress
- Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 — idle guilt
- Extra vitality passive decay: −0.04/step — burnout risk
- Random event negative impact ×0.5 — stoic resilience
- Connection passive decay: −0.02/step — faster relational drift

**Agent discovers:** Keep working; rest only when vitality is critical; neglecting relationships costs connection.
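
The reward weights of the three profiles can be compared side by side. The sketch below copies the percentages listed above into a dictionary (the name `REWARD_WEIGHTS` and the helper `dominant_meter` are illustrative, not taken from the environment code) and checks that each profile's weights sum to 1.

```python
REWARD_WEIGHTS = {
    "introvert_morning": {
        "serenity": 0.60, "progress": 0.20, "connection": 0.10,
        "vitality": 0.05, "cognition": 0.05,
    },
    "extrovert_night_owl": {
        "connection": 0.75, "progress": 0.10, "vitality": 0.05,
        "cognition": 0.05, "serenity": 0.05,
    },
    "workaholic_stoic": {
        "progress": 0.70, "serenity": 0.10, "connection": 0.10,
        "vitality": 0.05, "cognition": 0.05,
    },
}


def dominant_meter(profile_name):
    """Return the meter with the largest reward weight for a profile."""
    weights = REWARD_WEIGHTS[profile_name]
    return max(weights, key=weights.get)


for name, weights in REWARD_WEIGHTS.items():
    # Each profile's weights form a convex combination over the five meters.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    print(f"{name}: dominant meter = {dominant_meter(name)}")
```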

---

## Time-of-Day Multipliers

Applied to all non-sleep actions based on current slot.

| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 — Morning | ×1.2 | ×0.8 |
| 1 — Afternoon | ×1.0 | ×1.0 |
| 2 — Evening | ×0.8 | ×1.1 |
| 3 — Night | ×0.6 | ×1.3 |

These multipliers are global; profile-specific time bonuses (HV1) are layered on top.
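
A minimal sketch of the global multipliers, under two assumptions that the table implies but does not spell out: the cognition multiplier scales only positive cognition deltas (gains), and the vitality multiplier scales only negative vitality deltas (drains). The function name `apply_time_of_day` is hypothetical.

```python
# slot -> (cognition gain multiplier, vitality drain multiplier)
TIME_MULTIPLIERS = {
    0: (1.2, 0.8),  # Morning
    1: (1.0, 1.0),  # Afternoon
    2: (0.8, 1.1),  # Evening
    3: (0.6, 1.3),  # Night
}


def apply_time_of_day(deltas, slot, action):
    """Scale cognition gains and vitality drains by the slot's multipliers."""
    out = dict(deltas)
    if action == "sleep":  # sleep is exempt from time-of-day multipliers
        return out
    cognition_gain, vitality_drain = TIME_MULTIPLIERS[slot]
    if out["cognition"] > 0:   # assumed: gains only
        out["cognition"] *= cognition_gain
    if out["vitality"] < 0:    # assumed: drains only
        out["vitality"] *= vitality_drain
    return out


deep_work = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
             "serenity": -0.05, "connection": 0.0}
night = apply_time_of_day(deep_work, slot=3, action="deep_work")
print(round(night["vitality"], 3))  # -0.156
```

Under these assumptions, deep work at night costs 30% more vitality than in the afternoon, while its cognition cost is unchanged because it is a drain rather than a gain.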

---

## Passive Decays (every step, after action effects)

| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces above) |
| workaholic_stoic | Vitality | −0.04/step |
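
The table can be read as a default decay plus per-profile overrides. The sketch below encodes it that way; the names `DEFAULT_DECAY`, `PROFILE_DECAY`, `passive_decay`, and `weekly_drift` are illustrative only.

```python
DEFAULT_DECAY = {"connection": -0.01}
PROFILE_DECAY = {
    # The workaholic's connection decay replaces the default; vitality decay is extra.
    "workaholic_stoic": {"connection": -0.02, "vitality": -0.04},
}


def passive_decay(profile_name):
    """Per-step decay for a profile: defaults, overridden by profile entries."""
    decay = dict(DEFAULT_DECAY)
    decay.update(PROFILE_DECAY.get(profile_name, {}))
    return decay


def weekly_drift(profile_name, steps=28):
    """Total decay over an episode if no action ever offsets it."""
    return {meter: d * steps for meter, d in passive_decay(profile_name).items()}


print(weekly_drift("introvert_morning"))
print(weekly_drift("workaholic_stoic"))
```

Left unopposed for a full 28-step week, the default decay removes about 0.28 of connection, and the `workaholic_stoic` decay removes about 0.56.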

---

## Random Events

Roll probability: 8% per step.

| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |

Negative effects are scaled by the per-profile `event_impact_multiplier`
(workaholic_stoic = 0.5; others = 1.0 or 0.8), so a multiplier below 1.0 softens the impact.
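
A hedged sketch of the event roll and the impact scaling. It assumes that, once an event fires, the four events are equally likely (the source gives only the 8% roll probability) and that positive effects such as those of `good_news` are not scaled. The function names are hypothetical.

```python
import random

EVENT_PROBABILITY = 0.08

EVENTS = {
    "prod_crash":       {"vitality": -0.08, "cognition": -0.10, "progress": -0.10,
                         "serenity": -0.15, "connection": 0.00},
    "family_emergency": {"vitality": -0.05, "cognition": -0.08, "progress": 0.00,
                         "serenity": -0.12, "connection": -0.10},
    "illness":          {"vitality": -0.20, "cognition": -0.10, "progress": 0.00,
                         "serenity": -0.05, "connection": 0.00},
    "good_news":        {"vitality": 0.05, "cognition": 0.03, "progress": 0.00,
                         "serenity": 0.10, "connection": 0.05},
}


def roll_event(rng):
    """Return an event name with probability 0.08, else None."""
    if rng.random() < EVENT_PROBABILITY:
        return rng.choice(sorted(EVENTS))  # assumed: uniform over events
    return None


def scale_event(name, event_impact_multiplier):
    """Scale only the negative deltas of an event by the profile multiplier."""
    return {meter: delta * event_impact_multiplier if delta < 0 else delta
            for meter, delta in EVENTS[name].items()}


rng = random.Random(0)  # seeded, so the rolls are reproducible
events = [roll_event(rng) for _ in range(28)]
print([e for e in events if e is not None])
print(scale_event("illness", 0.5))  # vitality hit is halved for workaholic_stoic
```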

---

## Reward Computation

### Per-step reward

```
reward = sum(meter_delta × profile_weight for each meter) × 15.0
```

Profile reward weights are **hidden**. Same action, different profile → very different reward.

Example — DEEP_WORK, step 1, same initial state:
```
workaholic_stoic:    +1.57  (progress weight = 70%)
introvert_morning:   +0.32  (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: −0.39  (connection weight = 75%; deep work gives 0 connection)
```
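
The weighted sum can be sketched as below. The function name `step_reward` is illustrative, and the inputs are the unmodified base deltas of `deep_work`, so the results differ from the example figures above, which appear to include modifiers and passive decays.

```python
REWARD_SCALE = 15.0


def step_reward(deltas, weights):
    """Weighted sum of meter deltas, scaled by REWARD_SCALE."""
    return REWARD_SCALE * sum(weights[m] * deltas.get(m, 0.0) for m in weights)


# Base deltas for deep_work, before any modifier or decay.
DEEP_WORK = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
             "serenity": -0.05, "connection": 0.0}

WORKAHOLIC = {"progress": 0.70, "serenity": 0.10, "connection": 0.10,
              "vitality": 0.05, "cognition": 0.05}
INTROVERT = {"serenity": 0.60, "progress": 0.20, "connection": 0.10,
             "vitality": 0.05, "cognition": 0.05}

print(round(step_reward(DEEP_WORK, WORKAHOLIC), 3))  # 1.65
print(round(step_reward(DEEP_WORK, INTROVERT), 3))   # -0.075
```

Even without modifiers, the same deltas produce rewards of opposite sign under two profiles, which is the signal the agent has to exploit.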

### Modifiers applied during step (in order)

1. Roll and apply random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply global vitality factor (`0.5 + 0.5 × vitality`) — low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 → −0.30
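
Steps 6, 8, and 10 of the list above can be sketched in isolation. This is an illustration under stated assumptions, not the environment's implementation: the vitality factor is applied only to positive deltas, and the floor penalty is charged once per step no matter how many meters are below the floor. All function names are hypothetical.

```python
CRITICAL_FLOOR = 0.10
FLOOR_PENALTY = 0.30


def vitality_factor(vitality):
    """Global factor 0.5 + 0.5 * vitality, in [0.5, 1.0] for vitality in [0, 1]."""
    return 0.5 + 0.5 * vitality


def scale_positive_effects(deltas, vitality):
    """Step 6: low vitality shrinks gains; drains are left untouched (assumed)."""
    factor = vitality_factor(vitality)
    return {m: d * factor if d > 0 else d for m, d in deltas.items()}


def clamp_meters(state):
    """Step 8: keep every meter inside [0.0, 1.0]."""
    return {m: min(1.0, max(0.0, v)) for m, v in state.items()}


def floor_penalty(state):
    """Step 10: -0.30 if any meter is below 0.10 (assumed: once per step)."""
    return -FLOOR_PENALTY if any(v < CRITICAL_FLOOR for v in state.values()) else 0.0


sleep = {"vitality": 0.20, "cognition": 0.10, "serenity": 0.05}
tired = scale_positive_effects(sleep, vitality=0.2)
print(round(tired["vitality"], 3))  # 0.12: sleep restores less when exhausted

state = clamp_meters({"vitality": 0.05, "cognition": 1.2, "serenity": -0.1})
print(state, floor_penalty(state))
```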

### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)

Score in [0.0, 1.0]:

```
score = 0.15 × crash_free_ratio    (1 − crash_count / total_possible_crashes)
      + 0.20 × progress            (final progress meter value)
      + 0.10 × connection          (final connection meter value)
      + 0.25 × adaptation_score    (late-half mean per-step reward minus
                                    early-half mean — gated by absolute
                                    late-half quality so a "terrible-then-
                                    mediocre" exploit cannot win)
      + 0.10 × efficiency_score    (avg step reward normalised to [0, 1])
      + 0.20 × belief_accuracy     (1 − MAE between agent's last-emitted
                                    belief vector and the true profile
                                    vector; 0 if the agent never emitted a
                                    belief — heuristic / random baselines)
```
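
The weighted combination can be written out as a short sketch. The component values in the example are made up for illustration, each component is assumed to be already normalised to [0, 1], and the names `GRADE_WEIGHTS` and `final_score` are not taken from the environment code.

```python
GRADE_WEIGHTS = {
    "crash_free_ratio": 0.15,
    "progress": 0.20,
    "connection": 0.10,
    "adaptation_score": 0.25,
    "efficiency_score": 0.10,
    "belief_accuracy": 0.20,
}


def final_score(components):
    """Weighted sum of the six grade components, each expected in [0, 1]."""
    for name in GRADE_WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(GRADE_WEIGHTS[name] * components[name] for name in GRADE_WEIGHTS)


# Hypothetical episode: healthy meters, but the agent never emitted a belief.
no_belief = {
    "crash_free_ratio": 1.0, "progress": 0.6, "connection": 0.5,
    "adaptation_score": 0.4, "efficiency_score": 0.5, "belief_accuracy": 0.0,
}
with_belief = dict(no_belief, belief_accuracy=0.9)

print(round(final_score(no_belief), 2))    # 0.47
print(round(final_score(with_belief), 2))  # 0.65
```

Because the weights sum to 1.0, the score stays in [0.0, 1.0], and an agent that never emits a belief can reach at most 0.80.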

There are two meta-RL signals: `adaptation_score` is implicit (it rewards getting
better over time, since per-step rewards are profile-weighted), and `belief_accuracy`
is explicit (it rewards inferring the profile correctly). Without the explicit
term, agents that follow a heuristic "keep meters healthy" policy score the same
as agents that actually perform inference, since the other components do not
distinguish inference from reflex.

To emit a belief, the agent calls `env.record_belief([s, m, w])` once per
step (typically right after parsing its own completion). The grader uses the
LAST recorded belief.
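
A minimal sketch of the `belief_accuracy` term as 1 minus the mean absolute error. The source does not define the components of `[s, m, w]`, so the vectors below are treated as opaque three-element lists with values assumed to lie in [0, 1]. The function name and the clamp at 0 are assumptions.

```python
def belief_accuracy(belief, true_vector):
    """1 - MAE between the last emitted belief and the true profile vector.

    Returns 0.0 when no belief was ever recorded (belief is None).
    """
    if belief is None:
        return 0.0
    if len(belief) != len(true_vector):
        raise ValueError("belief and true vector must have the same length")
    mae = sum(abs(b - t) for b, t in zip(belief, true_vector)) / len(true_vector)
    return max(0.0, 1.0 - mae)  # clamp is an assumption, for out-of-range beliefs


# Hypothetical vectors, for illustration only.
true_vector = [0.0, 1.0, 0.5]
print(round(belief_accuracy([0.2, 0.9, 0.4], true_vector), 4))  # 0.8667
print(belief_accuracy(None, true_vector))                       # 0.0
```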

---

## Internal Tracking Variables

Not in the observation. Used by the environment to compute rewards and grade.

| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Number of steps in which any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| `_step_rewards` | Per-step reward list for adaptation_score in the grader |
| `_timestep` | Current step index (0–27) |