# RhythmEnv β€” Entity Definitions
## Episode Structure
```
1 episode = 1 week (7 days)
1 step = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots = 1 day
28 steps = 1 full week
```
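The episode arithmetic above can be sketched in a few lines (a tiny illustration; `describe_timestep` is not part of the env's API):

```python
# Sketch (not the env's API): decompose a timestep index into day and slot,
# per the structure above (4 slots per day, 28 steps per week).
SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def describe_timestep(timestep: int) -> str:
    day, slot = divmod(timestep, 4)   # 4 slots per day
    return f"{DAY_NAMES[day]} {SLOT_NAMES[slot]}"
```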
---
## Observable State
What the agent sees in every observation. No hidden information here.
| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0=Morning, 1=Afternoon, 2=Evening, 3=Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | β€” | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | β€” | Reward received this step |
| `done` | bool | β€” | True on the final step |
| `reward_breakdown` | dict | β€” | Per-meter deltas; `final_score` when done |
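An observation shaped like the table might look as follows (field names come from the table; the concrete values are invented, and `is_valid_obs` is a hypothetical helper checking the stated range invariants):

```python
# Illustrative only: an observation matching the table above.
obs = {
    "timestep": 5, "day": 1, "slot": 1,
    "vitality": 0.62, "cognition": 0.55, "progress": 0.10,
    "serenity": 0.70, "connection": 0.48,
    "active_event": None, "remaining_steps": 23,
    "reward": 1.2, "done": False, "reward_breakdown": {},
}

METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def is_valid_obs(o: dict) -> bool:
    """Check the range invariants stated in the table."""
    return (all(0.0 <= o[m] <= 1.0 for m in METERS)
            and 0 <= o["timestep"] <= 27
            and o["day"] == o["timestep"] // 4
            and o["slot"] == o["timestep"] % 4)
```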
---
## Actions
10 actions, always legal regardless of state.
| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), βˆ’Vitality, βˆ’Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), βˆ’Cognition |
### Action Effect Matrix
Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers.
| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | βˆ’0.12 | βˆ’0.10 | +0.18 | βˆ’0.05 | 0.00 |
| admin_work | βˆ’0.06 | βˆ’0.05 | +0.08 | βˆ’0.03 | 0.00 |
| learn | βˆ’0.08 | βˆ’0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | βˆ’0.04 | βˆ’0.02 | 0.00 | +0.06 | +0.15 |
| socialize | βˆ’0.06 | βˆ’0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | βˆ’0.02 |
| binge_watch | +0.02 | βˆ’0.05 | βˆ’0.02 | +0.06 | βˆ’0.03 |
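The matrix translates directly into a lookup table. A minimal sketch (values copied verbatim from the table; `apply_base_effects` is a hypothetical helper, shown here only to illustrate the base-delta-plus-clamp step):

```python
# The base-effect matrix as a lookup table. Each tuple is
# (vitality, cognition, progress, serenity, connection).
ACTION_EFFECTS = {
    "deep_work":   (-0.12, -0.10,  0.18, -0.05,  0.00),
    "admin_work":  (-0.06, -0.05,  0.08, -0.03,  0.00),
    "learn":       (-0.08, -0.08,  0.12,  0.02,  0.00),
    "sleep":       ( 0.20,  0.10,  0.00,  0.05,  0.00),
    "exercise":    ( 0.12,  0.05,  0.00,  0.08,  0.00),
    "meditate":    ( 0.03,  0.08,  0.00,  0.15,  0.00),
    "family_time": (-0.04, -0.02,  0.00,  0.06,  0.15),
    "socialize":   (-0.06, -0.03,  0.00,  0.04,  0.12),
    "me_time":     ( 0.05,  0.03,  0.00,  0.10, -0.02),
    "binge_watch": ( 0.02, -0.05, -0.02,  0.06, -0.03),
}
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def apply_base_effects(meters: dict, action: str) -> dict:
    """Add the base deltas for `action`, clamping each meter to [0, 1]."""
    deltas = dict(zip(METERS, ACTION_EFFECTS[action]))
    return {m: min(1.0, max(0.0, v + deltas[m])) for m, v in meters.items()}
```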
---
## Hidden Personality Profiles
The person's identity. **Hidden from the agent.** Controls both the reward weights and how
actions affect meters. The agent must infer the active profile from reward patterns across episodes.
### Profile 1 β€” `introvert_morning`
**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%
Hidden modifiers:
- Social vitality drain Γ—3.0 β€” socialising is exhausting, not neutral
- Morning (slot 0): cognition and progress gains Γ—2.0 β€” peak productivity window
- Solo time (me_time, meditate): serenity +0.10 bonus β€” recharges alone
- Binge watch triggers shame spiral: serenity βˆ’0.15, cognition βˆ’0.06
- Connection passive decay: βˆ’0.01/step
**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.
---
### Profile 2 β€” `extrovert_night_owl`
**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%
Hidden modifiers:
- Social vitality drain Γ—0.2 β€” socialising energises, barely drains
- Morning (slot 0): cognition and progress gains Γ—0.4 β€” groggy zone
- Evening/Night (slots 2–3): cognition and progress gains Γ—1.8 β€” peak zone
- Social actions: connection Γ—2.0 (double connection gain)
- Social actions: serenity +0.06 bonus β€” people lift mood
- Connection passive decay: βˆ’0.01/step
**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; deep work in evening.
---
### Profile 3 β€” `workaholic_stoic`
**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%
Hidden modifiers:
- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery β€” energised by output
- Productive work: serenity +0.10 bonus β€” meaning comes from progress
- Idle actions (me_time, binge_watch, sleep when optional): serenity βˆ’0.10 β€” idle guilt
- Extra vitality passive decay: βˆ’0.04/step β€” burnout risk
- Random event negative impact Γ—0.5 β€” stoic resilience
- Connection passive decay: βˆ’0.02/step β€” faster relational drift
**Agent discovers:** Keep working; rest only when vitality is critical; connection erodes fast unless deliberately maintained.
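To make the "layering" concrete, here is a sketch of a single hidden modifier from Profile 1 applied on top of the base matrix (the helper and constant names are hypothetical, not the env's internals):

```python
# Sketch: introvert_morning triples the vitality drain of social actions.
# Helper/constant names are hypothetical; the x3.0 factor is from the doc.
SOCIAL_ACTIONS = {"socialize", "family_time"}
SOCIAL_DRAIN_MULT = 3.0

def introvert_vitality_delta(action: str, base_delta: float) -> float:
    """Vitality delta after the introvert's social-drain modifier."""
    if action in SOCIAL_ACTIONS and base_delta < 0:
        return base_delta * SOCIAL_DRAIN_MULT
    return base_delta
```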
---
## Time-of-Day Multipliers
Applied to all non-sleep actions based on the current slot.
| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 β€” Morning | Γ—1.2 | Γ—0.8 |
| 1 β€” Afternoon | Γ—1.0 | Γ—1.0 |
| 2 β€” Evening | Γ—0.8 | Γ—1.1 |
| 3 β€” Night | Γ—0.6 | Γ—1.3 |
These are global. Profile-specific time bonuses (HV1 β€” e.g. `introvert_morning`'s morning Γ—2.0) layer on top.
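The table can be sketched as follows, assuming (per the column headers) that the first factor scales positive cognition deltas and the second scales negative vitality deltas:

```python
# Global slot multipliers from the table: (cognition-gain, vitality-drain).
SLOT_MULT = {0: (1.2, 0.8), 1: (1.0, 1.0), 2: (0.8, 1.1), 3: (0.6, 1.3)}

def apply_slot_mult(slot: int, cog_delta: float, vit_delta: float):
    cog_gain_mult, vit_drain_mult = SLOT_MULT[slot]
    if cog_delta > 0:
        cog_delta *= cog_gain_mult     # scales cognition *gains* only
    if vit_delta < 0:
        vit_delta *= vit_drain_mult    # scales vitality *drains* only
    return cog_delta, vit_delta
```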
---
## Passive Decays (every step, before action effects)
| Profile | Meter | Decay |
|---|---|---|
| All | Connection | βˆ’0.01/step |
| workaholic_stoic | Connection | βˆ’0.02/step (replaces above) |
| workaholic_stoic | Vitality | βˆ’0.04/step |
---
## Random Events
Roll probability: 8% per step.
| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | βˆ’0.08 | βˆ’0.10 | βˆ’0.10 | βˆ’0.15 | 0.00 |
| family_emergency | βˆ’0.05 | βˆ’0.08 | 0.00 | βˆ’0.12 | βˆ’0.10 |
| illness | βˆ’0.20 | βˆ’0.10 | 0.00 | βˆ’0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |
Negative event effects are scaled by each profile's `event_impact_multiplier`
(workaholic_stoic = 0.5; others = 1.0 or 0.8).
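A sketch of the event machinery under those rules: an 8% roll per step, with the profile multiplier shrinking only the negative deltas. `EVENTS` copies the table above; the helper names and the uniform event choice are assumptions:

```python
# Hypothetical sketch of the 8%-per-step event roll and impact scaling.
import random

EVENTS = {
    "prod_crash":       {"vitality": -0.08, "cognition": -0.10, "progress": -0.10, "serenity": -0.15, "connection":  0.00},
    "family_emergency": {"vitality": -0.05, "cognition": -0.08, "progress":  0.00, "serenity": -0.12, "connection": -0.10},
    "illness":          {"vitality": -0.20, "cognition": -0.10, "progress":  0.00, "serenity": -0.05, "connection":  0.00},
    "good_news":        {"vitality":  0.05, "cognition":  0.03, "progress":  0.00, "serenity":  0.10, "connection":  0.05},
}

def scale_negative(effects: dict, impact_mult: float) -> dict:
    """Shrink only the negative deltas by the profile multiplier."""
    return {m: d * impact_mult if d < 0 else d for m, d in effects.items()}

def roll_event(rng: random.Random, impact_mult: float):
    """Return (event_name, scaled_effects), or (None, {}) 92% of the time."""
    if rng.random() >= 0.08:
        return None, {}
    name = rng.choice(sorted(EVENTS))
    return name, scale_negative(EVENTS[name], impact_mult)
```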
---
## Reward Computation
### Per-step reward
```
reward = sum(meter_delta Γ— profile_weight for each meter) Γ— 15.0
```
Profile reward weights are **hidden**. Same action, different profile β†’ very different reward.
Example β€” DEEP_WORK, step 1, same initial state:
```
workaholic_stoic: +1.57 (progress weight = 70%)
introvert_morning: +0.32 (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: βˆ’0.39 (connection weight = 75%; deep work gives 0 connection)
```
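The formula can be sketched with the base `deep_work` deltas and `workaholic_stoic`'s weights. Note this uses base deltas only β€” the worked numbers above also include the step modifiers, so it will not reproduce +1.57 exactly:

```python
# Sketch of the per-step reward: hidden profile weights dotted with the
# meter deltas, times REWARD_SCALE. Weights are workaholic_stoic's;
# base deep_work deltas (no modifiers) are used for illustration.
REWARD_SCALE = 15.0
WORKAHOLIC_WEIGHTS = {"progress": 0.70, "serenity": 0.10, "connection": 0.10,
                      "vitality": 0.05, "cognition": 0.05}

def step_reward(deltas: dict, weights: dict) -> float:
    return sum(deltas[m] * weights[m] for m in weights) * REWARD_SCALE

deep_work = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
             "serenity": -0.05, "connection": 0.00}
```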
### Modifiers applied during step (in order)
1. Roll and apply random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3Γ— in a row β†’ 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply global vitality factor (`0.5 + 0.5 Γ— vitality`) β€” low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas Γ— REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 β†’ βˆ’0.30
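The numeric tail of that pipeline (steps 6–10) can be sketched as runnable code. Steps 1–5 are assumed to have already produced `deltas`; computing the reward from post-clamp deltas is an assumption, since the list does not say whether clamping happens before the weighted sum:

```python
# Simplified, hypothetical sketch of steps 6-10.
REWARD_SCALE = 15.0
CRITICAL_FLOOR, FLOOR_PENALTY = 0.10, 0.30

def finish_step(meters: dict, deltas: dict, weights: dict):
    factor = 0.5 + 0.5 * meters["vitality"]                   # step 6
    scaled = {m: (d * factor if d > 0 else d) for m, d in deltas.items()}
    new = {m: min(1.0, max(0.0, meters[m] + scaled[m])) for m in meters}  # step 8
    reward = sum((new[m] - meters[m]) * weights[m]
                 for m in weights) * REWARD_SCALE             # step 9
    if any(v < CRITICAL_FLOOR for v in new.values()):         # step 10
        reward -= FLOOR_PENALTY
    return new, reward
```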
### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)
Score in [0.0, 1.0]:
```
score = 0.15 Γ— crash_free_ratio (1 βˆ’ crash_count / total_possible_crashes)
+ 0.20 Γ— progress (final progress meter value)
+ 0.10 Γ— connection (final connection meter value)
+ 0.25 Γ— adaptation_score (late-half mean per-step reward minus
early-half mean β€” gated by absolute
late-half quality so a "terrible-then-
mediocre" exploit cannot win)
+ 0.10 Γ— efficiency_score (avg step reward normalised to [0, 1])
+ 0.20 Γ— belief_accuracy (1 βˆ’ MAE between agent's last-emitted
belief vector and the true profile
vector; 0 if the agent never emitted a
belief β€” heuristic / random baselines)
```
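As a weighted sum this is straightforward; a minimal sketch, assuming each component has already been normalised to [0, 1] as described:

```python
# Sketch of the final grade: weights copied from the formula above.
GRADE_WEIGHTS = {
    "crash_free_ratio": 0.15, "progress": 0.20, "connection": 0.10,
    "adaptation_score": 0.25, "efficiency_score": 0.10, "belief_accuracy": 0.20,
}

def final_score(components: dict) -> float:
    """Weighted sum of pre-normalised [0, 1] components."""
    return sum(GRADE_WEIGHTS[k] * components[k] for k in GRADE_WEIGHTS)
```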
Two meta-RL signals: `adaptation_score` is implicit (rewards getting better
over time, since per-step rewards are profile-weighted), and `belief_accuracy`
is explicit (rewards INFERRING the profile correctly). Without the explicit
term, agents that play heuristic-style "keep meters healthy" score the same
as agents that actually do inference, since the other components don't
differentiate inference from reflex.
To emit a belief, the agent calls `env.record_belief([s, m, w])` once per
step (typically right after parsing its own completion). The grader uses the
LAST recorded belief.
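A plausible sketch of the `belief_accuracy` term, reading it as 1 minus the MAE between the last recorded belief and the true profile vector. Representing the true profile as a one-hot over the three profiles is an assumption:

```python
# Hypothetical sketch of the belief_accuracy grader term.
def belief_accuracy(last_belief, true_vector):
    if last_belief is None:          # agent never emitted a belief -> 0
        return 0.0
    mae = sum(abs(b - t) for b, t in zip(last_belief, true_vector)) / len(true_vector)
    return 1.0 - mae
```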
---
## Internal Tracking Variables
Not in the observation. Used by the environment to compute rewards and grade.
| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Steps where any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| `_step_rewards` | Per-step reward list for adaptation_score in the grader |
| `_timestep` | Current step index (0–27) |