
RhythmEnv – Entity Definitions

Episode Structure

1 episode  = 1 week (7 days)
1 step     = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots    = 1 day
28 steps   = 1 full week
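The mapping from a step index to (day, slot) follows directly from the layout above. A minimal sketch (the helper name is illustrative, not part of the environment's API):

```python
def decompose(timestep: int) -> tuple[int, int]:
    """Map a step index (0-27) to (day, slot): 4 slots per day, 7 days."""
    return timestep // 4, timestep % 4

# timestep 0 is Monday Morning (0, 0); timestep 27 is Sunday Night (6, 3)
```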

Observable State

What the agent sees in every observation. No hidden information here.

| Variable | Type | Range | Description |
|---|---|---|---|
| timestep | int | 0–27 | Current step (0 = Monday Morning) |
| day | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| slot | int | 0–3 | Time of day (0 = Morning, 1 = Afternoon, 2 = Evening, 3 = Night) |
| vitality | float | 0–1 | Physical energy and sleep quality |
| cognition | float | 0–1 | Mental clarity and focus |
| progress | float | 0–1 | Career and skill advancement made this week |
| serenity | float | 0–1 | Inner peace and stress management |
| connection | float | 0–1 | Relationship health |
| active_event | str \| null | – | Random event this step (null if none) |
| remaining_steps | int | 0–28 | Steps left in episode |
| reward | float | – | Reward received this step |
| done | bool | – | True on the final step |
| reward_breakdown | dict | – | Per-meter deltas; final_score when done |
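Putting the table together, a mid-episode observation might look like the following dict. Field names come from the table above; all values are invented for illustration:

```python
# Illustrative observation for a hypothetical Wednesday Afternoon step.
# Values are made up; only the field names and ranges come from the spec.
obs = {
    "timestep": 9,           # Wednesday Afternoon: 9 // 4 = day 2, 9 % 4 = slot 1
    "day": 2,
    "slot": 1,
    "vitality": 0.62,
    "cognition": 0.55,
    "progress": 0.30,
    "serenity": 0.48,
    "connection": 0.41,
    "active_event": None,    # no random event this step
    "remaining_steps": 19,   # 28 total steps, 9 already taken
    "reward": 1.12,
    "done": False,
    "reward_breakdown": {"progress": 0.08, "vitality": -0.05},
}
```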

Actions

10 actions, always legal regardless of state.

| Action | Category | Primary Effect |
|---|---|---|
| DEEP_WORK | Productivity | +Progress (large), −Vitality, −Cognition |
| ADMIN_WORK | Productivity | +Progress (small), light drain |
| LEARN | Productivity | +Progress, slight +Serenity |
| SLEEP | Recovery | +Vitality (large), +Cognition |
| EXERCISE | Recovery | +Vitality, +Serenity |
| MEDITATE | Recovery | +Serenity (large), +Cognition |
| FAMILY_TIME | Social | +Connection (large), +Serenity |
| SOCIALIZE | Social | +Connection |
| ME_TIME | Leisure | +Serenity, +Vitality (small) |
| BINGE_WATCH | Leisure | +Serenity (small), −Cognition |

Action Effect Matrix

Base deltas per action on each meter, before any profile modifiers or time-of-day multipliers.

| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |
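The matrix translates directly into a lookup table. A sketch of the data structure and of applying base deltas with clamping (the actual environment's representation may differ):

```python
# Base deltas per action, in meter order, copied from the matrix above.
ACTION_EFFECTS = {
    #             vitality cognition progress serenity connection
    "deep_work":   (-0.12, -0.10,  0.18, -0.05,  0.00),
    "admin_work":  (-0.06, -0.05,  0.08, -0.03,  0.00),
    "learn":       (-0.08, -0.08,  0.12,  0.02,  0.00),
    "sleep":       ( 0.20,  0.10,  0.00,  0.05,  0.00),
    "exercise":    ( 0.12,  0.05,  0.00,  0.08,  0.00),
    "meditate":    ( 0.03,  0.08,  0.00,  0.15,  0.00),
    "family_time": (-0.04, -0.02,  0.00,  0.06,  0.15),
    "socialize":   (-0.06, -0.03,  0.00,  0.04,  0.12),
    "me_time":     ( 0.05,  0.03,  0.00,  0.10, -0.02),
    "binge_watch": ( 0.02, -0.05, -0.02,  0.06, -0.03),
}

METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def apply_base_effects(state: dict, action: str) -> dict:
    """Add the base deltas to each meter and clamp to [0, 1]."""
    deltas = dict(zip(METERS, ACTION_EFFECTS[action]))
    return {m: min(1.0, max(0.0, state[m] + deltas[m])) for m in METERS}
```

In the full pipeline these base deltas are only a starting point; dampening, time-of-day multipliers, and profile modifiers all rescale them before clamping.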

Hidden Personality Profiles

The person's identity, hidden from the agent. It controls both the reward weights and how actions affect the meters; the agent must infer the active profile from reward patterns across episodes.

Profile 1 β€” introvert_morning

Reward weights: Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

  • Social vitality drain ×3.0 – socialising is exhausting, not neutral
  • Morning (slot 0): cognition and progress gains ×2.0 – peak productivity window
  • Solo time (me_time, meditate): serenity +0.10 bonus – recharges alone
  • Binge watch triggers shame spiral: serenity −0.15, cognition −0.06
  • Connection passive decay: −0.01/step

Agent discovers: Mornings are sacred; social activities are costly; alone time heals.


Profile 2 β€” extrovert_night_owl

Reward weights: Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%

Hidden modifiers:

  • Social vitality drain ×0.2 – socialising energises, barely drains
  • Morning (slot 0): cognition and progress gains ×0.4 penalty – groggy zone
  • Evening/Night (slots 2–3): cognition and progress gains ×1.8 – peak zone
  • Social actions: connection ×2.0 (double connection gain)
  • Social actions: serenity +0.06 bonus – people lift mood
  • Connection passive decay: −0.01/step

Agent discovers: Avoid cognitive work in the morning; socialise to charge up; deep work in evening.


Profile 3 β€” workaholic_stoic

Reward weights: Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

  • Productive work (deep_work, learn, admin_work): vitality +0.06 recovery – energised by output
  • Productive work: serenity +0.10 bonus – meaning comes from progress
  • Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 – idle guilt
  • Extra vitality passive decay: −0.04/step – burnout risk
  • Random event negative impact ×0.5 – stoic resilience
  • Connection passive decay: −0.02/step – faster relational drift

Agent discovers: Keep working; rest only when vitality is critical; accept the faster loss of connection as the price.
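Internally, each profile can be pictured as a config dict like the sketch below. The key names are assumptions, not the environment's actual schema; the values for workaholic_stoic are taken from the section above:

```python
# Hypothetical profile config (key names invented for illustration;
# values copied from the workaholic_stoic description).
WORKAHOLIC_STOIC = {
    "reward_weights": {"progress": 0.70, "serenity": 0.10,
                       "connection": 0.10, "vitality": 0.05, "cognition": 0.05},
    "productive_vitality_recovery": 0.06,  # deep_work / learn / admin_work
    "productive_serenity_bonus": 0.10,     # meaning comes from progress
    "idle_serenity_penalty": -0.10,        # me_time / binge_watch / optional sleep
    "extra_vitality_decay": -0.04,         # per step, burnout risk
    "event_impact_multiplier": 0.5,        # halves negative event effects
    "connection_decay": -0.02,             # per step, replaces global -0.01
}
```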


Time-of-Day Multipliers

Applied to all non-sleep actions based on current slot.

| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 – Morning | ×1.2 | ×0.8 |
| 1 – Afternoon | ×1.0 | ×1.0 |
| 2 – Evening | ×0.8 | ×1.1 |
| 3 – Night | ×0.6 | ×1.3 |

These are global. Profile-specific time bonuses (e.g. HV1's morning ×2.0, HV2's evening/night ×1.8) layer on top.
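A sketch of applying these multipliers, assuming (from the column headers) that they scale only cognition *gains* and vitality *drains*, and that sleep is exempt:

```python
# Slot -> (cognition-gain multiplier, vitality-drain multiplier),
# copied from the table above.
TOD = {0: (1.2, 0.8), 1: (1.0, 1.0), 2: (0.8, 1.1), 3: (0.6, 1.3)}

def apply_time_of_day(deltas: dict, slot: int, action: str) -> dict:
    """Scale positive cognition deltas and negative vitality deltas by slot."""
    if action == "sleep":            # table applies to non-sleep actions only
        return dict(deltas)
    cog_mult, vit_mult = TOD[slot]
    out = dict(deltas)
    if out["cognition"] > 0:         # "gain" multiplier: gains only (assumed)
        out["cognition"] *= cog_mult
    if out["vitality"] < 0:          # "drain" multiplier: drains only (assumed)
        out["vitality"] *= vit_mult
    return out
```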


Passive Decays (every step, before action effects)

| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces the global −0.01) |
| workaholic_stoic | Vitality | −0.04/step |

Random Events

Roll probability: 8% per step.

| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |

Negative event effects are scaled by each profile's event_impact_multiplier (workaholic_stoic = 0.5; others = 1.0 or 0.8). Positive effects are unaffected.
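A sketch of event application under that rule, scaling only negative components by the profile's multiplier (helper names are illustrative; two events shown):

```python
# Two events from the table above; per-meter deltas copied verbatim.
EVENTS = {
    "illness":   {"vitality": -0.20, "cognition": -0.10, "serenity": -0.05},
    "good_news": {"vitality":  0.05, "cognition":  0.03,
                  "serenity":  0.10, "connection": 0.05},
}

def apply_event(state: dict, event: str, impact_mult: float) -> dict:
    """Apply an event's deltas; only negative deltas are scaled by the profile."""
    out = dict(state)
    for meter, delta in EVENTS[event].items():
        scaled = delta * impact_mult if delta < 0 else delta
        out[meter] = min(1.0, max(0.0, out[meter] + scaled))
    return out
```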


Reward Computation

Per-step reward

reward = sum(meter_delta × profile_weight for each meter) × 15.0

Profile reward weights are hidden. Same action, different profile → very different reward.

Example β€” DEEP_WORK, step 1, same initial state:

workaholic_stoic:    +1.57  (progress weight = 70%)
introvert_morning:   +0.32  (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: −0.39  (connection weight = 75%; deep work gives 0 connection)
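The weighting mechanism can be sketched on base deltas alone. This skips the modifier pipeline, so the numbers will not match the worked example above exactly; it only shows how the hidden weights pull the same action's reward apart:

```python
# Weighted-sum reward on *base* deltas only (no modifiers), so values
# differ from the full environment's worked example.
REWARD_SCALE = 15.0

def step_reward(deltas: dict, weights: dict) -> float:
    return sum(deltas[m] * weights[m] for m in deltas) * REWARD_SCALE

deep_work = {"vitality": -0.12, "cognition": -0.10,
             "progress": 0.18, "serenity": -0.05, "connection": 0.00}
# Reward weights from the profile sections above.
stoic     = {"vitality": 0.05, "cognition": 0.05,
             "progress": 0.70, "serenity": 0.10, "connection": 0.10}
night_owl = {"vitality": 0.05, "cognition": 0.05,
             "progress": 0.10, "serenity": 0.05, "connection": 0.75}
```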

Modifiers applied during step (in order)

  1. Roll and apply random event (if any)
  2. Get base action effects (ACTION_EFFECTS matrix)
  3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
  4. Apply time-of-day multipliers (cognition gain, vitality drain)
  5. Apply profile-specific modifiers (HV1/HV2/HV3)
  6. Apply global vitality factor (0.5 + 0.5 × vitality) – low vitality reduces positive effects
  7. Apply passive decays (connection, workaholic vitality)
  8. Clamp all meters to [0.0, 1.0]
  9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
  10. Apply critical floor penalty: any meter < 0.10 → −0.30
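Steps (3) and (6) can be sketched as small helpers. The dampening schedule is an assumed reading of "25% / 50% / 75% effect reduction": the 2nd, 3rd, and 4th-plus consecutive use of the same action keeps 75%, 50%, and 25% of its effect respectively:

```python
def dampening_factor(history: list, action: str) -> float:
    """Fraction of the action's effect that survives repetition dampening."""
    streak = 0
    for prev in reversed(history):   # count the trailing run of `action`
        if prev != action:
            break
        streak += 1
    return {0: 1.0, 1: 0.75, 2: 0.50}.get(streak, 0.25)

def vitality_factor(vitality: float) -> float:
    """Step (6): low vitality scales down positive effects."""
    return 0.5 + 0.5 * vitality
```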

Final grade (returned in reward_breakdown["final_score"] when done=True)

Score in [0.0, 1.0]:

score = 0.15 × crash_free_ratio    (1 − crash_count / total_possible_crashes)
      + 0.20 × progress            (final progress meter value)
      + 0.10 × connection          (final connection meter value)
      + 0.25 × adaptation_score    (late-half mean per-step reward minus
                                    early-half mean, gated by absolute
                                    late-half quality so a "terrible-then-
                                    mediocre" exploit cannot win)
      + 0.10 × efficiency_score    (avg step reward normalised to [0, 1])
      + 0.20 × belief_accuracy     (1 − MAE between agent's last-emitted
                                    belief vector and the true profile
                                    vector; 0 if the agent never emitted a
                                    belief, as with heuristic / random baselines)
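Assuming each component is already normalised to [0, 1], the grade is a plain weighted sum (component names follow the formula above; the function is a sketch, not the grader's actual signature):

```python
def final_score(crash_free: float, progress: float, connection: float,
                adaptation: float, efficiency: float,
                belief_accuracy: float) -> float:
    """Weighted sum of the six grade components; weights total 1.0."""
    return (0.15 * crash_free + 0.20 * progress + 0.10 * connection
            + 0.25 * adaptation + 0.10 * efficiency
            + 0.20 * belief_accuracy)
```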

There are two meta-RL signals. adaptation_score is implicit: because per-step rewards are profile-weighted, it rewards getting better over time. belief_accuracy is explicit: it rewards correctly INFERRING the profile. Without the explicit term, an agent playing a heuristic "keep meters healthy" policy would score the same as one that actually does inference, since the other components don't distinguish inference from reflex.

To emit a belief, the agent calls env.record_belief([s, m, w]) once per step (typically right after parsing its own completion). The grader uses the LAST recorded belief.
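A sketch of the belief_accuracy computation. The exact shape of the true profile vector is not specified here; a 3-element vector aligned with the [s, m, w] belief is assumed:

```python
def belief_accuracy(last_belief, true_profile_vec) -> float:
    """1 - MAE against the true profile vector; 0 if no belief was recorded."""
    if last_belief is None:
        return 0.0
    mae = sum(abs(b - t) for b, t in zip(last_belief, true_profile_vec))
    return 1.0 - mae / len(true_profile_vec)
```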


Internal Tracking Variables

Not in the observation. Used by the environment to compute rewards and grade.

| Variable | Description |
|---|---|
| _profile | Active profile dict (hidden from the agent) |
| _rng | Seeded random instance for event rolls and profile selection |
| _crash_count | Steps where any meter fell below 0.10 |
| _total_reward | Running sum of step rewards for the efficiency score |
| _step_history | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| _step_rewards | Per-step reward list for adaptation_score in the grader |
| _timestep | Current step index (0–27) |