
RhythmEnv – Entity Definitions

Episode Structure

1 episode  = 1 week (7 days)
1 step     = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots    = 1 day
28 steps   = 1 full week
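The mapping from a step index to (day, slot) follows directly from the layout above. A minimal sketch (the helper name is illustrative, not part of the environment's API):

```python
def decompose(timestep: int) -> tuple[int, int]:
    """Map a step index (0-27) to (day, slot): 4 slots per day, 7 days."""
    return timestep // 4, timestep % 4

# timestep 0 is Monday Morning (0, 0); timestep 27 is Sunday Night (6, 3)
```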

Observable State

What the agent sees in every observation. No hidden information here.

| Variable | Type | Range | Description |
|---|---|---|---|
| timestep | int | 0–27 | Current step (0 = Monday Morning) |
| day | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| slot | int | 0–3 | Time of day (0 = Morning, 1 = Afternoon, 2 = Evening, 3 = Night) |
| vitality | float | 0–1 | Physical energy and sleep quality |
| cognition | float | 0–1 | Mental clarity and focus |
| progress | float | 0–1 | Career and skill advancement made this week |
| serenity | float | 0–1 | Inner peace and stress management |
| connection | float | 0–1 | Relationship health |
| active_event | str \| null | – | Random event this step (null if none) |
| remaining_steps | int | 0–28 | Steps left in episode |
| reward | float | – | Reward received this step |
| done | bool | – | True on the final step |
| reward_breakdown | dict | – | Per-meter deltas; final_score when done |
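Putting the table together, a mid-episode observation might look like the following dict. Field names come from the table above; all values are invented for illustration:

```python
# Illustrative observation for a hypothetical Wednesday Afternoon step.
# Values are made up; only the field names and ranges come from the spec.
obs = {
    "timestep": 9,           # Wednesday Afternoon: 9 // 4 = day 2, 9 % 4 = slot 1
    "day": 2,
    "slot": 1,
    "vitality": 0.62,
    "cognition": 0.55,
    "progress": 0.30,
    "serenity": 0.48,
    "connection": 0.41,
    "active_event": None,    # no random event this step
    "remaining_steps": 19,   # 28 total steps, 9 already taken
    "reward": 1.12,
    "done": False,
    "reward_breakdown": {"progress": 0.08, "vitality": -0.05},
}
```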

Actions

10 actions, always legal regardless of state.

| Action | Category | Primary Effect |
|---|---|---|
| DEEP_WORK | Productivity | +Progress (large), −Vitality, −Cognition |
| ADMIN_WORK | Productivity | +Progress (small), light drain |
| LEARN | Productivity | +Progress, slight +Serenity |
| SLEEP | Recovery | +Vitality (large), +Cognition |
| EXERCISE | Recovery | +Vitality, +Serenity |
| MEDITATE | Recovery | +Serenity (large), +Cognition |
| FAMILY_TIME | Social | +Connection (large), +Serenity |
| SOCIALIZE | Social | +Connection |
| ME_TIME | Leisure | +Serenity, +Vitality (small) |
| BINGE_WATCH | Leisure | +Serenity (small), −Cognition |

Action Effect Matrix

Base deltas per action on each meter, before any profile modifiers or time-of-day multipliers.

| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |
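The matrix translates directly into a lookup table. A sketch of the data structure and of applying base deltas with clamping (the actual environment's representation may differ):

```python
# Base deltas per action, in meter order, copied from the matrix above.
ACTION_EFFECTS = {
    #             vitality cognition progress serenity connection
    "deep_work":   (-0.12, -0.10,  0.18, -0.05,  0.00),
    "admin_work":  (-0.06, -0.05,  0.08, -0.03,  0.00),
    "learn":       (-0.08, -0.08,  0.12,  0.02,  0.00),
    "sleep":       ( 0.20,  0.10,  0.00,  0.05,  0.00),
    "exercise":    ( 0.12,  0.05,  0.00,  0.08,  0.00),
    "meditate":    ( 0.03,  0.08,  0.00,  0.15,  0.00),
    "family_time": (-0.04, -0.02,  0.00,  0.06,  0.15),
    "socialize":   (-0.06, -0.03,  0.00,  0.04,  0.12),
    "me_time":     ( 0.05,  0.03,  0.00,  0.10, -0.02),
    "binge_watch": ( 0.02, -0.05, -0.02,  0.06, -0.03),
}

METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def apply_base_effects(state: dict, action: str) -> dict:
    """Add the base deltas to each meter and clamp to [0, 1]."""
    deltas = dict(zip(METERS, ACTION_EFFECTS[action]))
    return {m: min(1.0, max(0.0, state[m] + deltas[m])) for m in METERS}
```

In the full pipeline these base deltas are only a starting point; dampening, time-of-day multipliers, and profile modifiers all rescale them before clamping.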

Hidden Personality Profiles

The person's identity, hidden from the agent. It controls both the reward weights and how actions affect the meters; the agent must infer the active profile from reward patterns across episodes.

Profile 1 β€” introvert_morning

Reward weights: Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

  • Social vitality drain ×3.0 – socialising is exhausting, not neutral
  • Morning (slot 0): cognition and progress gains ×2.0 – peak productivity window
  • Solo time (me_time, meditate): serenity +0.10 bonus – recharges alone
  • Binge watch triggers shame spiral: serenity −0.15, cognition −0.06
  • Connection passive decay: −0.01/step

Agent discovers: Mornings are sacred; social activities are costly; alone time heals.


Profile 2 β€” extrovert_night_owl

Reward weights: Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%

Hidden modifiers:

  • Social vitality drain ×0.2 – socialising energises, barely drains
  • Morning (slot 0): cognition and progress gains ×0.4 penalty – groggy zone
  • Evening/Night (slots 2–3): cognition and progress gains ×1.8 – peak zone
  • Social actions: connection ×2.0 (double connection gain)
  • Social actions: serenity +0.06 bonus – people lift mood
  • Connection passive decay: −0.01/step

Agent discovers: Avoid cognitive work in the morning; socialise to charge up; deep work in evening.


Profile 3 β€” workaholic_stoic

Reward weights: Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:

  • Productive work (deep_work, learn, admin_work): vitality +0.06 recovery – energised by output
  • Productive work: serenity +0.10 bonus – meaning comes from progress
  • Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 – idle guilt
  • Extra vitality passive decay: −0.04/step – burnout risk
  • Random event negative impact ×0.5 – stoic resilience
  • Connection passive decay: −0.02/step – faster relational drift

Agent discovers: Keep working; rest only when vitality is critical; accept the faster loss of connection as the price.
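Internally, each profile can be pictured as a config dict like the sketch below. The key names are assumptions, not the environment's actual schema; the values for workaholic_stoic are taken from the section above:

```python
# Hypothetical profile config (key names invented for illustration;
# values copied from the workaholic_stoic description).
WORKAHOLIC_STOIC = {
    "reward_weights": {"progress": 0.70, "serenity": 0.10,
                       "connection": 0.10, "vitality": 0.05, "cognition": 0.05},
    "productive_vitality_recovery": 0.06,  # deep_work / learn / admin_work
    "productive_serenity_bonus": 0.10,     # meaning comes from progress
    "idle_serenity_penalty": -0.10,        # me_time / binge_watch / optional sleep
    "extra_vitality_decay": -0.04,         # per step, burnout risk
    "event_impact_multiplier": 0.5,        # halves negative event effects
    "connection_decay": -0.02,             # per step, replaces global -0.01
}
```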


Time-of-Day Multipliers

Applied to all non-sleep actions based on current slot.

| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 – Morning | ×1.2 | ×0.8 |
| 1 – Afternoon | ×1.0 | ×1.0 |
| 2 – Evening | ×0.8 | ×1.1 |
| 3 – Night | ×0.6 | ×1.3 |

These are global. Profile-specific time bonuses (e.g. HV1's morning ×2.0, HV2's evening/night ×1.8) layer on top.
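A sketch of applying these multipliers, assuming (from the column headers) that they scale only cognition *gains* and vitality *drains*, and that sleep is exempt:

```python
# Slot -> (cognition-gain multiplier, vitality-drain multiplier),
# copied from the table above.
TOD = {0: (1.2, 0.8), 1: (1.0, 1.0), 2: (0.8, 1.1), 3: (0.6, 1.3)}

def apply_time_of_day(deltas: dict, slot: int, action: str) -> dict:
    """Scale positive cognition deltas and negative vitality deltas by slot."""
    if action == "sleep":            # table applies to non-sleep actions only
        return dict(deltas)
    cog_mult, vit_mult = TOD[slot]
    out = dict(deltas)
    if out["cognition"] > 0:         # "gain" multiplier: gains only (assumed)
        out["cognition"] *= cog_mult
    if out["vitality"] < 0:          # "drain" multiplier: drains only (assumed)
        out["vitality"] *= vit_mult
    return out
```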


Passive Decays (every step, before action effects)

| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces the global −0.01) |
| workaholic_stoic | Vitality | −0.04/step |

Random Events

Roll probability: 8% per step.

| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |

Negative event effects are scaled by each profile's event_impact_multiplier (workaholic_stoic = 0.5; others = 1.0 or 0.8). Positive effects are unaffected.
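A sketch of event application under that rule, scaling only negative components by the profile's multiplier (helper names are illustrative; two events shown):

```python
# Two events from the table above; per-meter deltas copied verbatim.
EVENTS = {
    "illness":   {"vitality": -0.20, "cognition": -0.10, "serenity": -0.05},
    "good_news": {"vitality":  0.05, "cognition":  0.03,
                  "serenity":  0.10, "connection": 0.05},
}

def apply_event(state: dict, event: str, impact_mult: float) -> dict:
    """Apply an event's deltas; only negative deltas are scaled by the profile."""
    out = dict(state)
    for meter, delta in EVENTS[event].items():
        scaled = delta * impact_mult if delta < 0 else delta
        out[meter] = min(1.0, max(0.0, out[meter] + scaled))
    return out
```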


Reward Computation

Per-step reward

reward = sum(meter_delta × profile_weight for each meter) × 15.0

Profile reward weights are hidden. Same action, different profile → very different reward.

Example β€” DEEP_WORK, step 1, same initial state:

workaholic_stoic:    +1.57  (progress weight = 70%)
introvert_morning:   +0.32  (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: −0.39  (connection weight = 75%; deep work gives 0 connection)
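The weighting mechanism can be sketched on base deltas alone. This skips the modifier pipeline, so the numbers will not match the worked example above exactly; it only shows how the hidden weights pull the same action's reward apart:

```python
# Weighted-sum reward on *base* deltas only (no modifiers), so values
# differ from the full environment's worked example.
REWARD_SCALE = 15.0

def step_reward(deltas: dict, weights: dict) -> float:
    return sum(deltas[m] * weights[m] for m in deltas) * REWARD_SCALE

deep_work = {"vitality": -0.12, "cognition": -0.10,
             "progress": 0.18, "serenity": -0.05, "connection": 0.00}
# Reward weights from the profile sections above.
stoic     = {"vitality": 0.05, "cognition": 0.05,
             "progress": 0.70, "serenity": 0.10, "connection": 0.10}
night_owl = {"vitality": 0.05, "cognition": 0.05,
             "progress": 0.10, "serenity": 0.05, "connection": 0.75}
```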

Modifiers applied during step (in order)

  1. Roll and apply random event (if any)
  2. Get base action effects (ACTION_EFFECTS matrix)
  3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
  4. Apply time-of-day multipliers (cognition gain, vitality drain)
  5. Apply profile-specific modifiers (HV1/HV2/HV3)
  6. Apply global vitality factor (0.5 + 0.5 × vitality) – low vitality reduces positive effects
  7. Apply passive decays (connection, workaholic vitality)
  8. Clamp all meters to [0.0, 1.0]
  9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
  10. Apply critical floor penalty: any meter < 0.10 → −0.30
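Steps (3) and (6) can be sketched as small helpers. The dampening schedule is an assumed reading of "25% / 50% / 75% effect reduction": the 2nd, 3rd, and 4th-plus consecutive use of the same action keeps 75%, 50%, and 25% of its effect respectively:

```python
def dampening_factor(history: list, action: str) -> float:
    """Fraction of the action's effect that survives repetition dampening."""
    streak = 0
    for prev in reversed(history):   # count the trailing run of `action`
        if prev != action:
            break
        streak += 1
    return {0: 1.0, 1: 0.75, 2: 0.50}.get(streak, 0.25)

def vitality_factor(vitality: float) -> float:
    """Step (6): low vitality scales down positive effects."""
    return 0.5 + 0.5 * vitality
```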

Final grade (returned in reward_breakdown["final_score"] when done=True)

Score in [0.0, 1.0]:

score = 0.15 × crash_free_ratio    (1 − crash_count / total_possible_crashes)
      + 0.20 × progress            (final progress meter value)
      + 0.10 × connection          (final connection meter value)
      + 0.25 × adaptation_score    (late-half mean per-step reward minus
                                    early-half mean, gated by absolute
                                    late-half quality so a "terrible-then-
                                    mediocre" exploit cannot win)
      + 0.10 × efficiency_score    (avg step reward normalised to [0, 1])
      + 0.20 × belief_accuracy     (1 − MAE between agent's last-emitted
                                    belief vector and the true profile
                                    vector; 0 if the agent never emitted a
                                    belief, as with heuristic / random baselines)
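Assuming each component is already normalised to [0, 1], the grade is a plain weighted sum (component names follow the formula above; the function is a sketch, not the grader's actual signature):

```python
def final_score(crash_free: float, progress: float, connection: float,
                adaptation: float, efficiency: float,
                belief_accuracy: float) -> float:
    """Weighted sum of the six grade components; weights total 1.0."""
    return (0.15 * crash_free + 0.20 * progress + 0.10 * connection
            + 0.25 * adaptation + 0.10 * efficiency
            + 0.20 * belief_accuracy)
```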

There are two meta-RL signals. adaptation_score is implicit: because per-step rewards are profile-weighted, it rewards getting better over time. belief_accuracy is explicit: it rewards correctly INFERRING the profile. Without the explicit term, an agent playing a heuristic "keep meters healthy" policy would score the same as one that actually does inference, since the other components don't distinguish inference from reflex.

To emit a belief, the agent calls env.record_belief([s, m, w]) once per step (typically right after parsing its own completion). The grader uses the LAST recorded belief.
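A sketch of the belief_accuracy computation. The exact shape of the true profile vector is not specified here; a 3-element vector aligned with the [s, m, w] belief is assumed:

```python
def belief_accuracy(last_belief, true_profile_vec) -> float:
    """1 - MAE against the true profile vector; 0 if no belief was recorded."""
    if last_belief is None:
        return 0.0
    mae = sum(abs(b - t) for b, t in zip(last_belief, true_profile_vec))
    return 1.0 - mae / len(true_profile_vec)
```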


Internal Tracking Variables

Not in the observation. Used by the environment to compute rewards and grade.

| Variable | Description |
|---|---|
| _profile | Active profile dict (hidden from the agent) |
| _rng | Seeded random instance for event rolls and profile selection |
| _crash_count | Steps where any meter fell below 0.10 |
| _total_reward | Running sum of step rewards for the efficiency score |
| _step_history | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| _step_rewards | Per-step reward list for adaptation_score in the grader |
| _timestep | Current step index (0–27) |