# RhythmEnv β€” Entity Definitions
## Episode Structure
```
1 episode = 1 week (7 days)
1 step = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots = 1 day
28 steps = 1 full week
```
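The episode arithmetic above can be sketched in a few lines (a tiny illustration; `describe_timestep` is not part of the env's API):

```python
# Sketch (not the env's API): decompose a timestep index into day and slot,
# per the structure above (4 slots per day, 28 steps per week).
SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def describe_timestep(timestep: int) -> str:
    day, slot = divmod(timestep, 4)   # 4 slots per day
    return f"{DAY_NAMES[day]} {SLOT_NAMES[slot]}"
```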
---
## Observable State
What the agent sees in every observation. No hidden information here.
| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0=Morning, 1=Afternoon, 2=Evening, 3=Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | β€” | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | β€” | Reward received this step |
| `done` | bool | β€” | True on the final step |
| `reward_breakdown` | dict | β€” | Per-meter deltas; `final_score` when done |
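An observation shaped like the table might look as follows (field names come from the table; the concrete values are invented, and `is_valid_obs` is a hypothetical helper checking the stated range invariants):

```python
# Illustrative only: an observation matching the table above.
obs = {
    "timestep": 5, "day": 1, "slot": 1,
    "vitality": 0.62, "cognition": 0.55, "progress": 0.10,
    "serenity": 0.70, "connection": 0.48,
    "active_event": None, "remaining_steps": 23,
    "reward": 1.2, "done": False, "reward_breakdown": {},
}

METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def is_valid_obs(o: dict) -> bool:
    """Check the range invariants stated in the table."""
    return (all(0.0 <= o[m] <= 1.0 for m in METERS)
            and 0 <= o["timestep"] <= 27
            and o["day"] == o["timestep"] // 4
            and o["slot"] == o["timestep"] % 4)
```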
---
## Actions
10 actions, always legal regardless of state.
| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), βˆ’Vitality, βˆ’Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), βˆ’Cognition |
### Action Effect Matrix
Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers.
| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | βˆ’0.12 | βˆ’0.10 | +0.18 | βˆ’0.05 | 0.00 |
| admin_work | βˆ’0.06 | βˆ’0.05 | +0.08 | βˆ’0.03 | 0.00 |
| learn | βˆ’0.08 | βˆ’0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | βˆ’0.04 | βˆ’0.02 | 0.00 | +0.06 | +0.15 |
| socialize | βˆ’0.06 | βˆ’0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | βˆ’0.02 |
| binge_watch | +0.02 | βˆ’0.05 | βˆ’0.02 | +0.06 | βˆ’0.03 |
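The matrix translates directly into a lookup table. A minimal sketch (values copied verbatim from the table; `apply_base_effects` is a hypothetical helper, shown here only to illustrate the base-delta-plus-clamp step):

```python
# The base-effect matrix as a lookup table. Each tuple is
# (vitality, cognition, progress, serenity, connection).
ACTION_EFFECTS = {
    "deep_work":   (-0.12, -0.10,  0.18, -0.05,  0.00),
    "admin_work":  (-0.06, -0.05,  0.08, -0.03,  0.00),
    "learn":       (-0.08, -0.08,  0.12,  0.02,  0.00),
    "sleep":       ( 0.20,  0.10,  0.00,  0.05,  0.00),
    "exercise":    ( 0.12,  0.05,  0.00,  0.08,  0.00),
    "meditate":    ( 0.03,  0.08,  0.00,  0.15,  0.00),
    "family_time": (-0.04, -0.02,  0.00,  0.06,  0.15),
    "socialize":   (-0.06, -0.03,  0.00,  0.04,  0.12),
    "me_time":     ( 0.05,  0.03,  0.00,  0.10, -0.02),
    "binge_watch": ( 0.02, -0.05, -0.02,  0.06, -0.03),
}
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def apply_base_effects(meters: dict, action: str) -> dict:
    """Add the base deltas for `action`, clamping each meter to [0, 1]."""
    deltas = dict(zip(METERS, ACTION_EFFECTS[action]))
    return {m: min(1.0, max(0.0, v + deltas[m])) for m, v in meters.items()}
```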
---
## Hidden Personality Profiles
The person's identity. **Hidden from the agent.** Controls both the reward weights and how
actions affect meters. The agent must infer the active profile from reward patterns across episodes.
### Profile 1 β€” `introvert_morning`
**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%
Hidden modifiers:
- Social vitality drain Γ—3.0 β€” socialising is exhausting, not neutral
- Morning (slot 0): cognition and progress gains Γ—2.0 β€” peak productivity window
- Solo time (me_time, meditate): serenity +0.10 bonus β€” recharges alone
- Binge watch triggers shame spiral: serenity βˆ’0.15, cognition βˆ’0.06
- Connection passive decay: βˆ’0.01/step
**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.
---
### Profile 2 β€” `extrovert_night_owl`
**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%
Hidden modifiers:
- Social vitality drain Γ—0.2 β€” socialising energises, barely drains
- Morning (slot 0): cognition and progress gains Γ—0.4 β€” groggy zone
- Evening/Night (slots 2–3): cognition and progress gains Γ—1.8 β€” peak zone
- Social actions: connection Γ—2.0 (double connection gain)
- Social actions: serenity +0.06 bonus β€” people lift mood
- Connection passive decay: βˆ’0.01/step
**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; deep work in evening.
---
### Profile 3 β€” `workaholic_stoic`
**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%
Hidden modifiers:
- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery β€” energised by output
- Productive work: serenity +0.10 bonus β€” meaning comes from progress
- Idle actions (me_time, binge_watch, sleep when optional): serenity βˆ’0.10 β€” idle guilt
- Extra vitality passive decay: βˆ’0.04/step β€” burnout risk
- Random event negative impact Γ—0.5 β€” stoic resilience
- Connection passive decay: βˆ’0.02/step β€” faster relational drift
**Agent discovers:** Keep working; rest only when vitality is critical; connection erodes fast unless deliberately maintained.
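To make the "layering" concrete, here is a sketch of a single hidden modifier from Profile 1 applied on top of the base matrix (the helper and constant names are hypothetical, not the env's internals):

```python
# Sketch: introvert_morning triples the vitality drain of social actions.
# Helper/constant names are hypothetical; the x3.0 factor is from the doc.
SOCIAL_ACTIONS = {"socialize", "family_time"}
SOCIAL_DRAIN_MULT = 3.0

def introvert_vitality_delta(action: str, base_delta: float) -> float:
    """Vitality delta after the introvert's social-drain modifier."""
    if action in SOCIAL_ACTIONS and base_delta < 0:
        return base_delta * SOCIAL_DRAIN_MULT
    return base_delta
```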
---
## Time-of-Day Multipliers
Applied to all non-sleep actions based on the current slot.
| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 β€” Morning | Γ—1.2 | Γ—0.8 |
| 1 β€” Afternoon | Γ—1.0 | Γ—1.0 |
| 2 β€” Evening | Γ—0.8 | Γ—1.1 |
| 3 β€” Night | Γ—0.6 | Γ—1.3 |
These are global. Profile-specific time bonuses (HV1 β€” e.g. `introvert_morning`'s morning Γ—2.0) layer on top.
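The table can be sketched as follows, assuming (per the column headers) that the first factor scales positive cognition deltas and the second scales negative vitality deltas:

```python
# Global slot multipliers from the table: (cognition-gain, vitality-drain).
SLOT_MULT = {0: (1.2, 0.8), 1: (1.0, 1.0), 2: (0.8, 1.1), 3: (0.6, 1.3)}

def apply_slot_mult(slot: int, cog_delta: float, vit_delta: float):
    cog_gain_mult, vit_drain_mult = SLOT_MULT[slot]
    if cog_delta > 0:
        cog_delta *= cog_gain_mult     # scales cognition *gains* only
    if vit_delta < 0:
        vit_delta *= vit_drain_mult    # scales vitality *drains* only
    return cog_delta, vit_delta
```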
---
## Passive Decays (every step, before action effects)
| Profile | Meter | Decay |
|---|---|---|
| All | Connection | βˆ’0.01/step |
| workaholic_stoic | Connection | βˆ’0.02/step (replaces above) |
| workaholic_stoic | Vitality | βˆ’0.04/step |
---
## Random Events
Roll probability: 8% per step.
| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | βˆ’0.08 | βˆ’0.10 | βˆ’0.10 | βˆ’0.15 | 0.00 |
| family_emergency | βˆ’0.05 | βˆ’0.08 | 0.00 | βˆ’0.12 | βˆ’0.10 |
| illness | βˆ’0.20 | βˆ’0.10 | 0.00 | βˆ’0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |
Negative event effects are scaled by each profile's `event_impact_multiplier`
(workaholic_stoic = 0.5; others = 1.0 or 0.8).
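A sketch of the event machinery under those rules: an 8% roll per step, with the profile multiplier shrinking only the negative deltas. `EVENTS` copies the table above; the helper names and the uniform event choice are assumptions:

```python
# Hypothetical sketch of the 8%-per-step event roll and impact scaling.
import random

EVENTS = {
    "prod_crash":       {"vitality": -0.08, "cognition": -0.10, "progress": -0.10, "serenity": -0.15, "connection":  0.00},
    "family_emergency": {"vitality": -0.05, "cognition": -0.08, "progress":  0.00, "serenity": -0.12, "connection": -0.10},
    "illness":          {"vitality": -0.20, "cognition": -0.10, "progress":  0.00, "serenity": -0.05, "connection":  0.00},
    "good_news":        {"vitality":  0.05, "cognition":  0.03, "progress":  0.00, "serenity":  0.10, "connection":  0.05},
}

def scale_negative(effects: dict, impact_mult: float) -> dict:
    """Shrink only the negative deltas by the profile multiplier."""
    return {m: d * impact_mult if d < 0 else d for m, d in effects.items()}

def roll_event(rng: random.Random, impact_mult: float):
    """Return (event_name, scaled_effects), or (None, {}) 92% of the time."""
    if rng.random() >= 0.08:
        return None, {}
    name = rng.choice(sorted(EVENTS))
    return name, scale_negative(EVENTS[name], impact_mult)
```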
---
## Reward Computation
### Per-step reward
```
reward = sum(meter_delta Γ— profile_weight for each meter) Γ— 15.0
```
Profile reward weights are **hidden**. Same action, different profile β†’ very different reward.
Example β€” DEEP_WORK, step 1, same initial state:
```
workaholic_stoic: +1.57 (progress weight = 70%)
introvert_morning: +0.32 (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: βˆ’0.39 (connection weight = 75%; deep work gives 0 connection)
```
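The formula can be sketched with the base `deep_work` deltas and `workaholic_stoic`'s weights. Note this uses base deltas only β€” the worked numbers above also include the step modifiers, so it will not reproduce +1.57 exactly:

```python
# Sketch of the per-step reward: hidden profile weights dotted with the
# meter deltas, times REWARD_SCALE. Weights are workaholic_stoic's;
# base deep_work deltas (no modifiers) are used for illustration.
REWARD_SCALE = 15.0
WORKAHOLIC_WEIGHTS = {"progress": 0.70, "serenity": 0.10, "connection": 0.10,
                      "vitality": 0.05, "cognition": 0.05}

def step_reward(deltas: dict, weights: dict) -> float:
    return sum(deltas[m] * weights[m] for m in weights) * REWARD_SCALE

deep_work = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
             "serenity": -0.05, "connection": 0.00}
```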
### Modifiers applied during step (in order)
1. Roll and apply random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3Γ— in a row β†’ 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply global vitality factor (`0.5 + 0.5 Γ— vitality`) β€” low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas Γ— REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 β†’ βˆ’0.30
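The numeric tail of that pipeline (steps 6–10) can be sketched as runnable code. Steps 1–5 are assumed to have already produced `deltas`; computing the reward from post-clamp deltas is an assumption, since the list does not say whether clamping happens before the weighted sum:

```python
# Simplified, hypothetical sketch of steps 6-10.
REWARD_SCALE = 15.0
CRITICAL_FLOOR, FLOOR_PENALTY = 0.10, 0.30

def finish_step(meters: dict, deltas: dict, weights: dict):
    factor = 0.5 + 0.5 * meters["vitality"]                   # step 6
    scaled = {m: (d * factor if d > 0 else d) for m, d in deltas.items()}
    new = {m: min(1.0, max(0.0, meters[m] + scaled[m])) for m in meters}  # step 8
    reward = sum((new[m] - meters[m]) * weights[m]
                 for m in weights) * REWARD_SCALE             # step 9
    if any(v < CRITICAL_FLOOR for v in new.values()):         # step 10
        reward -= FLOOR_PENALTY
    return new, reward
```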
### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)
Score in [0.0, 1.0]:
```
score = 0.15 Γ— crash_free_ratio (1 βˆ’ crash_count / total_possible_crashes)
+ 0.20 Γ— progress (final progress meter value)
+ 0.10 Γ— connection (final connection meter value)
+ 0.25 Γ— adaptation_score (late-half mean per-step reward minus
early-half mean β€” gated by absolute
late-half quality so a "terrible-then-
mediocre" exploit cannot win)
+ 0.10 Γ— efficiency_score (avg step reward normalised to [0, 1])
+ 0.20 Γ— belief_accuracy (1 βˆ’ MAE between agent's last-emitted
belief vector and the true profile
vector; 0 if the agent never emitted a
belief β€” heuristic / random baselines)
```
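As a weighted sum this is straightforward; a minimal sketch, assuming each component has already been normalised to [0, 1] as described:

```python
# Sketch of the final grade: weights copied from the formula above.
GRADE_WEIGHTS = {
    "crash_free_ratio": 0.15, "progress": 0.20, "connection": 0.10,
    "adaptation_score": 0.25, "efficiency_score": 0.10, "belief_accuracy": 0.20,
}

def final_score(components: dict) -> float:
    """Weighted sum of pre-normalised [0, 1] components."""
    return sum(GRADE_WEIGHTS[k] * components[k] for k in GRADE_WEIGHTS)
```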
Two meta-RL signals: `adaptation_score` is implicit (rewards getting better
over time, since per-step rewards are profile-weighted), and `belief_accuracy`
is explicit (rewards INFERRING the profile correctly). Without the explicit
term, agents that play heuristic-style "keep meters healthy" score the same
as agents that actually do inference, since the other components don't
differentiate inference from reflex.
To emit a belief, the agent calls `env.record_belief([s, m, w])` once per
step (typically right after parsing its own completion). The grader uses the
LAST recorded belief.
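A plausible sketch of the `belief_accuracy` term, reading it as 1 minus the MAE between the last recorded belief and the true profile vector. Representing the true profile as a one-hot over the three profiles is an assumption:

```python
# Hypothetical sketch of the belief_accuracy grader term.
def belief_accuracy(last_belief, true_vector):
    if last_belief is None:          # agent never emitted a belief -> 0
        return 0.0
    mae = sum(abs(b - t) for b, t in zip(last_belief, true_vector)) / len(true_vector)
    return 1.0 - mae
```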
---
## Internal Tracking Variables
Not in the observation. Used by the environment to compute rewards and grade.
| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Steps where any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
| `_step_rewards` | Per-step reward list for adaptation_score in the grader |
| `_timestep` | Current step index (0–27) |