
# Environment Design: RhythmEnv Life Simulator

## What It Is

A Life Simulator: a holistic resource-management RL environment where an agent learns a specific person's hidden patterns through experience, not configuration.

The core problem: personal AI assistants give generic advice because they don't know who you are. RhythmEnv is the training ground for an agent that must discover hidden personality dynamics through reward signals alone, the same way a good personal assistant adapts over their first weeks on the job.


## Why Abstract Activities, Not Tasks

An earlier design used a workday task scheduler (energy/stress meters, task queues with deadlines). We moved away from it because, with real tasks, an agent can score well just by being a good scheduler; it never needs to infer anything about the person. Abstract life activities (DEEP_WORK, MEDITATE, SOCIALIZE, ...) force the inference problem: the only way to do well is to figure out who you're helping, because the same action has wildly different value depending on the hidden profile.

| | Workday Scheduler | Life Simulator |
|---|---|---|
| Episode | 1 day, 20 steps | 1 week, 28 steps |
| State | Energy, stress, task queue | 5 life meters |
| Actions | 4 (task management) | 10 (life activities) |
| Learning signal | How to sequence tasks | Which actions serve this specific person |

The Life Simulator creates a non-promptable discovery problem: the agent cannot know the person's profile from the prompt; it must be inferred from reward patterns. This is structurally different from a task that better prompting could solve.


## The Three Discovery Layers

### Layer 1: Reward Weights (What Matters to This Person)

Same action, same starting state → different rewards per profile:

```
DEEP_WORK, step 1, all meters at 0.7:
  workaholic_stoic:    +1.57   (progress weight 70%: work = meaning)
  introvert_morning:   +0.32   (serenity weight 60%: mild net gain)
  extrovert_night_owl: -0.39   (connection weight 75%: work gives 0 connection)
```

An agent that doesn't adapt plateaus at ~0.60; one that discovers what the profile rewards pushes above 0.80.
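
A minimal sketch of how profile-weighted scoring of this kind works. Only the three headline weights (70%, 60%, 75%) come from the example above; every other number and name is illustrative, and the real logic lives in `training/reward_functions.py`:

```python
# Hypothetical weights: only the 0.70 / 0.60 / 0.75 values are quoted from
# the example above; the rest are filled in to complete the sketch.
PROFILE_WEIGHTS = {
    "workaholic_stoic":    {"progress": 0.70, "serenity": 0.10, "connection": 0.05, "vitality": 0.15},
    "introvert_morning":   {"progress": 0.20, "serenity": 0.60, "connection": 0.05, "vitality": 0.15},
    "extrovert_night_owl": {"progress": 0.10, "serenity": 0.10, "connection": 0.75, "vitality": 0.05},
}

def profile_reward(profile: str, meter_deltas: dict[str, float]) -> float:
    """Weighted sum of meter changes: identical deltas score differently
    depending on which hidden profile is active."""
    weights = PROFILE_WEIGHTS[profile]
    return sum(weights[m] * delta for m, delta in meter_deltas.items())

# DEEP_WORK raises progress but gives zero connection, so the extrovert
# profile can end up net negative on the same transition:
deltas = {"progress": 0.25, "serenity": -0.05, "connection": -0.10, "vitality": -0.05}
for profile in PROFILE_WEIGHTS:
    print(profile, round(profile_reward(profile, deltas), 3))
```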

### Layer 2: Action Modifiers (How Actions Affect This Person)

The base effect matrix is modified invisibly per profile:

| Profile | Hidden modifier | Observable signal |
|---|---|---|
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× faster than expected |
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in the morning gives 40% of expected progress |
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining it |
| workaholic_stoic | Idle drains serenity −0.10 | ME_TIME/BINGE_WATCH lower serenity |
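
A sketch of how such hidden modifiers could be applied. All structure and numbers below are hypothetical except the multipliers quoted from the table; the real effect matrix lives in `server/rhythm_environment.py`:

```python
BASE_EFFECTS = {
    # action -> meter deltas the agent would naively expect
    "SOCIALIZE": {"connection": 0.20, "vitality": -0.05},
    "DEEP_WORK": {"progress": 0.20, "vitality": -0.04},
}

HIDDEN_MODIFIERS = {
    # profile -> list of (action, meter, multiplier, slot or None for any slot)
    "introvert_morning": [
        ("SOCIALIZE", "vitality", 3.0, None),    # social drain ×3.0
        ("DEEP_WORK", "progress", 2.0, 0),       # morning deep work ×2.0
    ],
    "extrovert_night_owl": [
        ("DEEP_WORK", "progress", 0.4, 0),       # morning penalty ×0.4
        ("SOCIALIZE", "connection", 2.0, None),  # social connection ×2.0
    ],
}

def apply_modifiers(profile: str, action: str, slot: int) -> dict[str, float]:
    """Deltas the agent actually experiences, after the profile's
    hidden multipliers are silently applied."""
    deltas = dict(BASE_EFFECTS[action])
    for act, meter, mult, when in HIDDEN_MODIFIERS.get(profile, []):
        if act == action and (when is None or when == slot):
            deltas[meter] *= mult
    return deltas

# Same action, same time slot, different hidden profile:
print(apply_modifiers("introvert_morning", "SOCIALIZE", slot=1))   # vitality drain ×3
print(apply_modifiers("extrovert_night_owl", "SOCIALIZE", slot=1)) # connection gain ×2
```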

### Layer 3: Stress Spiral

When serenity drops below the profile's tolerance threshold, all negative effects amplify ×1.3. Wrong actions → serenity drops → worse outcomes → harder recovery. The agent must learn to protect serenity proactively.
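
In code, the spiral reduces to a single conditional multiplier. This sketch assumes per-meter deltas as signed floats; the names are hypothetical:

```python
SPIRAL_MULTIPLIER = 1.3  # all negative effects amplify ×1.3 (from the design above)

def apply_stress_spiral(deltas: dict[str, float], serenity: float,
                        tolerance: float) -> dict[str, float]:
    """Below the profile's hidden tolerance, every negative meter delta
    gets 30% worse, which is what makes recovery progressively harder."""
    if serenity >= tolerance:
        return deltas
    return {meter: d * SPIRAL_MULTIPLIER if d < 0 else d
            for meter, d in deltas.items()}
```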


## Observable vs Hidden

| Observable (agent sees every step) | Hidden (must infer from reward patterns) |
|---|---|
| All 5 meter values (0.0–1.0) | Which profile is active |
| Day of week (0–6) | Profile reward weights |
| Time slot (0–3) | Per-action modifiers |
| Active random event name | Stress tolerance threshold |
| Remaining steps | Connection decay rate |
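
The split can be pictured as two structures: one the agent receives every step, one it never sees. Field names below follow this table and the usage snippet at the end of this doc, not the actual `models.py` (the doc names four of the five meters explicitly):

```python
from dataclasses import dataclass

@dataclass
class Observation:             # returned to the agent every step
    vitality: float            # meters are 0.0-1.0; a fifth meter exists
    serenity: float            # but is not named in this doc
    progress: float
    connection: float
    day_of_week: int           # 0-6
    time_slot: int             # 0-3
    active_event: str | None   # name of the active random event, if any
    remaining_steps: int

@dataclass
class HiddenProfile:           # never observed; inferred from reward alone
    reward_weights: dict[str, float]   # Layer 1
    action_modifiers: dict             # Layer 2 multipliers
    stress_tolerance: float            # Layer 3 spiral threshold
    connection_decay: float            # how fast connection bleeds away
```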

## Training Story

```
Random baseline       → final_score ≈ 0.60–0.70
  No pattern. Misses timing windows. Doesn't protect serenity floor.

Heuristic baseline    → final_score ≈ 0.75–0.82
  Follows observable rules only. Cannot differentiate profiles.
  Treats everyone the same.

GRPO-trained agent    → target: final_score > 0.82 on 2+ profiles
  Discovers timing bonuses per profile.
  Adapts action mix to the person's hidden reward structure.
  Introvert's week looks different from workaholic's week.
```
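
The heuristic baseline's "observable rules only" behaviour might look like the sketch below; the thresholds and action choices are illustrative, and the real baseline lives in `training/inference_eval.py`:

```python
# A profile-blind policy: it reacts to observable meters with fixed
# thresholds and therefore plays every profile identically.
def heuristic_policy(obs) -> str:
    if obs.serenity < 0.3:     # guard the serenity floor first
        return "MEDITATE"
    if obs.connection < 0.3:   # top up connection when it runs low
        return "SOCIALIZE"
    if obs.vitality < 0.3:     # recover vitality
        return "ME_TIME"
    return "DEEP_WORK"         # otherwise, default to making progress
```

Because nothing here conditions on the profile, this kind of policy caps out where the table above suggests: it cannot learn that SOCIALIZE drains an introvert three times faster, or that morning DEEP_WORK is mostly wasted on a night owl.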

## Anti-Reward-Hacking Measures

| Safeguard | Mechanism |
|---|---|
| Three-layer reward | Format + legality + real env replay; all three must pass |
| Repetition dampening | Same action 3× in a row → 25%/50%/75% effect reduction |
| Critical floor penalty | Any meter < 0.10 → −0.30 per step |
| Random events | 8%/step probability prevents overfitting to deterministic trajectories |
| Seed-based replay | `env_reward` reconstructs the exact episode state; reward cannot be fabricated |
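
Two of these safeguards are simple enough to sketch directly. The reduction schedule below is one reading of "25%/50%/75%"; the exact implementation lives in the environment:

```python
def repetition_factor(consecutive_uses: int) -> float:
    """Effect multiplier for the Nth consecutive use of the same action:
    the 3rd/4th/5th repeat keeps 75%/50%/25% of its effect."""
    if consecutive_uses < 3:
        return 1.0
    return max(0.25, 1.0 - 0.25 * (consecutive_uses - 2))

def critical_floor_penalty(meters: dict[str, float]) -> float:
    """Flat -0.30 per step whenever any meter sits below 0.10."""
    return -0.30 if any(v < 0.10 for v in meters.values()) else 0.0
```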

## Hackathon Theme Alignment

### Primary: Theme 3.2 - World Modeling: Personalized Tasks

The environment models real personal-assistant behaviour. The hidden profile represents real individual differences: what a person values and how activities physically affect them. Discovery through reward is how a good personal assistant adapts over their first weeks on the job.

### Secondary: Theme 2 - Long-Horizon Planning

28 steps with delayed, compounding consequences. A neglected connection meter decays slowly, but recovery gets harder with each step. The serenity spiral is triggered by accumulated bad decisions, not a single action.
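
One way such compounding could be modeled, purely as illustration (the actual dynamics are in `server/rhythm_environment.py`):

```python
def connection_step(level: float, socialized: bool,
                    decay: float = 0.02, gain: float = 0.15) -> float:
    """Hypothetical update: decay is a small constant, but recovery is
    proportional to the current level, so the longer connection is
    neglected, the less each SOCIALIZE gives back."""
    level = level + gain * level if socialized else level - decay
    return max(0.0, min(1.0, level))
```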


## Implementation Reference

| Component | File |
|---|---|
| Environment | `server/rhythm_environment.py` |
| Data models | `models.py` |
| Dataset generator | `training/dataset.py` |
| Reward functions | `training/reward_functions.py` |
| Baseline evaluation | `training/inference_eval.py` |
| Training notebook | `training/RhythmEnv_GRPO_Training.ipynb` |
| Gradio UI | `ui/app.py` |
| FastAPI server | `server/app.py` |
Minimal usage:

```python
# Import paths assumed from the file table above.
from server.rhythm_environment import RhythmEnvironment
from models import RhythmAction, ActionType

env = RhythmEnvironment()
obs = env.reset(seed=42, profile="introvert_morning")  # profile optional
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
```