
# Environment Design: RhythmEnv Life Simulator

## What It Is

A Life Simulator: a holistic resource-management RL environment where an agent learns a specific person's hidden patterns through experience, not configuration.

The core problem: personal AI assistants give generic advice because they don't know who you are. RhythmEnv is the training ground for an agent that must discover hidden personality dynamics through reward signals alone, the same way a good personal assistant adapts over their first weeks on the job.


## Why Abstract Activities, Not Tasks

An earlier design used a workday task scheduler (energy/stress meters, task queues with deadlines). We moved away from it because, with real tasks, an agent can score well just by being a good scheduler; it never needs to infer anything about the person. Abstract life activities (DEEP_WORK, MEDITATE, SOCIALIZE, ...) force the inference problem: the only way to do well is to figure out who you're helping, because the same action has wildly different value depending on the hidden profile.

| | Workday Scheduler | Life Simulator |
|---|---|---|
| Episode | 1 day, 20 steps | 1 week, 28 steps |
| State | Energy, stress, task queue | 5 life meters |
| Actions | 4 (task management) | 10 (life activities) |
| Learning signal | How to sequence tasks | Which actions serve this specific person |

The Life Simulator creates a non-promptable discovery problem: the agent cannot know the person's profile from the prompt; it must be inferred from reward patterns. This is structurally different from a task that better prompting could solve.


## The Three Discovery Layers

### Layer 1: Reward Weights (What Matters to This Person)

Same action, same starting state → different rewards per profile:

```
DEEP_WORK, step 1, all meters at 0.7:
  workaholic_stoic:    +1.57   (progress weight 70%: work = meaning)
  introvert_morning:   +0.32   (serenity weight 60%: mild net gain)
  extrovert_night_owl: -0.39   (connection weight 75%: work gives 0 connection)
```

An agent that doesn't adapt plateaus at ~0.60; one that discovers what the profile rewards pushes above 0.80.
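
A minimal sketch of how profile-weighted scoring of this kind works. Only the three headline weights (70%, 60%, 75%) come from the example above; every other number and name is illustrative, and the real logic lives in `training/reward_functions.py`:

```python
# Hypothetical weights: only the 0.70 / 0.60 / 0.75 values are quoted from
# the example above; the rest are filled in to complete the sketch.
PROFILE_WEIGHTS = {
    "workaholic_stoic":    {"progress": 0.70, "serenity": 0.10, "connection": 0.05, "vitality": 0.15},
    "introvert_morning":   {"progress": 0.20, "serenity": 0.60, "connection": 0.05, "vitality": 0.15},
    "extrovert_night_owl": {"progress": 0.10, "serenity": 0.10, "connection": 0.75, "vitality": 0.05},
}

def profile_reward(profile: str, meter_deltas: dict[str, float]) -> float:
    """Weighted sum of meter changes: identical deltas score differently
    depending on which hidden profile is active."""
    weights = PROFILE_WEIGHTS[profile]
    return sum(weights[m] * delta for m, delta in meter_deltas.items())

# DEEP_WORK raises progress but gives zero connection, so the extrovert
# profile can end up net negative on the same transition:
deltas = {"progress": 0.25, "serenity": -0.05, "connection": -0.10, "vitality": -0.05}
for profile in PROFILE_WEIGHTS:
    print(profile, round(profile_reward(profile, deltas), 3))
```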

### Layer 2: Action Modifiers (How Actions Affect This Person)

The base effect matrix is modified invisibly per profile:

| Profile | Hidden modifier | Observable signal |
|---|---|---|
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× faster than expected |
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in the morning gives 40% of expected progress |
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining it |
| workaholic_stoic | Idle drains serenity −0.10 | ME_TIME/BINGE_WATCH lower serenity |
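
A sketch of how such hidden modifiers could be applied. All structure and numbers below are hypothetical except the multipliers quoted from the table; the real effect matrix lives in `server/rhythm_environment.py`:

```python
BASE_EFFECTS = {
    # action -> meter deltas the agent would naively expect
    "SOCIALIZE": {"connection": 0.20, "vitality": -0.05},
    "DEEP_WORK": {"progress": 0.20, "vitality": -0.04},
}

HIDDEN_MODIFIERS = {
    # profile -> list of (action, meter, multiplier, slot or None for any slot)
    "introvert_morning": [
        ("SOCIALIZE", "vitality", 3.0, None),    # social drain ×3.0
        ("DEEP_WORK", "progress", 2.0, 0),       # morning deep work ×2.0
    ],
    "extrovert_night_owl": [
        ("DEEP_WORK", "progress", 0.4, 0),       # morning penalty ×0.4
        ("SOCIALIZE", "connection", 2.0, None),  # social connection ×2.0
    ],
}

def apply_modifiers(profile: str, action: str, slot: int) -> dict[str, float]:
    """Deltas the agent actually experiences, after the profile's
    hidden multipliers are silently applied."""
    deltas = dict(BASE_EFFECTS[action])
    for act, meter, mult, when in HIDDEN_MODIFIERS.get(profile, []):
        if act == action and (when is None or when == slot):
            deltas[meter] *= mult
    return deltas

# Same action, same time slot, different hidden profile:
print(apply_modifiers("introvert_morning", "SOCIALIZE", slot=1))   # vitality drain ×3
print(apply_modifiers("extrovert_night_owl", "SOCIALIZE", slot=1)) # connection gain ×2
```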

### Layer 3: Stress Spiral

When serenity drops below the profile's tolerance threshold, all negative effects amplify ×1.3. Wrong actions → serenity drops → worse outcomes → harder recovery. The agent must learn to protect serenity proactively.
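
In code, the spiral reduces to a single conditional multiplier. This sketch assumes per-meter deltas as signed floats; the names are hypothetical:

```python
SPIRAL_MULTIPLIER = 1.3  # all negative effects amplify ×1.3 (from the design above)

def apply_stress_spiral(deltas: dict[str, float], serenity: float,
                        tolerance: float) -> dict[str, float]:
    """Below the profile's hidden tolerance, every negative meter delta
    gets 30% worse, which is what makes recovery progressively harder."""
    if serenity >= tolerance:
        return deltas
    return {meter: d * SPIRAL_MULTIPLIER if d < 0 else d
            for meter, d in deltas.items()}
```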


## Observable vs Hidden

| Observable (agent sees every step) | Hidden (must infer from reward patterns) |
|---|---|
| All 5 meter values (0.0–1.0) | Which profile is active |
| Day of week (0–6) | Profile reward weights |
| Time slot (0–3) | Per-action modifiers |
| Active random event name | Stress tolerance threshold |
| Remaining steps | Connection decay rate |
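
The split can be pictured as two structures: one the agent receives every step, one it never sees. Field names below follow this table and the usage snippet at the end of this doc, not the actual `models.py` (the doc names four of the five meters explicitly):

```python
from dataclasses import dataclass

@dataclass
class Observation:             # returned to the agent every step
    vitality: float            # meters are 0.0-1.0; a fifth meter exists
    serenity: float            # but is not named in this doc
    progress: float
    connection: float
    day_of_week: int           # 0-6
    time_slot: int             # 0-3
    active_event: str | None   # name of the active random event, if any
    remaining_steps: int

@dataclass
class HiddenProfile:           # never observed; inferred from reward alone
    reward_weights: dict[str, float]   # Layer 1
    action_modifiers: dict             # Layer 2 multipliers
    stress_tolerance: float            # Layer 3 spiral threshold
    connection_decay: float            # how fast connection bleeds away
```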

## Training Story

```
Random baseline       → final_score ≈ 0.60–0.70
  No pattern. Misses timing windows. Doesn't protect serenity floor.

Heuristic baseline    → final_score ≈ 0.75–0.82
  Follows observable rules only. Cannot differentiate profiles.
  Treats everyone the same.

GRPO-trained agent    → target: final_score > 0.82 on 2+ profiles
  Discovers timing bonuses per profile.
  Adapts action mix to the person's hidden reward structure.
  Introvert's week looks different from workaholic's week.
```
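
The heuristic baseline's "observable rules only" behaviour might look like the sketch below; the thresholds and action choices are illustrative, and the real baseline lives in `training/inference_eval.py`:

```python
# A profile-blind policy: it reacts to observable meters with fixed
# thresholds and therefore plays every profile identically.
def heuristic_policy(obs) -> str:
    if obs.serenity < 0.3:     # guard the serenity floor first
        return "MEDITATE"
    if obs.connection < 0.3:   # top up connection when it runs low
        return "SOCIALIZE"
    if obs.vitality < 0.3:     # recover vitality
        return "ME_TIME"
    return "DEEP_WORK"         # otherwise, default to making progress
```

Because nothing here conditions on the profile, this kind of policy caps out where the table above suggests: it cannot learn that SOCIALIZE drains an introvert three times faster, or that morning DEEP_WORK is mostly wasted on a night owl.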

## Anti-Reward-Hacking Measures

| Safeguard | Mechanism |
|---|---|
| Three-layer reward | Format + legality + real env replay; all three must pass |
| Repetition dampening | Same action 3× in a row → 25%/50%/75% effect reduction |
| Critical floor penalty | Any meter < 0.10 → −0.30 per step |
| Random events | 8%/step probability prevents overfitting to deterministic trajectories |
| Seed-based replay | `env_reward` reconstructs the exact episode state; reward cannot be fabricated |
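
Two of these safeguards are simple enough to sketch directly. The reduction schedule below is one reading of "25%/50%/75%"; the exact implementation lives in the environment:

```python
def repetition_factor(consecutive_uses: int) -> float:
    """Effect multiplier for the Nth consecutive use of the same action:
    the 3rd/4th/5th repeat keeps 75%/50%/25% of its effect."""
    if consecutive_uses < 3:
        return 1.0
    return max(0.25, 1.0 - 0.25 * (consecutive_uses - 2))

def critical_floor_penalty(meters: dict[str, float]) -> float:
    """Flat -0.30 per step whenever any meter sits below 0.10."""
    return -0.30 if any(v < 0.10 for v in meters.values()) else 0.0
```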

## Hackathon Theme Alignment

### Primary: Theme 3.2 - World Modeling: Personalized Tasks

The environment models real personal-assistant behaviour. The hidden profile represents real individual differences: what a person values and how activities physically affect them. Discovery through reward is how a good personal assistant adapts over their first weeks on the job.

### Secondary: Theme 2 - Long-Horizon Planning

28 steps with delayed, compounding consequences. A neglected connection meter decays slowly, but recovery gets harder with each step. The serenity spiral is triggered by accumulated bad decisions, not a single action.
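
One way such compounding could be modeled, purely as illustration (the actual dynamics are in `server/rhythm_environment.py`):

```python
def connection_step(level: float, socialized: bool,
                    decay: float = 0.02, gain: float = 0.15) -> float:
    """Hypothetical update: decay is a small constant, but recovery is
    proportional to the current level, so the longer connection is
    neglected, the less each SOCIALIZE gives back."""
    level = level + gain * level if socialized else level - decay
    return max(0.0, min(1.0, level))
```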


## Implementation Reference

| Component | File |
|---|---|
| Environment | `server/rhythm_environment.py` |
| Data models | `models.py` |
| Dataset generator | `training/dataset.py` |
| Reward functions | `training/reward_functions.py` |
| Baseline evaluation | `training/inference_eval.py` |
| Training notebook | `training/RhythmEnv_GRPO_Training.ipynb` |
| Gradio UI | `ui/app.py` |
| FastAPI server | `server/app.py` |
Minimal usage:

```python
# Import paths assumed from the file table above.
from server.rhythm_environment import RhythmEnvironment
from models import RhythmAction, ActionType

env = RhythmEnvironment()
obs = env.reset(seed=42, profile="introvert_morning")  # profile optional
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
```