# Environment Design – RhythmEnv Life Simulator
## What It Is
A Life Simulator – a holistic resource-management RL environment where an agent learns a specific person's hidden patterns through experience, not configuration.
The core problem: personal AI assistants give generic advice because they don't know who you are. RhythmEnv is the training ground for an agent that must discover hidden personality dynamics through reward signals alone – the same way a good personal assistant adapts over their first weeks on the job.
---
## Why Abstract Activities, Not Tasks
An earlier design used a workday task scheduler (energy/stress meters, task queues with deadlines). We moved away from it because, with real tasks, an agent can score well just by being a good scheduler – it never needs to infer anything about the person. Abstract life activities (DEEP_WORK, MEDITATE, SOCIALIZE...) force the inference problem: the only way to do well is to figure out *who you're helping*, because the same action has wildly different value depending on the hidden profile.
| | Workday Scheduler | Life Simulator |
|---|---|---|
| Episode | 1 day, 20 steps | 1 week, 28 steps |
| State | Energy, stress, task queue | 5 life meters |
| Actions | 4 (task management) | 10 (life activities) |
| Learning signal | How to sequence tasks | Which actions serve *this specific person* |
The Life Simulator creates a **non-promptable discovery problem**: the agent cannot know the person's profile from the prompt – it must be inferred from reward patterns. This is structurally different from a task that better prompting solves.
---
## The Three Discovery Layers
### Layer 1 – Reward Weights (What Matters to This Person)
Same action, same starting state – different rewards per profile:
```
DEEP_WORK, step 1, all meters at 0.7:
workaholic_stoic:     +1.57   (progress weight 70% → work = meaning)
introvert_morning:    +0.32   (serenity weight 60% → mild net gain)
extrovert_night_owl:  -0.39   (connection weight 75% → work gives 0 connection)
```
An agent that doesn't adapt plateaus at ~0.60. One that discovers what the profile rewards pushes above 0.80.
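The sign flip in the example above falls out of a weighted sum over meter changes. A minimal sketch, assuming hypothetical weight tables that echo the percentages quoted; the dict layout and function name are illustrative, not the actual code in `server/rhythm_environment.py`:

```python
# Hypothetical Layer-1 weighting: same deltas, profile-dependent score.
# Weights echo the percentages quoted above; all names are assumptions.
PROFILE_WEIGHTS = {
    "workaholic_stoic":    {"progress": 0.70, "serenity": 0.10, "connection": 0.20},
    "introvert_morning":   {"progress": 0.25, "serenity": 0.60, "connection": 0.15},
    "extrovert_night_owl": {"progress": 0.10, "serenity": 0.15, "connection": 0.75},
}

def layer1_reward(meter_deltas: dict, profile: str) -> float:
    """Score raw meter changes by what this hidden profile values."""
    weights = PROFILE_WEIGHTS[profile]
    return sum(w * meter_deltas.get(meter, 0.0) for meter, w in weights.items())

# DEEP_WORK: raises progress, gives zero connection, costs a little serenity.
deep_work = {"progress": 0.30, "serenity": -0.05, "connection": -0.10}
```

The same `deep_work` delta scores positive for the workaholic and negative for the extrovert, which is exactly the inference pressure Layer 1 creates.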
### Layer 2 – Action Modifiers (How Actions Affect This Person)
The base effect matrix is modified invisibly per profile:
| Profile | Hidden modifier | Observable signal |
|---|---|---|
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× faster than expected |
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in the morning gives 40% of expected progress |
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining it |
| workaholic_stoic | Idle drains serenity −0.10 | ME_TIME/BINGE_WATCH lower serenity |
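One way the timing modifiers could be keyed, as a hedged sketch: a lookup on (profile, action) that rescales the base effect per time slot. The multipliers mirror the table; the data structure and function name are assumptions, and the real matrix stays hidden from the agent:

```python
# Illustrative only: (profile, action) -> {slot: multiplier}.
# Values mirror the table above; absent entries mean "no modifier".
TIMING_MODIFIERS = {
    ("introvert_morning",   "DEEP_WORK"): {0: 2.0},              # morning bonus
    ("extrovert_night_owl", "DEEP_WORK"): {0: 0.4, 2: 1.8, 3: 1.8},
}

def modified_progress(base: float, profile: str, action: str, slot: int) -> float:
    """Scale an action's base progress gain by the hidden timing modifier."""
    mods = TIMING_MODIFIERS.get((profile, action), {})
    return base * mods.get(slot, 1.0)
```

From the outside the agent only ever sees the product, so the ×2.0 morning bonus has to be reverse-engineered from reward differences across slots.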
### Layer 3 – Stress Spiral
When serenity drops below the profile's tolerance threshold, all negative effects amplify ×1.3. Wrong actions → serenity drops → worse outcomes → harder recovery. The agent must learn to protect serenity proactively.
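The spiral reduces to a single guard on negative effects. A minimal sketch, assuming a per-profile `tolerance` argument (the threshold itself is hidden from the agent); the ×1.3 amplifier comes from the text, the function shape is an assumption:

```python
def apply_serenity_effect(serenity: float, effect: float, tolerance: float) -> float:
    """Below the hidden tolerance, negative effects amplify by 1.3x.

    Sketch only: clamps the meter to [0.0, 1.0] like the other meters.
    """
    if effect < 0 and serenity < tolerance:
        effect *= 1.3
    return max(0.0, min(1.0, serenity + effect))
```

Two identical bad actions therefore cost more once the threshold has been crossed, which is what makes recovery progressively harder.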
---
## Observable vs Hidden
| Observable (agent sees every step) | Hidden (must infer from reward patterns) |
|---|---|
| All 5 meter values (0.0–1.0) | Which profile is active |
| Day of week (0–6) | Profile reward weights |
| Time slot (0–3) | Per-action modifiers |
| Active random event name | Stress tolerance threshold |
| Remaining steps | Connection decay rate |
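The observable column maps naturally onto a small state record. A sketch under the assumption that the fields mirror the table (the real model lives in `models.py`); everything hidden – profile, weights, modifiers, thresholds – is deliberately absent:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentObservation:
    """Hypothetical agent-visible state; field names are assumptions."""
    meters: dict = field(default_factory=dict)  # 5 life meters, each 0.0-1.0
    day_of_week: int = 0                        # 0-6
    time_slot: int = 0                          # 0-3
    active_event: Optional[str] = None          # name of the active random event
    remaining_steps: int = 28                   # 1 week = 28 steps
```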
---
## Training Story
```
Random baseline       → final_score ≈ 0.60–0.70
  No pattern. Misses timing windows. Doesn't protect the serenity floor.

Heuristic baseline    → final_score ≈ 0.75–0.82
  Follows observable rules only. Cannot differentiate profiles.
  Treats everyone the same.

GRPO-trained agent    → target: final_score > 0.82 on 2+ profiles
  Discovers timing bonuses per profile.
  Adapts the action mix to the person's hidden reward structure.
  The introvert's week looks different from the workaholic's week.
```
---
## Anti-Reward-Hacking Measures
| Safeguard | Mechanism |
|---|---|
| Three-layer reward | format + legality + real env replay – all three must pass |
| Repetition dampening | Same action 3× in a row → 25%/50%/75% effect reduction |
| Critical floor penalty | Any meter < 0.10 → −0.30 per step |
| Random events | 8%/step probability prevents overfitting to deterministic trajectories |
| Seed-based replay | `env_reward` reconstructs the exact episode state – the reward cannot be fabricated |
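The repetition-dampening row can be sketched as a simple run-length lookup. Which repeat count maps to which reduction is an assumption about the schedule; only the 25%/50%/75% figures come from the table:

```python
# Assumed schedule: 3rd consecutive repeat loses 25% of its effect,
# the 4th loses 50%, the 5th and beyond lose 75%.
REDUCTIONS = {3: 0.25, 4: 0.50}

def dampened_effect(base: float, run_length: int) -> float:
    """Shrink an action's effect when it is spammed back-to-back."""
    if run_length < 3:
        return base
    return base * (1.0 - REDUCTIONS.get(run_length, 0.75))
```

Combined with seed-based replay, this closes the obvious exploit of finding one high-value action and repeating it all week.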
---
## Hackathon Theme Alignment
**Primary: Theme 3.2 – World Modeling: Personalized Tasks**
The environment models real personal-assistant behaviour. The hidden profile represents real individual differences – what a person values and how activities physically affect them. Discovery through reward is how a good PA adapts over their first weeks on the job.
**Secondary: Theme 2 – Long-Horizon Planning**
28 steps with delayed, compounding consequences. Neglected connection decays slowly, but recovery gets harder with each step. The serenity spiral is triggered by accumulated bad decisions, not a single action.
---
## Implementation Reference
| Component | File |
|---|---|
| Environment | `server/rhythm_environment.py` |
| Data models | `models.py` |
| Dataset generator | `training/dataset.py` |
| Reward functions | `training/reward_functions.py` |
| Baseline evaluation | `training/inference_eval.py` |
| Training notebook | `training/RhythmEnv_GRPO_Training.ipynb` |
| Gradio UI | `ui/app.py` |
| FastAPI server | `server/app.py` |
```python
env = RhythmEnvironment()
obs = env.reset(seed=42, profile="introvert_morning") # profile optional
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
```