# Environment Design – RhythmEnv Life Simulator
## What It Is
A Life Simulator – a holistic resource-management RL environment where an agent learns a specific person's hidden patterns through experience, not configuration.
The core problem: personal AI assistants give generic advice because they don't know who you are. RhythmEnv is the training ground for an agent that must discover hidden personality dynamics through reward signals alone – the same way a good personal assistant adapts over their first weeks on the job.
---
## Why Abstract Activities, Not Tasks
An earlier design used a workday task scheduler (energy/stress meters, task queues with deadlines). We moved away from it because, with real tasks, an agent can score well just by being a good scheduler – it never needs to infer anything about the person. Abstract life activities (DEEP_WORK, MEDITATE, SOCIALIZE...) force the inference problem: the only way to do well is to figure out *who you're helping*, because the same action has wildly different value depending on the hidden profile.
| | Workday Scheduler | Life Simulator |
|---|---|---|
| Episode | 1 day, 20 steps | 1 week, 28 steps |
| State | Energy, stress, task queue | 5 life meters |
| Actions | 4 (task management) | 10 (life activities) |
| Learning signal | How to sequence tasks | Which actions serve *this specific person* |
The Life Simulator creates a **non-promptable discovery problem**: the agent cannot know the person's profile from the prompt – it must be inferred from reward patterns. This is structurally different from a task that better prompting solves.
---
## The Three Discovery Layers
### Layer 1 – Reward Weights (What Matters to This Person)
Same action, same starting state – different rewards per profile:
```
DEEP_WORK, step 1, all meters at 0.7:
workaholic_stoic:     +1.57   (progress weight 70% → work = meaning)
introvert_morning:    +0.32   (serenity weight 60% → mild net gain)
extrovert_night_owl:  -0.39   (connection weight 75% → work gives 0 connection)
```
An agent that doesn't adapt plateaus at ~0.60. One that discovers what the profile rewards pushes above 0.80.
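The sign flip in the example above falls out of a weighted sum over meter changes. A minimal sketch, assuming hypothetical weight tables that echo the percentages quoted; the dict layout and function name are illustrative, not the actual code in `server/rhythm_environment.py`:

```python
# Hypothetical Layer-1 weighting: same deltas, profile-dependent score.
# Weights echo the percentages quoted above; all names are assumptions.
PROFILE_WEIGHTS = {
    "workaholic_stoic":    {"progress": 0.70, "serenity": 0.10, "connection": 0.20},
    "introvert_morning":   {"progress": 0.25, "serenity": 0.60, "connection": 0.15},
    "extrovert_night_owl": {"progress": 0.10, "serenity": 0.15, "connection": 0.75},
}

def layer1_reward(meter_deltas: dict, profile: str) -> float:
    """Score raw meter changes by what this hidden profile values."""
    weights = PROFILE_WEIGHTS[profile]
    return sum(w * meter_deltas.get(meter, 0.0) for meter, w in weights.items())

# DEEP_WORK: raises progress, gives zero connection, costs a little serenity.
deep_work = {"progress": 0.30, "serenity": -0.05, "connection": -0.10}
```

The same `deep_work` delta scores positive for the workaholic and negative for the extrovert, which is exactly the inference pressure Layer 1 creates.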
### Layer 2 – Action Modifiers (How Actions Affect This Person)
The base effect matrix is modified invisibly per profile:
| Profile | Hidden modifier | Observable signal |
|---|---|---|
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× faster than expected |
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in the morning gives 40% of expected progress |
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining it |
| workaholic_stoic | Idle drains serenity −0.10 | ME_TIME/BINGE_WATCH lower serenity |
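One way the timing modifiers could be keyed, as a hedged sketch: a lookup on (profile, action) that rescales the base effect per time slot. The multipliers mirror the table; the data structure and function name are assumptions, and the real matrix stays hidden from the agent:

```python
# Illustrative only: (profile, action) -> {slot: multiplier}.
# Values mirror the table above; absent entries mean "no modifier".
TIMING_MODIFIERS = {
    ("introvert_morning",   "DEEP_WORK"): {0: 2.0},              # morning bonus
    ("extrovert_night_owl", "DEEP_WORK"): {0: 0.4, 2: 1.8, 3: 1.8},
}

def modified_progress(base: float, profile: str, action: str, slot: int) -> float:
    """Scale an action's base progress gain by the hidden timing modifier."""
    mods = TIMING_MODIFIERS.get((profile, action), {})
    return base * mods.get(slot, 1.0)
```

From the outside the agent only ever sees the product, so the ×2.0 morning bonus has to be reverse-engineered from reward differences across slots.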
### Layer 3 – Stress Spiral
When serenity drops below the profile's tolerance threshold, all negative effects amplify ×1.3. Wrong actions → serenity drops → worse outcomes → harder recovery. The agent must learn to protect serenity proactively.
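The spiral reduces to a single guard on negative effects. A minimal sketch, assuming a per-profile `tolerance` argument (the threshold itself is hidden from the agent); the ×1.3 amplifier comes from the text, the function shape is an assumption:

```python
def apply_serenity_effect(serenity: float, effect: float, tolerance: float) -> float:
    """Below the hidden tolerance, negative effects amplify by 1.3x.

    Sketch only: clamps the meter to [0.0, 1.0] like the other meters.
    """
    if effect < 0 and serenity < tolerance:
        effect *= 1.3
    return max(0.0, min(1.0, serenity + effect))
```

Two identical bad actions therefore cost more once the threshold has been crossed, which is what makes recovery progressively harder.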
---
## Observable vs Hidden
| Observable (agent sees every step) | Hidden (must infer from reward patterns) |
|---|---|
| All 5 meter values (0.0–1.0) | Which profile is active |
| Day of week (0–6) | Profile reward weights |
| Time slot (0–3) | Per-action modifiers |
| Active random event name | Stress tolerance threshold |
| Remaining steps | Connection decay rate |
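The observable column maps naturally onto a small state record. A sketch under the assumption that the fields mirror the table (the real model lives in `models.py`); everything hidden – profile, weights, modifiers, thresholds – is deliberately absent:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentObservation:
    """Hypothetical agent-visible state; field names are assumptions."""
    meters: dict = field(default_factory=dict)  # 5 life meters, each 0.0-1.0
    day_of_week: int = 0                        # 0-6
    time_slot: int = 0                          # 0-3
    active_event: Optional[str] = None          # name of the active random event
    remaining_steps: int = 28                   # 1 week = 28 steps
```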
---
## Training Story
```
Random baseline       → final_score ≈ 0.60–0.70
  No pattern. Misses timing windows. Doesn't protect the serenity floor.

Heuristic baseline    → final_score ≈ 0.75–0.82
  Follows observable rules only. Cannot differentiate profiles.
  Treats everyone the same.

GRPO-trained agent    → target: final_score > 0.82 on 2+ profiles
  Discovers timing bonuses per profile.
  Adapts the action mix to the person's hidden reward structure.
  The introvert's week looks different from the workaholic's week.
```
---
## Anti-Reward-Hacking Measures
| Safeguard | Mechanism |
|---|---|
| Three-layer reward | format + legality + real env replay – all three must pass |
| Repetition dampening | Same action 3× in a row → 25%/50%/75% effect reduction |
| Critical floor penalty | Any meter < 0.10 → −0.30 per step |
| Random events | 8%/step probability prevents overfitting to deterministic trajectories |
| Seed-based replay | `env_reward` reconstructs the exact episode state – the reward cannot be fabricated |
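The repetition-dampening row can be sketched as a simple run-length lookup. Which repeat count maps to which reduction is an assumption about the schedule; only the 25%/50%/75% figures come from the table:

```python
# Assumed schedule: 3rd consecutive repeat loses 25% of its effect,
# the 4th loses 50%, the 5th and beyond lose 75%.
REDUCTIONS = {3: 0.25, 4: 0.50}

def dampened_effect(base: float, run_length: int) -> float:
    """Shrink an action's effect when it is spammed back-to-back."""
    if run_length < 3:
        return base
    return base * (1.0 - REDUCTIONS.get(run_length, 0.75))
```

Combined with seed-based replay, this closes the obvious exploit of finding one high-value action and repeating it all week.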
---
## Hackathon Theme Alignment
**Primary: Theme 3.2 – World Modeling: Personalized Tasks**
The environment models real personal-assistant behaviour. The hidden profile represents real individual differences – what a person values and how activities physically affect them. Discovery through reward is how a good PA adapts over their first weeks on the job.
**Secondary: Theme 2 – Long-Horizon Planning**
28 steps with delayed, compounding consequences. Neglected connection decays slowly, but recovery gets harder with each step. The serenity spiral is triggered by accumulated bad decisions, not a single action.
---
## Implementation Reference
| Component | File |
|---|---|
| Environment | `server/rhythm_environment.py` |
| Data models | `models.py` |
| Dataset generator | `training/dataset.py` |
| Reward functions | `training/reward_functions.py` |
| Baseline evaluation | `training/inference_eval.py` |
| Training notebook | `training/RhythmEnv_GRPO_Training.ipynb` |
| Gradio UI | `ui/app.py` |
| FastAPI server | `server/app.py` |
```python
env = RhythmEnvironment()
obs = env.reset(seed=42, profile="introvert_morning") # profile optional
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
```