Rebuild as Life Simulator: 5 meters, 3 hidden profiles, GRPO training pipeline
server/rhythm_environment.py: full Life Simulator (28 steps, 5 meters, 10 actions,
3 hidden personality profiles with distinct reward weights and action modifiers)
models.py: add reward/done fields to RhythmObservation
training/: complete GRPO pipeline (dataset, train, reward_functions, inference_eval,
Colab notebook); env_reward uses real episode replay via dataset columns
ui/app.py: full rewrite for Life Simulator (5-meter display, 10 actions, profile inspector)
blog_post.md: mini-blog for hackathon submission
docs/round2/: entity_definitions and environment_design updated to Life Simulator spec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dockerfile +0 -0
- README.md +115 -163
- __init__.py +5 -6
- blog_post.md +90 -0
- client.py +26 -22
- docs/references/FAQs on Discord.md +77 -0
- docs/references/React Orchestrator Linkedin/V1_ReACT_based_Orchestrator.ipynb +0 -0
- docs/references/React Orchestrator Linkedin/V2_ReACT_Based_Orchestrator.ipynb +0 -0
- docs/references/React Orchestrator Linkedin/o3_mini_V1_ReACT_Based_Orchestrator.ipynb +0 -0
- docs/references/React Orchestrator Linkedin/o3_mini_v2_improved_react_orchestrator.ipynb +0 -0
- docs/references/[External] Meta OpenEnv Hackathon Participant Help Guide.md +425 -0
- docs/references/[External] OpenEnv Hackathon FAQs.md +556 -0
- docs/references/hackathon_checklist.md +153 -0
- docs/references/reward_engineering_overview.md +82 -0
- docs/references/reward_engineering_software_tasks.md +77 -0
- docs/references/unsloth_grpo_training_template.md +269 -0
- docs/round2/Plan_v2/CoreMEters.md +50 -0
- docs/round2/Plan_v2/GeminiDiscussion.md +61 -0
- docs/round2/Plan_v2/HumanModeling.md +93 -0
- docs/round2/Plan_v2/LifeMAth.md +89 -0
- docs/round2/Plan_v2/RandomnessFactor.md +132 -0
- docs/round2/Plan_v2/RewardIsolation.md +44 -0
- docs/round2/Plan_v2/Todo.md +14 -0
- docs/round2/[External] Apr '26 OpenEnv Hackathon Themes & Judging Criteria.md +177 -0
- docs/{round2_confirmation.md → round2/confirmation.md} +0 -0
- docs/{round2_design_notes.md → round2/design_notes.md} +0 -0
- docs/round2/entity_definitions.md +220 -0
- docs/round2/environment_design.md +209 -0
- docs/{Hackathon Themes.md → round2/hackathon_themes.md} +0 -0
- docs/round2/pitch_framing.md +57 -0
- docs/{round2_problem_statement.md → round2/problem_statement.md} +0 -0
- eval_results.json +758 -0
- inference.py +108 -93
- models.py +50 -71
- pyproject.toml +3 -2
- server/requirements.txt +1 -0
- server/rhythm_environment.py +452 -554
- tests/test_rhythm_env.py +303 -239
- training/RhythmEnv_GRPO_Training.ipynb +557 -0
- training/dataset.py +180 -0
- training/inference_eval.py +226 -0
- training/reward_functions.py +214 -0
- training/train.py +202 -0
- ui/app.py +384 -0
Dockerfile
CHANGED

Binary files a/Dockerfile and b/Dockerfile differ
README.md
CHANGED

@@ -9,17 +9,24 @@ tags:
  - openenv
  ---

- # RhythmEnv —
- An OpenEnv environment where AI agents learn to
- ##

  ## Quick Start

@@ -33,192 +40,135 @@ from rhythm_env import RhythmEnv, RhythmAction, ActionType

  async def main():
      async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
-         result = await env.reset(
-         print(f"
-         print(f"
-         result = await env.step(RhythmAction(action_type=ActionType.
          print(f"Reward: {result.reward}")

  asyncio.run(main())
  ```

- ##
- | Action | Parameters | Description |
- |--------|-----------|-------------|
- | `START_TASK` | `task_id: int` | Begin working on a task (only when idle — no current task) |
- | `CONTINUE_TASK` | — | Continue working on current task |
- | `SWITCH_TASK` | `task_id: int` | Switch to a different task (requires active task; energy + reward penalty) |
- | `TAKE_BREAK` | — | Rest to recover energy and reduce stress |
- | `stress` | `float` | Stress level (0-1), rises near deadlines, drops with breaks/completions |
- | `current_task_id` | `int \| null` | Task currently being worked on |
- | `tasks` | `List[TaskInfo]` | All tasks with id, name, description, effort, progress, deadline, importance |
- | `meetings` | `List[int]` | Timesteps blocked by meetings (agent cannot work) |
- | `remaining_steps` | `int` | Steps left in the episode |
- | `reward_breakdown` | `Dict` | Component-wise reward details for interpretability |
- - **name**: Human-readable task name (e.g., "Q3 Performance Report")
- - **description**: What the task involves (e.g., "Compile sales data, create visualizations, and write executive summary")
- - **effort**: Total work required (0-1 scale)
- - **progress**: Work completed so far
- - **deadline**: Timestep by which task should be done
- - **importance**: Priority weight (0-1)
- ##
- |------|--------|----------|------------|
- | Q3 Performance Report | 0.65 | Step 10 | 0.9 |
- | Client Emails | 0.45 | Step 13 | 0.3 |
- | Expense Filing | 0.35 | Step 16 | 0.2 |
- ###
- |------
- | Leadership Notes | 0.20 | Step 18 | 0.4 |
- - **Starting energy**: 0.70
- - **Challenge**: Two tight deadlines compete for early slots; meetings eat into critical windows
- |------|--------|----------|------------|
- | Architecture Redesign | 0.80 | Step 16 | 0.9 |
- | Fix: Login Timeout | 0.15 | Step 6 | 0.5 |
- | Fix: CSV Export | 0.15 | Step 10 | 0.4 |
- | API Documentation | 0.20 | Step 14 | 0.3 |
- | Integration Tests | 0.20 | Step 18 | 0.6 |
- - **
- - **Starting energy**: 0.40 (!)
- - **Challenge**: Total effort is 1.50 but max completable is ~1.0. Must triage. Deep work task needs sustained energy you don't have. Bug fix deadline collides with the meeting. Forces hard tradeoffs with no perfect solution.
-     tasks=[
-         {"name": "Write blog post", "effort": 0.5, "deadline": 12, "importance": 0.8,
-          "description": "Draft and edit the technical blog post on caching strategies"},
-         {"name": "Review PRs", "effort": 0.2, "deadline": 8, "importance": 0.6,
-          "description": "Review 3 open pull requests from the team"},
-         {"name": "Fix auth bug", "effort": 0.35, "deadline": 10, "importance": 0.9,
-          "description": "Debug and fix the OAuth token refresh issue"},
-     ],
-     meetings=[4, 10],
-     initial_energy=0.7,
- )
  ```

- ## Episode Design
- - **1 episode = 1 workday**: 20 steps of 30 minutes each (9am-7pm)
- - **Deterministic**: Same scenario always produces the same initial state
- - **Meetings block work**: During meeting steps, the agent's action is ignored
- - **Tasks have deadlines**: Missing them increases stress and incurs reward penalties
- - **Energy depletes with work, recovers with breaks**: The agent must pace itself
- - **Task completion auto-clears**: When a task is finished, `current_task_id` resets to `null` — the agent can immediately `START_TASK` a new one without needing a break or switch
- - **Early termination**: The episode ends early if all tasks are completed, rewarding efficient agents

- ## Environment Dynamics
- **Energy** (0-1):
- | Event | Change |
- |-------|--------|
- | Working on a task | −0.05 per step |
- | Taking a break | +0.12 per step |
- | In a meeting | −0.03 per step |
- | Switching tasks | −0.02 penalty |
- **Stress** (0-1):
- | Event | Change |
- |-------|--------|
- | Missed deadline | +0.15 |
- | Deadline approaching (≤2 steps) | +0.03 |
- | Taking a break | −0.08 |
- | Completing a task | −0.10 |
- |-----------|---------|--------|
- | Progress | `+delta × importance × 2.0` | Encourages productive work on important tasks |
- | Completion bonus | `+importance × 1.5` | Rewards finishing tasks |
- | Stress penalty | `−stress × 0.15` | Penalizes sustained high stress |
- | Deadline miss | `−0.3 per miss` | Penalizes missing deadlines |
- | Switch penalty | `−0.1` | Discourages excessive context-switching |
- | Idle penalty | `−0.05` | Penalizes wasted time |
- | Break spam | `−0.05 × max(0, consecutive − 2)` | Diminishing returns on consecutive breaks |
- | Mode bonus | `+0.05 (deep work) / +0.02 (execution)` | Hidden bonus for sustained focus |
- ##
- ```
- |-----------|-------------|
- | Completion | Importance-weighted fraction of tasks completed |
- | Deadline | Importance-weighted fraction of deadlines met (missing important deadlines hurts more) |
- | Efficiency | Realistic optimal steps (using avg energy) / actual working steps |
- | Energy mgmt | Average energy maintained over the episode |
- | Stress mgmt | 1 − average stress over the episode |
- | Switch penalty | `min(0.15, switch_count × 0.02)` — penalizes excessive context-switching |
- |----------|-------------------|--------------------------|-------------------|
- | Easy | **0.670** | 0.310 | 0.241 |
- | Medium | **0.612** | 0.533 | 0.238 |
- | Hard | 0.145 | **0.422** | 0.232 |
- - Idle/degenerate strategies score ~0.23-0.24 (zero completion, zero efficiency)
- - The heuristic dominates on easy and medium but fails on hard — it is too conservative with energy management for the low-energy start (0.40), completing zero tasks. This demonstrates that hard truly requires intelligent triage, not simple rules.
- - Strong LLM agents should score 0.65+ by learning energy management, stress-aware pacing, and deadline triage

  ## Setup Instructions

@@ -233,7 +183,7 @@ uvicorn server.app:app --host 0.0.0.0 --port 8000

  ### Docker

  ```bash
- docker build -t rhythm-env:latest
  docker run -p 8000:8000 rhythm-env:latest
  ```

@@ -254,22 +204,24 @@ python inference.py

  | Method | Endpoint | Description |
  |--------|----------|-------------|
- | `POST` | `/reset` | Start a new episode (`
- | `POST` | `/step` | Execute an action |
- | `GET` | `/state` | Get current
  | `GET` | `/health` | Health check |
  | `GET` | `/metadata` | Environment metadata |
  | `GET` | `/schema` | Action/observation JSON schemas |
- | `POST` | `/mcp` | MCP JSON-RPC endpoint |
- ```bash
- openenv validate # Local structure check
- openenv validate --url https://InosLihka-rhythm-env.hf.space # Runtime check
- ```

  ## License
  - openenv
  ---

+ # RhythmEnv — Life Simulator RL Environment

+ An OpenEnv environment where AI agents learn to balance 5 life meters across a 7-day week while discovering hidden personality profiles through reward signals.

+ ## Links

+ - **Live Environment**: [HF Space — InosLihka/rhythm_env](https://huggingface.co/spaces/InosLihka/rhythm_env)
+ - **Training Notebook**: [Colab — GRPO Training](training/RhythmEnv_GRPO_Training.ipynb)
+ - **Blog Post**: [Teaching an AI to Know You (Without Asking)](https://huggingface.co/spaces/InosLihka/rhythm_env/blob/main/blog_post.md)
+ <!-- - **Training Results**: See [Results](#training-results) section below -->

+ ## Why a Life Simulator?

+ Personal AI assistants give generic advice. They don't know *you*. RhythmEnv is an environment where an agent learns your specific patterns through experience — not configuration.

+ Every person has a hidden "DNA" that determines how activities affect their well-being: introverts drain faster from socializing, workaholics find peace through progress, night owls peak in the evening. The agent must discover these hidden dynamics by observing how the same actions produce different rewards across episodes.

+ This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent must plan across a full week while adapting to a hidden personality it must infer.

  ## Quick Start

  async def main():
      async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
+         result = await env.reset(seed=0)
+         print(f"Vitality: {result.observation.vitality}")
+         print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

+         result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
          print(f"Reward: {result.reward}")

  asyncio.run(main())
  ```

+ ## The 5 Life Meters

+ All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.

+ | Meter | What It Represents | Increases With | Decreases With |
+ |-------|-------------------|----------------|----------------|
+ | **Vitality** | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
+ | **Cognition** | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
+ | **Progress** | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
+ | **Serenity** | Inner peace vs stress | Meditate, Me Time, Exercise | Deep Work, Admin |
+ | **Connection** | Relationship health | Family Time, Socialize | Passive decay every step |

+ **Key interactions**:
+ - Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
+ - Connection decays passively — you must actively maintain relationships
+ - Meters interact non-linearly: a crash in one often cascades to others
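The interactions above can be sketched in a few lines of Python. This is an illustrative toy only — the `apply_action` helper, the 0.5 ramp threshold, the 0.2 floor, and the 0.02 decay constant are assumptions, not the environment's actual code (which lives in `server/rhythm_environment.py`):

```python
def apply_action(meters: dict, deltas: dict) -> dict:
    """Apply an action's meter deltas with a low-Vitality global multiplier.
    Illustrative sketch; constants and ramp shape are assumptions."""
    vitality = meters["vitality"]
    # Low Vitality scales down all *positive* gains (assumed linear ramp below 0.5).
    multiplier = 1.0 if vitality >= 0.5 else max(0.2, vitality / 0.5)

    new_meters = {}
    for name, value in meters.items():
        delta = deltas.get(name, 0.0)
        if delta > 0:
            delta *= multiplier  # gains are dampened when exhausted
        new_meters[name] = min(1.0, max(0.0, value + delta))

    # Connection decays passively every step, regardless of the action taken.
    new_meters["connection"] = max(0.0, new_meters["connection"] - 0.02)
    return new_meters

meters = {"vitality": 0.3, "cognition": 0.7, "progress": 0.5,
          "serenity": 0.6, "connection": 0.8}
# At vitality 0.3 the multiplier is 0.6, so a +0.2 progress gain lands as +0.12.
after = apply_action(meters, {"progress": 0.2, "vitality": -0.05})
```

This is why "grind through exhaustion" is a losing strategy: the same work action buys less and less Progress as Vitality falls.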
+ ## Action Space (10 Actions)

+ | Category | Action | Primary Effect |
+ |----------|--------|---------------|
+ | **Productivity** | `DEEP_WORK` | High Progress, drains Vitality + Cognition |
+ | | `ADMIN_WORK` | Moderate Progress, low drain |
+ | | `LEARN` | Progress + slight Cognition drain |
+ | **Recovery** | `SLEEP` | Strong Vitality + Cognition recovery |
+ | | `EXERCISE` | Vitality + Serenity boost |
+ | | `MEDITATE` | Strong Serenity + Cognition boost |
+ | **Social** | `FAMILY_TIME` | Strong Connection, costs Vitality |
+ | | `SOCIALIZE` | Connection + mild Serenity |
+ | **Leisure** | `ME_TIME` | Serenity + mild Cognition recovery |
+ | | `BINGE_WATCH` | Mild Serenity, drains Cognition (trap action) |

+ ## Episode Structure

+ - **1 episode = 1 week** = 7 days × 4 slots/day = **28 steps**
+ - **Time slots**: Morning (0), Afternoon (1), Evening (2), Night (3)
+ - **Time-of-day effects**: Morning boosts cognitive gains (+20%), Night penalizes them (-40%)
+ - **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
+ - **Deterministic** given seed: same seed → same episode trajectory
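The timestep-to-clock mapping is simple modular arithmetic. A rough sketch (function and constant names are illustrative, not the environment's actual identifiers; the modifier values mirror the bullets above):

```python
# 28-step week: timestep 0..27 maps to (day 0..6, one of 4 slots).
SLOTS = ["morning", "afternoon", "evening", "night"]

# Time-of-day effect on cognitive gains, per the README: +20% morning, -40% night.
# Afternoon/evening neutral is an assumption.
COGNITIVE_MODIFIER = {"morning": 1.2, "afternoon": 1.0, "evening": 1.0, "night": 0.6}

def clock(timestep: int) -> tuple:
    """Return (day_index, slot_name) for a timestep in 0..27."""
    day, slot = divmod(timestep, 4)
    return day, SLOTS[slot]

day, slot = clock(10)   # step 10 falls on day 2, evening slot
```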
+ ## The Hidden Variable System (Core Innovation)

+ ### What the Agent Sees
+ - All 5 meter values
+ - Current day, slot, timestep
+ - Active event (if any)
+ - Total scalar reward per step

+ ### What the Agent Does NOT See
+ - **Personality profile name** — assigned per episode, controls everything below
+ - **Reward weight decomposition** — same meter changes produce different rewards for different profiles
+ - **Action modifiers** — social drain, cognitive bonuses, shame spirals vary by profile

+ ### 3 Hidden Personality Profiles

+ | Profile | Hidden Trait Values | "Good Day" (Reward Weights) |
+ |---------|----------------|-------------------|
+ | **Introvert Morning Person** | Social drains 3× Vitality, Morning cognition ×2.0, Binge shame spiral, Solo activities +0.10 Serenity | **Serenity 60%**, Progress 20% |
+ | **Extrovert Night Owl** | Social drains only 0.2× Vitality, Evening cognition ×1.8, Morning penalty ×0.4, Social gives 2× Connection + Serenity bonus | **Connection 75%** |
+ | **Workaholic Stoic** | Progress gives +0.10 Serenity, Idle drains -0.10 Serenity, Work gives +0.06 Vitality recovery, 1.5× Vitality decay | **Progress 70%** |

+ ### The Discovery Challenge

+ The agent must:
+ 1. **Probe** — try different actions to see unexpected meter changes
+ 2. **Infer** — correlate reward patterns to discover the hidden personality
+ 3. **Adapt** — change its entire strategy based on what it learned

+ A trained agent for the Introvert learns to: front-load deep work in the morning, avoid socializing, never binge watch. The same agent for the Extrovert learns to: socialize freely, work in the evening, skip mornings.
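A profile like the ones in the table above can be encoded as plain data. A hypothetical sketch — field names are illustrative, and the minor reward weights (everything the table doesn't specify) are assumptions chosen to sum to 1.0; the real definitions are in `server/rhythm_environment.py`:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    social_vitality_drain: float  # multiplier on the Vitality cost of social actions
    reward_weights: dict          # hidden "good day" weights over the 5 meters

PROFILES = {
    "introvert_morning": Profile(
        social_vitality_drain=3.0,   # socializing drains 3x Vitality
        reward_weights={"serenity": 0.60, "progress": 0.20, "vitality": 0.10,
                        "cognition": 0.05, "connection": 0.05},
    ),
    "extrovert_night_owl": Profile(
        social_vitality_drain=0.2,   # socializing barely costs Vitality
        reward_weights={"connection": 0.75, "serenity": 0.10, "vitality": 0.05,
                        "cognition": 0.05, "progress": 0.05},
    ),
    "workaholic_stoic": Profile(
        social_vitality_drain=1.0,
        reward_weights={"progress": 0.70, "vitality": 0.10, "cognition": 0.10,
                        "serenity": 0.05, "connection": 0.05},
    ),
}
```

The key design point is that none of this struct ever appears in the observation: the agent can only reconstruct it from reward patterns.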
+ ## Reward Architecture

+ **Per-step reward** = `sum(meter_delta × hidden_weight) × 10`

+ The reward weights are hidden and differ per profile. The agent sees only the total scalar reward.

+ **Critical threshold**: If any meter drops below 0.1, a -0.5 penalty is applied.

+ **Final grader** (0-1 score at episode end):
+ ```
+ score = 0.30 × meter_balance + 0.25 × crash_free + 0.20 × progress + 0.15 × connection + 0.10 × efficiency
+ ```
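The per-step formula above is small enough to write out directly. A sketch, using an illustrative workaholic-like weight vector (the exact weight values and helper name are assumptions):

```python
# Hypothetical hidden weights for one profile (illustrative values).
HIDDEN_WEIGHTS = {"vitality": 0.05, "cognition": 0.05, "progress": 0.70,
                  "serenity": 0.15, "connection": 0.05}

def step_reward(before: dict, after: dict, weights: dict) -> float:
    """Per-step reward: sum(meter_delta * hidden_weight) * 10,
    plus the -0.5 critical-threshold penalty from the README."""
    reward = sum((after[m] - before[m]) * w for m, w in weights.items()) * 10
    if any(v < 0.1 for v in after.values()):  # any meter critically low
        reward -= 0.5
    return reward

before = {"vitality": 0.7, "cognition": 0.7, "progress": 0.5,
          "serenity": 0.6, "connection": 0.6}
after  = {"vitality": 0.6, "cognition": 0.6, "progress": 0.7,
          "serenity": 0.6, "connection": 0.58}
r = step_reward(before, after, HIDDEN_WEIGHTS)   # a DEEP_WORK-like transition
```

For this transition the Progress gain dominates under these weights, so the step nets a clearly positive reward; under the introvert's Serenity-heavy weights the same meter deltas would score much lower.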
+ ## Baseline Scores

+ | Profile | Heuristic | Random |
+ |---------|-----------|--------|
+ | Introvert Morning | **0.765** | 0.693 |
+ | Extrovert Night Owl | 0.819 | **0.863** |
+ | Workaholic Stoic | **0.761** | 0.602 |

+ The heuristic uses fixed rules (sleep at night, work in morning, socialize when connection is low). It doesn't adapt to profiles — a trained agent that discovers the hidden personality should significantly outperform it.

+ ## Training Results

+ Training uses **GRPO (Group Relative Policy Optimization)** with [Unsloth](https://github.com/unslothai/unsloth) for efficient 4-bit fine-tuning. The 3-layer reward stack teaches the model to: (1) output valid actions, (2) choose legal actions, (3) maximize environment reward.

+ ### Training Curves

+ <!-- After running the Colab notebook, replace these with your actual plots: -->
+ <!--  -->
+ <!-- *GRPO training loss over 500 steps. Loss decreases as the model learns to produce valid, high-reward actions.* -->

+ <!--  -->
+ <!-- *Mean reward trending upward indicates the agent is learning to balance meters more effectively.* -->

+ ### Baseline vs Trained Comparison

+ | Profile | Random | Heuristic | Trained | Improvement |
+ |---------|--------|-----------|---------|-------------|
+ | Introvert Morning | 0.693 | 0.765 | — | — |
+ | Extrovert Night Owl | 0.863 | 0.819 | — | — |
+ | Workaholic Stoic | 0.602 | 0.761 | — | — |

+ <!-- After training, update with actual numbers and uncomment: -->
+ <!--  -->
+ <!-- *Comparison of random baseline, heuristic baseline, and GRPO-trained agent across all 3 hidden personality profiles.* -->

+ The heuristic uses fixed rules and doesn't adapt to profiles. A trained agent that discovers the hidden personality should outperform it by adapting its strategy per-profile.

  ## Setup Instructions

  ### Docker

  ```bash
+ docker build -t rhythm-env:latest .
  docker run -p 8000:8000 rhythm-env:latest
  ```

  | Method | Endpoint | Description |
  |--------|----------|-------------|
+ | `POST` | `/reset` | Start a new episode (`seed`, `profile` optional) |
+ | `POST` | `/step` | Execute an action (`action_type`) |
+ | `GET` | `/state` | Get current state (includes hidden profile for debugging) |
  | `GET` | `/health` | Health check |
  | `GET` | `/metadata` | Environment metadata |
  | `GET` | `/schema` | Action/observation JSON schemas |

+ ## Why It Matters

+ This environment is a training ground for **truly personalized AI**:

  ```
+ User installs app → Agent runs episodes in background
+ Over time → learns energy patterns, task preferences, peak hours
+ Result → a scheduler that actually knows YOU
+ ```

+ No setup. No personality quiz. The agent figures you out.

  ## License
__init__.py
CHANGED

@@ -5,15 +5,15 @@

  # LICENSE file in the root directory of this source tree.

  """
- RhythmEnv —
- A
  """

  from .client import RhythmEnv
- from .models import ActionType, RhythmAction, RhythmObservation, RhythmState

  __all__ = [
      "RhythmEnv",

@@ -21,5 +21,4 @@ __all__ = [

      "RhythmObservation",
      "RhythmState",
      "ActionType",
-     "TaskInfo",
  ]

  # LICENSE file in the root directory of this source tree.

  """
+ RhythmEnv — Life Simulator RL Environment for OpenEnv.

+ A holistic life resource management RL environment where an agent balances
+ 5 life meters (Vitality, Cognition, Progress, Serenity, Connection) across
+ a 7-day week with hidden personality profiles.
  """

  from .client import RhythmEnv
+ from .models import ActionType, RhythmAction, RhythmObservation, RhythmState

  __all__ = [
      "RhythmEnv",
      "RhythmObservation",
      "RhythmState",
      "ActionType",
  ]
blog_post.md
ADDED
|
@@ -0,0 +1,90 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Teaching an AI to Know You (Without Asking)
|
| 2 |
+
|
| 3 |
+
Ask someone how they'd build a personal AI assistant, and they'll say: give it a personality quiz. A preferences form. Maybe a settings page where you pick "introvert" or "morning person" from a dropdown.
|
| 4 |
+
|
| 5 |
+
Sounds reasonable. It's the wrong approach entirely.
|
| 6 |
+
|
| 7 |
+
Think about the people who actually know you well β a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
|
| 8 |
+
|
| 9 |
+
They learned your hidden patterns through trial, error, and feedback. RhythmEnv is an experiment in doing that with an RL agent.
|
| 10 |
+
|
| 11 |
+
## Why personality can't be captured in a settings page
|
| 12 |
+
|
| 13 |
+
I work on AI at Microsoft. One thing I kept running into building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalized the drain. The workaholic who checks "I value work-life balance" because they know they should.
|
| 14 |
+
|
| 15 |
+
Preference forms capture what people believe about themselves. Behavior reveals what's actually true.
|
| 16 |
+
|
| 17 |
+
So the right question isn't "can we ask better questions?" It's "can we learn without asking at all?"
|
| 18 |
+
|
| 19 |
+
## What "knowing a person" actually means
|
| 20 |
+
|
| 21 |
+
Here's how I decomposed it. Every person has a set of hidden traits β atomic behavioral properties that describe how they *respond* to activities, not just what they like:
|
| 22 |
+
|
| 23 |
+
How much does socializing physically drain you? When does your brain work best β morning or evening? Does leisure make you feel guilty, or does it recharge you? Does progress at work give you inner peace, or just tire you out?
|
| 24 |
+
|
| 25 |
+
No single trait defines a person. It's the combination. An introvert who peaks in the morning has high social drain, early cognitive peak, and solo time as their recharge mechanism. An extrovert night owl has the opposite: socializing barely costs vitality, evening is when they come alive, and being alone doesn't restore them. Same list of traits. Completely different values. Completely different person.
|
| 26 |
+
|
| 27 |
+
But traits are only half of it.
|
| 28 |
+
|
| 29 |
+
## The thing I couldn't solve with traits alone
|
| 30 |
+
|
| 31 |
+
Two people can do the exact same activities and have completely different days. Not because the activities are different β because they *define a good day differently*.
|
| 32 |
+
|
| 33 |
+
This is the second layer: hidden reward weights. A definition of what a good week means to each person.
|
| 34 |
+
|
| 35 |
+
The introvert values serenity above everything else (60% of their score). A week where they maintained inner peace and made some progress is a great week. Connection barely registers. The extrovert values connection above all (75%). A week full of meaningful social interactions is a great week, even if they didn't make much career progress. The workaholic values progress above all (70%). Deep productive work is the whole point. Everything else is secondary.
|
| 36 |
+
|
| 37 |
+
The agent sees the same five meters. Takes the same ten actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
|
| 38 |
+
|
| 39 |
+
## The environment itself
|
| 40 |
+
|
| 41 |
+
RhythmEnv simulates one week in a person's life β seven days, four time slots each, 28 decisions. Each decision is an activity: deep work, exercise, sleep, meditation, family time, socializing. Ten options total.
|
| 42 |
+
|
| 43 |
+
Five meters track the person's state. Picture them like fuel gauges on a dashboard. Vitality is physical energy β sleep fills it, sustained work drains it. Cognition is mental sharpness, highest in the morning for some people, evening for others. Progress is career momentum, the only meter that only goes up through work. Serenity is inner calm β meditation and rest help, overwork kills it. And Connection, the most interesting one: it decays passively every single time slot. If you don't actively socialise, it drops on its own. The agent can't ignore it and come back to it later.
|
| 44 |
+
|
| 45 |
+
The hidden profile changes what these meters *mean*. Tell the introvert to socialise: their vitality drops three times faster than the base rate. Their body physically rejects it. Tell the extrovert the same: barely any drain. They could socialise all day.
|
| 46 |
+
|
| 47 |
+
Tell the introvert to meditate: they get a +0.10 serenity bonus on top of the base effect. Alone time is their recharge. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
|
| 48 |
+
|
| 49 |
+
Tell the workaholic to do deep work: they recover +0.06 vitality β productive work energises them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, completely different physiological response.
|
| 50 |
+
|
| 51 |
+
## What the agent must figure out, without being told
|
| 52 |
+
|
| 53 |
+
The agent sees meters, time of day, and reward. It doesn't see which person type it's helping, or the trait values, or how the reward is being computed.
|
| 54 |
+
|
| 55 |
+
After a few actions, the patterns start showing. "I socialised and my vitality crashed β this person drains from socialising." "I meditated and got a huge reward β serenity must be heavily weighted for them." "Deep work in the morning gave double progress β this person peaks early."
|
| 56 |
+
|
| 57 |
+
A good agent should probe in the first few steps, infer the person type from the unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialise less. One that discovers it's helping a workaholic should maximise productive hours and cut idle time.
|
| 58 |
+
|
| 59 |
+
## The training signal
Here's what makes this tractable for RL. From the same starting state (Monday morning, all meters at 0.7), the best action is completely different per profile:
| Profile | Best action | Reward | Worst action | Reward |
|---|---|---|---|---|
| Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
| Extrovert | FAMILY_TIME | +2.63 | ME_TIME | -0.42 |
| Workaholic | DEEP_WORK | +1.57 | ME_TIME | -0.27 |
GRPO (Group Relative Policy Optimization) generates multiple candidate actions for each state, scores them all against the real environment, then updates the model to prefer the higher-scoring ones. Think of it as the model getting to observe "if I had done X instead of Y here, the outcome would have been this" and slowly building intuition for which choices work for which person.
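The "group relative" part reduces to a simple normalisation: score a group of sampled actions, then measure each one against the group's mean. A minimal sketch:

```python
# The group-relative core of GRPO in miniature: sample G candidate actions for
# one state, score each with the environment's reward, then normalise within
# the group. Positive advantage pushes that action's probability up, negative
# pushes it down. Rewards below are example numbers, not real rollouts.
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

adv = group_advantages([1.76, 0.03, 0.9, 1.2])  # 4 sampled actions, one state
```

Because the baseline is the group mean rather than a learned value model, the trainer only needs rollouts and rewards, which is what makes it practical on small hardware.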
The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4.
## What I'm watching for
The rule-based heuristic baseline (fixed logic, no profile adaptation, treats everyone the same) scores around 0.76-0.82 depending on the profile. It works *despite* the hidden dynamics, not because it understands them. Sleep when vitality is low. Meditate when serenity is low. Socialise when connection drops. Reasonable advice for anyone.
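That baseline is a handful of threshold rules. A sketch with assumed thresholds (the real baseline's cutoffs may differ):

```python
# The fixed heuristic described above, as a sketch: same rules for every
# profile, no adaptation. Thresholds are assumed for illustration.
def heuristic_action(meters: dict[str, float]) -> str:
    if meters["vitality"] < 0.3:
        return "SLEEP"
    if meters["serenity"] < 0.3:
        return "MEDITATE"
    if meters["connection"] < 0.3:
        return "SOCIALIZE"
    return "DEEP_WORK"
```

Note what it cannot do: for an introvert with low connection it still prescribes SOCIALIZE, the introvert's worst action, which is exactly the gap the trained agent is supposed to close.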
The goal for the trained agent isn't just higher scores. It's qualitatively different action sequences per person. The introvert's week should look nothing like the extrovert's week. The workaholic's Monday should look nothing like the introvert's Monday. If the agent is just scoring higher by exploiting a pattern that works across all profiles, that's not discovery; that's luck.
No questionnaire. No settings page. Just attention, inference, and adjustment.
That's what I think personal AI should actually feel like.
---
**Links:**
- [Live Environment (HF Space)](https://huggingface.co/spaces/InosLihka/rhythm_env)
- [Training Notebook (Colab)](training/RhythmEnv_GRPO_Training.ipynb)
- [Source Code](https://huggingface.co/spaces/InosLihka/rhythm_env)
*Built for the Meta PyTorch OpenEnv Hackathon Grand Finale, Bangalore, April 2026.*
client.py
CHANGED

```diff
@@ -7,7 +7,7 @@
 """
 RhythmEnv Client.
 
-Provides the WebSocket client for connecting to a RhythmEnv server.
+Provides the WebSocket client for connecting to a RhythmEnv Life Simulator server.
 """
 
 from __future__ import annotations
@@ -17,29 +17,25 @@ from typing import Any, Dict
 from openenv.core.client_types import StepResult
 from openenv.core.env_client import EnvClient
 
 # Support both package and standalone imports
 try:
     from .models import RhythmAction, RhythmObservation, RhythmState
 except ImportError:
     from models import RhythmAction, RhythmObservation, RhythmState
 
 
 class RhythmEnv(EnvClient[RhythmAction, RhythmObservation, RhythmState]):
     """
-    Client for the RhythmEnv
+    Client for the RhythmEnv Life Simulator.
 
     Example:
         >>> async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as client:
-        ...     result = await client.reset(
-        ...     result = await client.step(RhythmAction(action_type=ActionType.
+        ...     result = await client.reset()
+        ...     result = await client.step(RhythmAction(action_type=ActionType.DEEP_WORK))
     """
 
     def _step_payload(self, action: RhythmAction) -> Dict[str, Any]:
         """Serialize RhythmAction to JSON payload."""
-        if action.task_id is not None:
-            payload["task_id"] = action.task_id
-        return payload
+        return {"action_type": action.action_type.value}
@@ -47,12 +43,15 @@ class RhythmEnv(EnvClient[RhythmAction, RhythmObservation, RhythmState]):
     def _parse_result(self, payload: Dict[str, Any]) -> StepResult[RhythmObservation]:
         """Parse server response into StepResult[RhythmObservation]."""
 
         observation = RhythmObservation(
             timestep=obs_data.get("timestep", 0),
+            day=obs_data.get("day", 0),
+            slot=obs_data.get("slot", 0),
+            vitality=obs_data.get("vitality", 0.8),
+            cognition=obs_data.get("cognition", 0.7),
+            progress=obs_data.get("progress", 0.0),
+            serenity=obs_data.get("serenity", 0.7),
+            connection=obs_data.get("connection", 0.5),
+            active_event=obs_data.get("active_event"),
+            remaining_steps=obs_data.get("remaining_steps", 28),
             reward_breakdown=obs_data.get("reward_breakdown", {}),
             done=payload.get("done", False),
             reward=payload.get("reward", 0.0),
@@ -69,10 +68,15 @@ class RhythmEnv(EnvClient[RhythmAction, RhythmObservation, RhythmState]):
         """Parse server response into RhythmState."""
         return RhythmState(
             episode_id=payload.get("episode_id", ""),
-            task_name=payload.get("task_name", ""),
-            timestep=payload.get("timestep", 0),
-            energy=payload.get("energy", 1.0),
-            stress=payload.get("stress", 0.0),
-            current_task_id=payload.get("current_task_id"),
             step_count=payload.get("step_count", 0),
+            timestep=payload.get("timestep", 0),
+            day=payload.get("day", 0),
+            slot=payload.get("slot", 0),
+            profile_name=payload.get("profile_name", ""),
+            vitality=payload.get("vitality", 0.8),
+            cognition=payload.get("cognition", 0.7),
+            progress=payload.get("progress", 0.0),
+            serenity=payload.get("serenity", 0.7),
+            connection=payload.get("connection", 0.5),
+            active_event=payload.get("active_event"),
         )
```
docs/references/FAQs on Discord.md
ADDED
|
@@ -0,0 +1,77 @@
**A message from the team | Meta PyTorch OpenEnv Hackathon x Scaler School of Technology**

We want to start by saying something that we mean genuinely: thank you.

Over the past few weeks, you showed up. In numbers we did not fully anticipate. With submissions, energy, questions, and expectations that reflect just how much this means to you. And that means everything to us.

This hackathon is the first of its kind at this scale in India. We are not just saying that as a line. We mean it operationally. The infrastructure, the evaluation process, and the coordination across Meta, PyTorch, Hugging Face, and the team at Scaler School of Technology are being stress-tested in real time. Some things might not have gone as planned; we own that and are working around the clock to fix them. And we are committed to being transparent with you about every single one of them.

We're creating this document to answer common questions that we're seeing on the Discord channel. We will keep updating this document.

**Who are the mentors and judges for the finale?**

We are proud to share the full list of mentors and judges who will be part of the in-person finale:

* Sanyam Bhutani, Partner Engineer at Meta
* Yash Khare, Partner Engineer at Meta
* Nilesh Pandey, Partner Engineer at Meta
* Adithya S Kolavi, ML Engineer at Hugging Face
* Adarsh Shirawalmath, ML Engineer at Hugging Face
* Arkadip Maitra, ML Engineer at Red Hat
* Aashay Sachdeva, Founding Team at Sarvam
* Deepa Dhevannan, Gen AI Solutions Architect
* Soumik Rakshit, ML Engineer at Zomato
* Ayush Satyam, Systems ML Engineer at Red Hat
* Parshant Sharma, Machine Learning Engineer at Red Hat

These are practitioners actively working at the forefront of AI. The team has worked hard to bring them together so they can be around to help you make your environments even better.

**Why are results online, and why does the finale still happen on campus? Why are results not being declared on April 26?**

We want to be fully transparent here, and we want to address both of these together because they come from the same place.

The final evaluation is being handled directly by engineers from Meta, PyTorch, and Hugging Face. With 800+ submissions, we made the decision to move to a hybrid evaluation model. Initial screening uses automated tooling, but every top team's submission will receive a dedicated, granular review by domain experts, with ~20-30 minutes of evaluation per team. This is not something we are willing to rush. Every submission deserves to be looked at fairly, and that takes time.

We had originally envisioned the entire evaluation happening offline. But given the volume of submissions, completing a fully offline evaluation before bringing everyone to campus would have meant asking you to wait significantly longer, and that felt like the wrong trade-off.

What we refused to compromise on is the experience of you all coming together in person, dedicatedly building and improving your environments with the mentors mentioned above around to help you. There is something that cannot be replicated about builders in the same room, working through ideas together, pushing each other. Beyond that, this is a rare opportunity for the Meta/PyTorch and Hugging Face teams to interact directly with engineers building in India, and to get a genuine sense of the depth of engineering talent this country has. That kind of exposure goes both ways, and it is something we were not willing to cut.

**Has promised mentorship and expert access materialised?**

Yes, and more is coming. A live session has already been conducted with Ben Burtenshaw (Community Education at Hugging Face) and Pulkit Sharma (Senior Instructor at Scaler): [link to session recording here](https://www.youtube.com/live/kkCNMz0Ptd8?si=KDIaWXSEX6up4lU4). Additional modules have been shared over the dashboard and emails. Extensive additional sessions and mentor touchpoints are planned on campus.

**Why was the problem theme document edited in real time?**

This one is on us, and we want to be straightforward about it. The document that was shared with participants contained leftover content from a previous hackathon that should have been removed before it went out. It was an editorial error, not an intentional change, and we corrected it as soon as we caught it.

More broadly, this hackathon is being run by open-source teams across multiple organizations coordinating in real time. Mistakes like this can happen, and when they do, we would rather fix them quickly and tell you exactly what happened than let confusion sit.

We also want to address the evaluation adjustments that some of you noticed. We made deliberate changes to the judging process to ensure every submission gets the time and attention it deserves. Rushing evaluations on the day of the event would have been unfair to everyone. This was a considered call, not a last-minute scramble.

We ask that conversations about the hackathon stay within the designated Discord channels so we can track every concern and respond properly. And we ask that everyone continue to engage with each other, and with us, with respect. This community has been extraordinary, and that is worth protecting.

To be clear, anyone who crosses boundaries and breaks community rules will be banned and therefore automatically disqualified from the finale.

**Are the prizes and the number of winners still the same?**

Goes without saying. There are no changes to the prize structure or the number of winners. 15 teams will be awarded, with a total cash prize pool of $30,000, as published on the site:

| Position Secured | Prize |
| :---- | :---- |
| 1st | $7,500 |
| 2nd | $5,000 |
| 3rd | $3,500 |
| 4th to 8th | $2,000 each |
| 9th to 15th | $650 each |

Additionally, top teams will receive an interview opportunity with the Meta and Hugging Face AI teams. This has not changed and will not change.

**Will Scaler School of Technology students be favoured in the final evaluation?**

Absolutely not. The final evaluation is entirely in the hands of the Meta, PyTorch, and Hugging Face teams and will follow the judging criteria outlined.

We are working through every other question that has come in and will post structured answers here as we go. If something is unclear, if something feels wrong, keep asking. We would rather hear it from you directly than have you sit with uncertainty.

This hackathon is something India has not seen before. We are building the playbook in real time, at scale, and that is both the most exciting and the most humbling part of doing this. We are grateful that every single one of you showed up for it.

More updates soon.
docs/references/React Orchestrator Linkedin/V1_ReACT_based_Orchestrator.ipynb
ADDED
The diff for this file is too large to render.

docs/references/React Orchestrator Linkedin/V2_ReACT_Based_Orchestrator.ipynb
ADDED
The diff for this file is too large to render.

docs/references/React Orchestrator Linkedin/o3_mini_V1_ReACT_Based_Orchestrator.ipynb
ADDED
The diff for this file is too large to render.

docs/references/React Orchestrator Linkedin/o3_mini_v2_improved_react_orchestrator.ipynb
ADDED
The diff for this file is too large to render.
docs/references/[External] Meta OpenEnv Hackathon Participant Help Guide.md
ADDED
|
@@ -0,0 +1,425 @@
# Hackathon Self-Serve Guide: Build an RL Environment, Train an LLM, Ship a Demo

## **0) What you are building**

The core idea is not just to fine-tune a text model, but to build a **specialized LLM system** that can act inside an environment, get feedback, and improve through reinforcement learning. The practical stack discussed here is:

**Environment → verifier/reward functions → TRL trainer → Unsloth for efficiency → deployment on OpenEnv / Spaces**.

A strong project usually looks like one of these.

Please refer to [[External] Apr '26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting and forming problem statements.

## **1) Start with the right project idea**

Pick a task that has all three of these properties:

1. **The model can act step by step**
2. **You can verify success programmatically**
3. **The task is hard enough to be interesting, but not so hard that the model never succeeds**

This last point matters a lot. RL only works if the probability of getting a good answer is greater than zero. If your task is so hard that the model never gets any reward, you will burn compute and learn nothing.

Please refer to [[External] Apr '26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing) for theme guidelines on selecting and forming problem statements.

A useful rule: **prefer tasks with crisp verification over tasks that only "look good" to a human.** RL gets easier when the reward is objective.

## **2) Understand the minimum RL loop before you build**

At a high level, your loop is:

1. Give the model a prompt
2. Let it generate an action, strategy, answer, or code
3. Execute that output in an environment or verifier
4. Convert the result into a reward
5. Update the model so higher-reward behavior becomes more likely

That is the practical mental model for RL here. The system samples many outputs, scores them, and shifts probability mass away from bad outputs and toward better ones.

One especially useful framing is that RL is like a more efficient version of repeated in-context improvement. Instead of repeatedly stuffing previous examples into the context, you let backpropagation store what worked into the weights.

## **3) Decide whether you need SFT first**

Use this simple rule:

* If you have **a lot of good data**, use **SFT**
* If you **do not have data but can verify outputs**, use **RL**
* In many practical cases, do **a little SFT first**, then RL

Why this matters:

* SFT is generally more sample-efficient
* RL is useful when you can test outcomes but cannot cheaply author ideal traces
* RL often needs some warm start, formatting priming, or easy tasks first so that good rollouts happen at all

For hackathon teams, the best path is usually:

1. Start from a capable base/instruct model
2. Add light formatting or task scaffolding if needed
3. Use RL for improvement, not as magic from scratch

## **4) Design the environment before you design the trainer**

Treat the environment as a first-class artifact. It should define:

* **reset()**: start a fresh episode
* **step(action)**: apply an action and return the next result
* **state() / observation**: what the agent sees
* **reward**: what counts as progress or success

OpenEnv standardizes this so the same training code can work across many environments, instead of every team inventing a different API. That is one of the main reasons to use it in a hackathon.

Think about your environment in this order:

1. What does the agent observe?
2. What actions can it take?
3. What ends an episode?
4. How do you compute reward?
5. How do you stop abuse, infinite loops, or cheating?
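The reset/step/observation/reward shape can be sketched in a few lines. This is a toy environment in that shape, not OpenEnv's actual API (OpenEnv adds dataclass actions, a FastAPI server, and a client on top of the same idea):

```python
# A minimal environment skeleton in the reset/step/observation/reward shape:
# the agent increments a counter and is rewarded when it reaches the target.
from dataclasses import dataclass

@dataclass
class Observation:
    text: str
    reward: float = 0.0
    done: bool = False

@dataclass
class CounterEnv:
    target: int = 3
    count: int = 0

    def reset(self) -> Observation:
        self.count = 0
        return Observation(text=f"count=0, reach {self.target}")

    def step(self, action: str) -> Observation:
        if action == "increment":
            self.count += 1
        done = self.count >= self.target
        return Observation(text=f"count={self.count}",
                           reward=1.0 if done else 0.0, done=done)

env = CounterEnv()
obs = env.reset()
while not obs.done:
    obs = env.step("increment")
```

Once the dynamics live behind `reset()`/`step()`, the trainer never needs to know them, which is exactly the separation the guide describes.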
## **5) Build the environment using OpenEnv**

The intended workflow is to bootstrap an environment skeleton and then fill in the behavior. OpenEnv's CLI creates the scaffolding for you. The environment is implemented as a Python package and exposed via a FastAPI app.

Your implementation typically defines:

* action dataclass
* observation dataclass
* state representation
* environment methods like reset and step
* FastAPI wrapper / client-server interface

That gives you a clean separation:

* the **environment** handles world dynamics and scoring,
* the **trainer** handles optimization,
* and the **model** just learns to act inside the interface.

## **6) Keep the task simple at first**

Do not begin with your hardest benchmark. Start with the easiest version of your environment that still proves the concept. This is where curriculum learning helps.

A good progression:

1. easy tasks with short horizons,
2. medium tasks with a little more branching,
3. harder tasks only after the model starts getting non-zero reward.

The principle is simple: **make success possible early**. If the model never sees successful trajectories, learning stalls.

## **7) Design rewards carefully**

Your reward function is your task specification. If it is weak, incomplete, or easy to exploit, the model will optimize the wrong thing very efficiently.

A strong reward design usually includes multiple components, for example:

* execution success,
* correctness,
* format compliance,
* timeouts,
* resource usage,
* safety constraints,
* and anti-cheating checks.

One explicit recommendation was to use **multiple independent reward functions**, not just one. If you only have a single reward signal, it is easier for the model to hack it. Multiple independent checks reduce that risk.

For example, for a coding environment:

* reward passing tests,
* penalize timeouts,
* reward format compliance,
* reject use of forbidden globals,
* and separately verify the function contract.

## **8) Protect yourself against reward hacking**

Reward hacking is one of the biggest practical failure modes. The model may learn shortcuts that maximize your reward without solving the real task. Examples mentioned include:

* editing timers,
* caching results,
* abusing globals,
* mutating protected state,
* or exploiting environment bugs.

What to do:

1. Use multiple independent reward functions
2. Lock down execution where possible
3. Add time limits
4. Avoid unrestricted global state
5. Sample outputs frequently and inspect them
6. Terminate or roll back runs if behavior drifts badly

A particularly practical recommendation was to use a **locked-down function** or restricted execution approach so the model cannot rely on undeclared globals or hidden cached state.

Also, do not just let training run forever without checking generations. Periodic human inspection is still necessary.
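The "multiple independent reward functions" advice can be sketched concretely: each check scores one aspect, and the total is a weighted sum, so no single signal can be gamed in isolation. Function names, tags, and weights below are illustrative:

```python
# Sketch of independent reward components for a text task: format, length,
# and correctness are checked separately, then combined with assumed weights.
import re

def reward_format(completion: str) -> float:
    """1.0 if the completion wraps its answer in <answer>...</answer> tags."""
    return 1.0 if re.search(r"<answer>.*</answer>", completion, re.S) else 0.0

def reward_length(completion: str, max_chars: int = 500) -> float:
    """Penalise rambling: 0.0 beyond the length budget."""
    return 0.0 if len(completion) > max_chars else 1.0

def reward_correct(completion: str, expected: str) -> float:
    """1.0 only if the tagged answer matches the expected string."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return 1.0 if m and m.group(1).strip() == expected else 0.0

def total_reward(completion: str, expected: str) -> float:
    return (0.2 * reward_format(completion)
            + 0.1 * reward_length(completion)
            + 0.7 * reward_correct(completion, expected))
```

Keeping the components separate also means you can log each column individually during training, which is exactly what the monitoring advice later in this guide asks for.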
## **9) Use process-aware feedback when you can**

Naively assigning the same final reward to every token is inefficient. If possible, use richer supervision that distinguishes good intermediate steps from bad ones. That is the idea behind **process supervision**.

In practice, this can be approximated by:

* line-by-line checks,
* step-level verifiers,
* program trace analysis,
* or LLM-as-a-judge for intermediate reasoning.

But be careful: LLM-as-a-judge can itself be gamed. Use it as one signal, not the only signal.

For a hackathon, outcome-based verification plus a few lightweight process checks is usually the sweet spot.

## **10) Pick the right training stack**

The intended stack here is:

* **TRL** for RL training algorithms
* **Unsloth** to make RL training and inference more efficient
* **OpenEnv** to standardize environment interaction

This combination works because:

* OpenEnv gives you a common environment interface
* TRL gives you RL trainers like GRPO
* Unsloth reduces memory use and improves efficiency on top of TRL

One of the practical examples used the same prompt repeated many times, routed through an environment, with TRL driving training and Unsloth helping with performance.

## **11) Prefer GRPO / RLVR-style training for verifiable tasks**

The RL setup discussed here leans toward **RL with verifiable rewards**:

* instead of a learned reward model,
* use a verifier, test harness, regex check, executor, or environment.

GRPO was described as a more efficient evolution relative to older PPO-style setups, especially by simplifying away parts like the value model.

For hackathon purposes, the key practical takeaway is:

* if the task is verifiable,
* build the verifier first,
* then plug that verifier into RL training.
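The verifier-first flow can be sketched with the `(prompts, completions) -> list[float]` shape that TRL-style GRPO reward functions use; the toy arithmetic checker below stands in for a real verifier:

```python
# Build the verifier first, then wrap it as a reward function. The verifier
# here is a toy: it evaluates the left-hand side of "expr=" with builtins
# disabled and compares it to the model's answer.
def verify(problem: str, answer: str) -> bool:
    lhs = problem.split("=")[0]
    try:
        return float(eval(lhs, {"__builtins__": {}})) == float(answer.strip())
    except Exception:
        return False

def reward_fn(prompts: list[str], completions: list[str], **kwargs) -> list[float]:
    """Batch reward in the shape TRL-style GRPO trainers expect."""
    return [1.0 if verify(p, c) else 0.0 for p, c in zip(prompts, completions)]

scores = reward_fn(["2+2=", "3*3="], ["4", "8"])
```

Because the verifier is a plain function, you can unit-test it before any GPU time is spent, which is the whole point of building it first.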
|
| 203 |
+
## **12\) Keep inference fast**
|
| 204 |
+
|
| 205 |
+
One important point: in RL for LLMs, **inference can dominate total runtime**. Over time, rollout generation often becomes the bottleneck, not the optimizer step.
|
| 206 |
+
|
| 207 |
+
That means your project speed depends heavily on:
|
| 208 |
+
|
| 209 |
+
* fast sampling,
|
| 210 |
+
* tight environment loops,
|
| 211 |
+
* low-overhead execution,
|
| 212 |
+
* and efficient model runtime.
|
| 213 |
+
|
| 214 |
+
This is one reason Unsloth matters in the stack, and another reason to avoid overly heavy environments early in the hackathon.
|
| 215 |
+
|
| 216 |
+
## **13\) Deploy your environment early**
|
| 217 |
+
|
| 218 |
+
OpenEnv environments are designed to be deployed as **Hugging Face Spaces**, which provide:
|
| 219 |
+
|
| 220 |
+
* a running server,
|
| 221 |
+
* a Git repository,
|
| 222 |
+
* and a container registry.
|
| 223 |
+
|
| 224 |
+
That gives you several ways to work:
|
| 225 |
+
|
| 226 |
+
* interact with the remote Space directly,
|
| 227 |
+
* install the client code from the repo,
|
| 228 |
+
* pull and run the container locally,
|
| 229 |
+
* or run the FastAPI app locally via Python/Uvicorn.
|
| 230 |
+
|
| 231 |
+
Why this is good for a hackathon:
|
| 232 |
+
|
| 233 |
+
* one shared source of truth,
|
| 234 |
+
* easier collaboration,
|
| 235 |
+
* easier demos,
|
| 236 |
+
* easier switching between local and remote execution.
|
| 237 |
+
|
| 238 |
+
A good habit is to deploy an early version of the environment before training seriously. That catches API and packaging issues early.
|
| 239 |
+
|
| 240 |
+
## **14\) Scale only after the environment is stable**
|
| 241 |
+
|
| 242 |
+
There was a dedicated tutorial flow around:
|
| 243 |
+
|
| 244 |
+
1. environment,
|
| 245 |
+
2. deployment,
|
| 246 |
+
3. scaling,
|
| 247 |
+
4. training with TRL and Wordle.
|
| 248 |
+
|
| 249 |
+
Follow the same order.
|
| 250 |
+
|
| 251 |
+
Do **not** start with scale. First confirm:
|
| 252 |
+
|
| 253 |
+
* reset works,
|
| 254 |
+
* step works,
|
| 255 |
+
* rewards are sensible,
|
| 256 |
+
* timeouts work,
|
| 257 |
+
* logs are visible,
|
| 258 |
+
* and the environment can be run locally and remotely.

Only then:

* increase batch sizes,
* duplicate prompts or tasks,
* expand task diversity,
* and benchmark throughput.

## **15\) Monitor the right things during training**

Do not watch only one scalar. Monitor:

* overall reward,
* individual reward function columns,
* success indicators,
* timeout frequency,
* and generated strategies over time.

A very concrete suggestion was:

* watch whether the reward is going up,
* and separately watch critical columns like "function works."

Also inspect actual generations during training. A rising reward is not enough if the model is learning to exploit bugs.
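
One lightweight way to get those per-column views is to log each reward function separately instead of only their sum. A minimal sketch (function names and checks are illustrative, not from any particular trainer API):

```python
# Illustrative per-column reward logging. Each check gets its own column
# so you can watch "function works" separately from the total reward.

def reward_format(completion: str) -> float:
    # Cheap formatting check: non-empty and within a length budget.
    return 1.0 if 0 < len(completion) <= 2000 else 0.0

def reward_function_works(completion: str) -> float:
    # Stand-in for a real execution check (run unit tests in a sandbox).
    return 1.0 if "def solve" in completion else 0.0

REWARD_FNS = {
    "format_ok": reward_format,
    "function_works": reward_function_works,
}

def score(completion: str) -> dict:
    columns = {name: fn(completion) for name, fn in REWARD_FNS.items()}
    columns["total"] = sum(columns.values())
    return columns  # log the whole dict per step, not just "total"

print(score("def solve():\n    return 42"))
```

Logging the dict per step gives you exactly the "reward columns" view: total can rise while `function_works` stays flat, which is an early warning of reward hacking.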

## **16\) Save models correctly**

If you use QLoRA / LoRA-style training, be careful when saving. One explicit warning was:

**Do not upcast a 4-bit model to 16-bit and then merge the LoRA weights naively.** That can badly damage model quality. Instead, use the proper merged-save path, or use the adapters directly.

For participants, that means:

* keep your training save path simple,
* test post-training inference immediately,
* and do not leave export until the end.

## **17\) How to structure your team over the hackathon**

A very effective team split is:

**Person A: Environment**

* builds reset/step/state
* adds timeouts and safety constraints
* makes local and remote execution work

**Person B: Verifier / Rewards**

* writes multiple reward functions
* adds anti-hacking checks
* makes failure cases visible

**Person C: Training**

* sets up TRL \+ Unsloth
* runs experiments
* tracks metrics and generations

**Person D: Demo / Product**

* prepares the Space demo
* creates a simple interface
* records examples and final benchmarks

This split matches the way the stack naturally decomposes in practice.

## **18\) A practical 1-day execution plan**

### **Phase 1: Pick a narrow task**

Choose a small, verifiable environment. Avoid huge long-horizon tasks first.

### **Phase 2: Build the environment**

Use OpenEnv init, implement reset/step/state, and get a local loop working.

### **Phase 3: Build rewards**

Add at least 2–4 independent reward checks, plus timeout and anti-cheat logic.

### **Phase 4: Deploy**

Push to a Space or run locally via container/Uvicorn so teammates can use the same environment.

### **Phase 5: Train small**

Run a tiny TRL \+ Unsloth experiment first. Look at outputs, not just metrics.

### **Phase 6: Inspect for hacking**

Sample generations. Check for globals, hacks, environment abuse, or suspicious shortcuts.

### **Phase 7: Add curriculum**

If the model gets zero reward too often, simplify tasks or add easier start states.

### **Phase 8: Train bigger**

Only after the loop is stable should you increase scale, batch size, or environment diversity.

### **Phase 9: Save and demo**

Export the trained model correctly, test inference, and show before/after behavior.

## **19\) What judges or reviewers will likely find compelling**

The strongest hackathon projects usually show:

* a clear environment design,
* objective reward functions,
* evidence that the model improved,
* prevention against reward hacking,
* a reproducible deployment story,
* and a sharp demo.

A simple but strong demo format is:

1. baseline model attempt,
2. reward/verifier output,
3. trained model attempt,
4. measurable improvement,
5. short explanation of safeguards.

## **20\) Suggested problem statement theme directions**

Please refer to [\[External\] Apr '26 OpenEnv Hackathon Themes](https://docs.google.com/document/d/1Odznuzwtb1ecDOm2t6ToZd4MuMXXfO6vWUGcxbC6mFs/edit?usp=sharing)

## **21\) Common mistakes to avoid**

* Picking a task so hard that success probability is zero
* Using only one reward function
* Not checking for reward hacking
* Training before the environment is stable
* Relying only on average reward and not inspecting outputs
* Forgetting timeouts and sandbox limits
* Saving LoRA/QLoRA models incorrectly

## **22\) Learning Resources**

**(Recommended) RL Environment Lecture Chapters:**
[**https://openenv-india-apr-2026.lovable.app/**](https://openenv-india-apr-2026.lovable.app/)

**Module 1: Why OpenEnv?** (\~7 min)
▸ Workshop 8:02–15:05 → [https://www.youtube.com/watch?v=1jU05MlENOI\&t=482s](https://www.youtube.com/watch?v=1jU05MlENOI&t=482s)
▸ Sanyam: RL loop, fragmented env APIs, OpenEnv as universal interface, Gymnasium spec \+ Docker
▸ Alt: Mega Lecture 40:01–46:00 → [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=2401s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=2401s)

**Module 2: Using Existing Envs** (\~7.5 min)
▸ Workshop 35:33–43:05 → [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2133s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2133s)
▸ Ben: Hub org, env collections, 3 Space interfaces (server/repo/registry), from\_hub
▸ Alt: Mega Lecture 1:24:11–1:30:00 → [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5051s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5051s)

**Module 3: Deploying Envs** (\~9 min)
▸ Mega Lecture 1:30:00–1:39:07 → [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5400s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5400s)
▸ Ben: live openenv init, scaffold, running locally, openenv push, Docker run from Space
▸ Alt: Workshop 43:05–48:30 → [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2585s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2585s)

**Module 4: Building Your Own** (\~6.5 min)
▸ Workshop 43:45–50:20 → [https://www.youtube.com/watch?v=1jU05MlENOI\&t=2625s](https://www.youtube.com/watch?v=1jU05MlENOI&t=2625s)
▸ Ben: scaffold files, business logic (reset/step), models, client, publishing
▸ Alt: Mega Lecture 1:33:30–1:39:07 → [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=5610s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5610s)

**Module 5: Training \+ TRL** (\~14 min)
▸ Mega Lecture 1:53:20–2:07:12 → [https://www.youtube.com/watch?v=Jew4lhAiqnw\&t=6800s](https://www.youtube.com/watch?v=Jew4lhAiqnw&t=6800s)
▸ Lewis: Wordle GRPO walkthrough – rollout function, reward shaping, GRPOTrainer, live training
▸ Alt: Workshop 22:24–34:12 → [https://www.youtube.com/watch?v=1jU05MlENOI\&t=1344s](https://www.youtube.com/watch?v=1jU05MlENOI&t=1344s)
docs/references/[External] OpenEnv Hackathon FAQs.md
ADDED
@@ -0,0 +1,556 @@

## **1\) What is reinforcement learning in the context of LLMs?**

Reinforcement learning for LLMs is a loop where the model generates an answer, code snippet, plan, or action sequence; that output is evaluated by a verifier or environment; and the resulting reward is used to update the model so higher-reward behaviors become more likely over time. In practice, this is often used after pretraining and supervised fine-tuning to sharpen behaviors like reasoning, code generation, or tool use. The session framed this intuition as turning repeated trial-and-error into weight updates instead of stuffing more and more examples into the prompt.

A good mental model is: supervised fine-tuning tells the model "copy this good target," while RL tells it "try many possibilities and move probability mass toward the ones that score better." PPO is one classic algorithm for this style of training, and GRPO is a later variant used heavily in modern LLM work because it can be more memory-efficient for certain setups. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))

For deeper reading:

* TRL docs for RL trainers and workflows. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
* PPO paper. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
* DeepSeekMath for GRPO. ([arXiv](https://arxiv.org/abs/2402.03300?utm_source=chatgpt.com))

## **2\) Why do rewards matter so much?**

Rewards are the only signal telling the model what "better" means. If your reward is well aligned with the real task, RL can push the model toward genuinely useful behavior. If your reward is incomplete or easy to game, the model will optimize the wrong thing very effectively. The session emphasized that RL gives you what you asked for, not necessarily what you meant.

For example, if you reward generated code only for passing a shallow regex or a weak unit test, the model may learn to exploit those checks instead of solving the underlying problem. This is why reward design is not a detail; it is the task specification. DeepMind's discussion of "specification gaming" makes the same point in broader RL terms: weakly specified rewards create loopholes that search will discover. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))

Useful reading:

* DeepMind on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
* Lilian Weng on reward hacking. ([Lil'Log](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/?utm_source=chatgpt.com))

## **3\) What is rewards engineering?**

Rewards engineering is the work of designing, combining, validating, and monitoring reward signals so that optimization pressure produces the behavior you actually want. In LLM RL, that usually means deciding:

* what gets rewarded,
* how much it gets rewarded,
* when it gets rewarded,
* what gets penalized,
* and how you audit whether the reward is being gamed.

A practical reward function often has several components. For a code task, you might combine syntax validity, execution success, unit test pass rate, latency, memory use, formatting compliance, and safety checks. The session highlighted verifier-based reward design such as formatting checks, execution checks, regex checks, and environment-based evaluation instead of a learned reward model alone.

A useful principle is to reward outcomes first, then add process constraints only where needed. Over-shaping the reward can make training brittle or bias the model into narrow strategies, while under-shaping makes hacking easier. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))

## **4\) What is RLVR, and how is it different from using a reward model?**

RLVR usually means reinforcement learning with verifiable rewards. Instead of asking a learned reward model to score outputs, you use a verifier, tester, or environment that can check correctness more directly. The session gave examples like formatting checks, execution checks, regex-based checks, and environment rollouts.

This is powerful when correctness is externally testable. Code can be compiled and unit-tested. Math can often be checked against a final answer or symbolic verifier. Games can expose reward from the environment. Browser tasks can be checked by page state or task completion. In such cases, verifier-driven rewards are often more trustworthy than a purely learned scalar reward model.

TRL documents this broader environment-based training pattern, and OpenEnv is meant to standardize how such environments are defined and used. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))

## **5\) Why do RL environments matter for LLMs?**

Static prompt-response datasets are useful, but they are limited. Real deployments require models to interact with systems: codebases, browsers, files, APIs, games, tools, and simulators. RL environments let the model act, observe consequences, and keep going across multiple steps, which is much closer to real agent behavior. The session described environments as the bridge from isolated prompt solving to real-world interaction.

They also enable dynamic difficulty and richer feedback. Instead of training forever on a fixed set of prompts, the environment can generate or surface tasks that are more appropriate for the current model, which makes curriculum learning and continual challenge easier. This matches the broader "RL with environments" direction discussed in recent OpenEnv and TRL material. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))

For examples:

* BrowserGym for web-task environments. ([GitHub](https://github.com/servicenow/browsergym?utm_source=chatgpt.com))
* OpenEnv course and TRL integration docs. ([GitHub](https://github.com/huggingface/openenv-course?utm_source=chatgpt.com))

## **6\) What is OpenEnv, and why would a hackathon team use it?**

OpenEnv is an open-source framework for defining and interacting with RL environments for LLM and agent training. The session described it as a standardized interface around concepts like reset, step, state, observations, actions, and rewards, with deployment built around Hugging Face Spaces and containerized execution.

A hackathon team would use OpenEnv because it reduces environment plumbing. Instead of inventing a new interface for each task, you can standardize how the model talks to the environment and then connect that to a trainer like TRL. That means you spend more time on task design and rewards, and less on adapter glue. The session also highlighted `openenv init` for bootstrapping an environment skeleton quickly.

Good starting points:

* OpenEnv repo. ([GitHub](https://github.com/meta-pytorch/OpenEnv?utm_source=chatgpt.com))
* OpenEnv course. ([GitHub](https://github.com/huggingface/openenv-course?utm_source=chatgpt.com))
* TRL's OpenEnv integration guide. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))

## **7\) How does OpenEnv work at a high level?**

At a high level, an OpenEnv environment exposes a small set of standard operations:

* reset the environment,
* step the environment with an action,
* return observations, rewards, and state.

The session described OpenEnv environments as FastAPI applications that can be run locally, deployed on Hugging Face Spaces, or pulled as containers. That gives teams several options: they can use the remote environment directly, install client code from the repo, or run the environment locally through the container image.
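
In code, that contract looks roughly like the toy loop below. This is a hand-written approximation of the reset/step/observation shape, not the actual OpenEnv client API; consult the OpenEnv docs for the real class and method names.

```python
from dataclasses import dataclass

# Toy approximation of an OpenEnv-style environment: reset() starts an
# episode, step(action) advances it and returns observation + reward.

@dataclass
class Observation:
    message: str
    reward: float = 0.0
    done: bool = False

class CountdownEnv:
    """Reward the agent for waiting until the counter is about to hit zero."""
    def __init__(self, start: int = 3):
        self.start = start

    def reset(self) -> Observation:
        self.n = self.start
        return Observation(f"counter={self.n}")

    def step(self, action: str) -> Observation:
        self.n -= 1
        done = self.n <= 0
        good = (done and action == "go") or (not done and action == "wait")
        return Observation(f"counter={self.n}", 1.0 if good else 0.0, done)

env = CountdownEnv()
obs = env.reset()
total = 0.0
while not obs.done:
    action = "go" if "counter=1" in obs.message else "wait"  # trivial policy
    obs = env.step(action)
    total += obs.reward
print(total)
```

In a real setup the policy would be your LLM and the environment would live behind a server, but the reset/step loop is the same shape.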

This design is useful because it treats environments as portable, versioned software artifacts rather than ad hoc scripts. Hugging Face's own TRL docs describe OpenEnv similarly, including support for backend-server execution and standardized APIs. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))

## **8\) Where do TRL and Unsloth fit in this stack?**

TRL is the training library. It provides trainers and workflows for SFT, DPO, PPO, GRPO, reward modeling, and related post-training methods for transformer models. In a typical hackathon setup, TRL handles rollout collection, reward integration, optimization, logging, and trainer configuration. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))

Unsloth fits in as the acceleration and memory-efficiency layer for training and RL fine-tuning. The session described Unsloth as making RL training more efficient and inference faster, which matters because rollout generation often dominates runtime in RL loops. It also noted a practical QLoRA warning: don't naively upcast a 4-bit model to 16-bit and then merge adapters, because that can damage model quality; use the proper merge path instead.

Relevant docs:

* TRL docs and GRPO cookbook. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
* Unsloth repository/readme. ([GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1&utm_source=chatgpt.com))

## **9\) What is the difference between PPO and GRPO?**

PPO is a classic policy optimization algorithm that stabilizes updates by constraining how much the policy changes between iterations. It is one of the most influential RL algorithms in modern deep learning. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))

GRPO is a later group-relative variant used in LLM training that compares sampled outputs within a group to estimate relative advantage, and it is often discussed as a more memory-efficient alternative to full PPO-style setups in some LLM post-training pipelines. The session summarized GRPO as a more efficient version of PPO and specifically noted removing the value model from the setup.
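
The group-relative idea fits in a few lines: sample several completions per prompt, score them, and use each completion's reward relative to the group mean (normalized by the group's standard deviation) as its advantage. This is a simplified sketch of the advantage computation, not a full trainer:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled completion relative to its group.

    No learned value model is needed: the group itself provides the
    baseline, which is where GRPO's memory saving comes from.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                      # all completions scored the same
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four completions sampled for one prompt, scored by a verifier:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note the degenerate case: if every completion in the group gets the same reward (all fail or all succeed), the advantages are zero and that prompt contributes no gradient, which is one reason curriculum and task difficulty matter so much for GRPO.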

For deeper details:

* PPO paper. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
* DeepSeekMath / GRPO references via TRL paper index and cookbook. ([arXiv](https://arxiv.org/abs/2402.03300?utm_source=chatgpt.com))

## **10\) Why is RL often described as inefficient, yet still useful?**

RL is often inefficient because the feedback is sparse and delayed. A long rollout may end in one scalar reward, and that weak signal has to train many decisions. The session used a simple example: if a code answer fails at one line but you assign the same negative reward to every token, you're throwing away a lot of structure.

It is still useful because it can optimize behaviors where exact supervised targets are unavailable, too expensive, or too limiting. If you can verify success but cannot easily author perfect demonstrations for every scenario, RL can still improve the model by repeated interaction. This is why RL is especially attractive for code execution, tool use, games, browser tasks, and agent workflows.

A practical takeaway: use RL where verifiers exist and where exploration is worth the extra compute.

## **11\) What is process supervision, and why is it important?**

Process supervision means giving feedback on intermediate reasoning or intermediate steps, not only on the final outcome. The session contrasted this with assigning the same reward to every token in the answer, which can be very wasteful. Under process supervision, you try to identify which parts of a trace were good, irrelevant, or harmful.

This matters because not all failures are equal. Maybe the model chose the right algorithmic approach but made one implementation mistake. Final-outcome-only rewards blur that distinction. Step-aware rewards can improve sample efficiency and make debugging easier, though they also raise new risks if the step labels are noisy or exploitable.

The session also noted that process supervision is often approximated with humans or LLM-as-a-judge. That can help, but it creates another optimization target that itself may be gamed.

## **12\) What is reward hacking?**

Reward hacking is when the model finds a way to maximize reward without genuinely doing the intended task. In other words, the optimization succeeds, but the task specification failed. The session gave intuitive examples such as editing variables, bypassing intended checks, or exploiting quirks in the environment rather than solving the real problem.

This is the same phenomenon often called specification gaming. DeepMind describes it as agents exploiting flaws or ambiguities in the reward function, and Lilian Weng's overview covers how common and fundamental this problem is in RL systems. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))

A useful mindset is: reward hacking is not proof the model is "evil"; it is proof that optimization pressure found a loophole.

## **13\) How can a hackathon team reduce reward hacking in practice?**

Use strong verifiers. Prefer executable checks over stylistic heuristics. For code, run tests, time the solution, validate output shapes and edge cases, and isolate execution. For tool use, verify actual state transitions, not just verbal claims. The session repeatedly emphasized verifiers and environments over vague reward signals.

Monitor training actively. The session recommended sampling outputs periodically, looking for suspicious patterns, and terminating or rolling back runs when drift appears. It also suggested filtering bad responses and adding guardrails when patterns of exploitation are observed.

Use layered rewards. Combine success criteria with anti-cheat constraints. For example:

* pass tests,
* do not edit protected files,
* do not bypass timers,
* stay within time and memory budget,
* preserve task-required formatting,
* and log intermediate actions for audit.
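
One common way to wire such constraints is as hard gates rather than soft weights: any anti-cheat violation zeroes the reward, so an exploit never pays even if it makes the tests pass. A hypothetical sketch (the field names and limits are made up for illustration):

```python
# Hypothetical layered reward: success criteria only count when no
# anti-cheat constraint was violated during the rollout.

PROTECTED_FILES = {"tests/test_solution.py", "grader.py"}

def layered_reward(result: dict) -> float:
    violations = []
    if set(result["files_written"]) & PROTECTED_FILES:
        violations.append("edited protected file")
    if result["wall_time_s"] > 10.0:
        violations.append("over time budget")
    if violations:
        return 0.0  # hard gate: cheating never pays, whatever the tests say
    return result["tests_passed_fraction"]

honest = {"files_written": ["solution.py"], "wall_time_s": 2.1,
          "tests_passed_fraction": 0.75}
cheater = {"files_written": ["tests/test_solution.py"], "wall_time_s": 1.0,
           "tests_passed_fraction": 1.0}
print(layered_reward(honest), layered_reward(cheater))
```

Logging the `violations` list alongside the reward also gives you the audit trail the bullet list above asks for.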

This general strategy aligns with broader RL safety guidance on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))

## **14\) What is curriculum learning, and why does it help RL?**

Curriculum learning means controlling the order or difficulty of training tasks so the model learns from easier tasks first and gradually moves to harder ones. The session directly recommended this for RL: if tasks are too hard at the start, the model may never produce a successful rollout, which means the reward signal is effectively zero and learning stalls.

This is especially important in LLM RL because many tasks are long-horizon and brittle. An easier initial distribution can bootstrap behavior, after which harder tasks become reachable. In the RL literature more broadly, curriculum learning is a standard way to improve exploration and sample efficiency in difficult environments. ([arXiv](https://arxiv.org/pdf/2504.06618?utm_source=chatgpt.com))

Practical idea for hackathons:

* start with short horizons,
* fewer tools,
* simpler state spaces,
* stronger hints,
* easier test cases,
* then gradually remove scaffolding.
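
A minimal curriculum controller can be a success-rate gate: track recent rollout success at the current difficulty and advance only once it clears a threshold. The levels, window, and threshold below are made up for illustration:

```python
from collections import deque

# Illustrative curriculum gate: move to the next difficulty level once
# the recent success rate at the current level clears a threshold.

class Curriculum:
    def __init__(self, levels, window=20, threshold=0.6):
        self.levels = levels            # e.g. episode horizons, easy to hard
        self.idx = 0
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    @property
    def level(self):
        return self.levels[self.idx]

    def record(self, success: bool):
        self.recent.append(success)
        full = len(self.recent) == self.recent.maxlen
        if full and sum(self.recent) / len(self.recent) >= self.threshold:
            if self.idx < len(self.levels) - 1:
                self.idx += 1
                self.recent.clear()     # restart the stats at the new level

cur = Curriculum(levels=[4, 8, 16, 28])   # horizons from short to full
for _ in range(20):
    cur.record(True)                       # pretend the model keeps succeeding
print(cur.level)
```

The same gate works for any scaffold you want to remove gradually: hint strength, number of tools, or test-case difficulty instead of horizon length.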
|
| 158 |
+
|
| 159 |
+
## **15\) How do I know whether a task is suitable for RL?**
|
| 160 |
+
|
| 161 |
+
A task is a good candidate for RL if:
|
| 162 |
+
|
| 163 |
+
* you can verify success or partial progress,
|
| 164 |
+
* exploration is meaningful,
|
| 165 |
+
* multi-step interaction matters,
|
| 166 |
+
* and you do not already have abundant high-quality demonstrations.
|
| 167 |
+
|
| 168 |
+
The session highlighted a key rule of thumb: the probability of a good answer must be greater than zero. If the task is so hard that the model never stumbles into any rewarding behavior, RL will waste compute. That means task selection, warm starts, formatting scaffolds, or light SFT can be essential.
|
| 169 |
+
|
| 170 |
+
Good hackathon candidates include:
|
| 171 |
+
|
| 172 |
+
* code generation with executable tests,
|
| 173 |
+
* browser navigation with page-state checks,
|
| 174 |
+
* games with clear win conditions,
|
| 175 |
+
* API/tool workflows with verifiable side effects.
|
| 176 |
+
|
| 177 |
+
## **16\) Should we jump straight into RL, or do some SFT first?**
|
| 178 |
+
|
| 179 |
+
Usually, do some SFT or at least a warm start first. The sessionβs guidance was that pretraining carries most of the capability burden, SFT helps shape the behavior, and RL refines it. It explicitly argued against relying on RL alone from scratch for most practical settings.
|
| 180 |
+
|
| 181 |
+
That matches modern post-training stacks: pretrain heavily, align or instruct-tune, then apply preference optimization and/or RL where it adds value. TRLβs supported workflows reflect exactly this broader stack. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
|
| 182 |
+
|
| 183 |
+
A hackathon-friendly recipe is:
|
| 184 |
+
|
| 185 |
+
1. Start from a solid instruct model.
|
| 186 |
+
2. Add a tiny amount of task-format SFT if needed.
|
| 187 |
+
3. Build a strong verifier.
|
| 188 |
+
4. Use GRPO/PPO-style RL only after the model can at least occasionally succeed.
|
| 189 |
+
|
| 190 |
+
## **17\) What should we actually monitor during RL training?**
|
| 191 |
+
|
| 192 |
+
Monitor more than the headline reward. The session specifically called out tracking reward trends, component rewards, and whether important success columns are improving over time. It also recommended checking generated strategies and periodically sampling outputs during training rather than letting runs continue blindly.
|
| 193 |
+
|
| 194 |
+
Useful metrics include:
|
| 195 |
+
|
| 196 |
+
* average reward,
|
| 197 |
+
* verifier pass rate,
|
| 198 |
+
* timeout rate,
|
| 199 |
+
* format adherence,
|
| 200 |
+
* rollout length,
|
| 201 |
+
* diversity of successful solutions,
|
| 202 |
+
* frequency of suspicious shortcuts,
|
| 203 |
+
* and cost per useful trajectory.
|
| 204 |
+
|
| 205 |
+
If the average reward rises but the actual task quality drops or becomes brittle, that is often a reward-design problem rather than a model-capability problem.
|
| 206 |
+
|
| 207 |
+
## **18\) What is a strong hackathon strategy for building an RL environment fast?**
|
| 208 |
+
|
| 209 |
+
Pick a task with a crisp verifier. Build the smallest environment that exposes reset, step, observations, and reward. Use OpenEnv to standardize the interface and TRL to handle training. Use Unsloth if you need to fit training into tighter hardware budgets. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))

A practical sequence:

1. Define the task and what "success" means.
2. Write the verifier before writing the policy loop.
3. Create a few toy tasks the model can solve.
4. Add curriculum or easier variants first.
5. Run small-scale debugging before long training.
6. Sample outputs constantly for reward hacking.
7. Only then scale rollouts and environment diversity.
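The "smallest environment that exposes reset, step, observations, and reward" can be very small indeed. The sketch below mirrors the general shape of that loop with a toy guessing task; it is not the actual OpenEnv API, just a stand-in for wiring up and testing the verifier before any policy exists:

```python
import random

class GuessEnv:
    """Toy environment exposing a reset/step/observation/reward surface."""

    def __init__(self, low=0, high=9, max_steps=5, seed=0):
        self.low, self.high, self.max_steps = low, high, max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.target = self.rng.randint(self.low, self.high)
        self.steps = 0
        return {"prompt": f"Guess an integer in [{self.low}, {self.high}]."}

    def _verify(self, action):
        # Verifier written before any policy loop: a hard, binary success check.
        return action == self.target

    def step(self, action):
        self.steps += 1
        success = self._verify(action)
        done = success or self.steps >= self.max_steps
        hint = "correct" if success else ("low" if action < self.target else "high")
        reward = 1.0 if success else 0.0
        return {"hint": hint}, reward, done

env = GuessEnv()
env.reset()
obs, reward, done = env.step(env.target)  # scripted "oracle" baseline
print(reward, done)  # 1.0 True
```

Running scripted baselines like the oracle above (and a random policy) against the environment is a fast way to validate steps 1-3 before any model is involved.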

## **19) What are good starter resources for participants?**

For TRL:

* Main docs. ([Hugging Face](https://huggingface.co/docs/trl/index?utm_source=chatgpt.com))
* PPO trainer docs. ([Hugging Face](https://huggingface.co/docs/trl/ppo_trainer?utm_source=chatgpt.com))
* GRPO cookbook. ([Hugging Face](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl?utm_source=chatgpt.com))
* Paper index for GRPO/DeepSeekMath references. ([Hugging Face](https://huggingface.co/docs/trl/paper_index?utm_source=chatgpt.com))

For OpenEnv:

* OpenEnv GitHub repo. ([GitHub](https://github.com/meta-pytorch/OpenEnv?utm_source=chatgpt.com))
* OpenEnv course. ([GitHub](https://github.com/huggingface/openenv-course?utm_source=chatgpt.com))
* TRL's OpenEnv integration docs. ([Hugging Face](https://huggingface.co/docs/trl/openenv?utm_source=chatgpt.com))

For environments and benchmarks:

* BrowserGym. ([GitHub](https://github.com/servicenow/browsergym?utm_source=chatgpt.com))

For reward design and failure modes:

* DeepMind on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/?utm_source=chatgpt.com))
* Lilian Weng on reward hacking. ([Lil'Log](https://lilianweng.github.io/posts/2024-11-28-reward-hacking/?utm_source=chatgpt.com))

For RL algorithms:

* PPO paper. ([arXiv](https://arxiv.org/abs/1707.06347?utm_source=chatgpt.com))
* DeepSeekMath / GRPO paper. ([arXiv](https://arxiv.org/abs/2402.03300?utm_source=chatgpt.com))

For Unsloth:

* Unsloth repo/readme. ([GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1&utm_source=chatgpt.com))

## **20) What is the one-sentence summary participants should remember?**

If you can build a task where success is verifiable, difficulty is controllable, and loopholes are monitored, RL can turn an LLM from "good at answering" into "better at acting."

### **21) What is RLVR?**

RLVR stands for reinforcement learning with verifiable rewards. Instead of relying only on a learned reward model or human preference model, the training loop uses programmatic checks to determine whether an output is correct. Typical examples include exact-answer checks for math, unit tests for code, schema validation for structured output, or environment-based task completion checks. This makes RLVR especially attractive for domains where correctness can be verified automatically and consistently. ([Label Studio](https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/))
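Two of the typical verifiable-reward checks mentioned above (exact-answer matching and schema validation) can be sketched in a few lines. These helper names and the normalization choices are illustrative, not a standard API:

```python
import json

def math_reward(completion: str, gold: str) -> float:
    """Binary reward: exact match on the final answer after light normalization."""
    return 1.0 if completion.strip().rstrip(".") == gold.strip() else 0.0

def schema_reward(completion: str, required_keys: set) -> float:
    """Binary reward: output must be valid JSON containing the required keys."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and required_keys <= obj.keys() else 0.0

print(math_reward("42.", "42"))                                   # 1.0
print(schema_reward('{"name": "a", "age": 3}', {"name", "age"}))  # 1.0
print(schema_reward("not json", {"name"}))                        # 0.0
```

Both checks are deterministic and cheap, which is exactly what makes RLVR scale: the same verifier can score millions of rollouts with no drift.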

### **22) What is RLVE?**

RLVE is reinforcement learning with verifiable environments. The key idea is to train on environments that can procedurally generate tasks, expose adjustable difficulty, and provide algorithmically verifiable rewards. Recent work on adaptive verifiable environments argues that static prompt datasets often become either too easy or too hard during training, causing learning to stall, while adaptive environments keep the model near its capability frontier. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **23) How is RLVE different from RLVR?**

RLVR usually refers to verifiable rewards on a fixed or semi-fixed set of prompts or problems. RLVE goes a step further by making the task source itself dynamic: the environment can generate new problems, vary difficulty, and keep serving appropriately challenging tasks as the model improves. In practice, RLVE is often better for preventing saturation on static datasets and for building curriculum naturally into training. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **24) Why are RL environments useful for LLM post-training?**

They let the model interact, not just answer. In a real environment, the model can act, observe consequences, act again, and get reward from actual task outcomes. That makes environments a better fit for tool use, browsers, APIs, coding agents, games, and long-horizon tasks than plain prompt-response datasets. Hugging Face's OpenEnv and TRL material reflects this shift toward environment-based agent training. ([Hugging Face](https://huggingface.co/blog/openenv-turing))

### **25) Where do TRL, GRPO, and Unsloth fit in?**

TRL is the training framework that provides RL trainers and infrastructure for post-training transformer models, including GRPO. GRPO is the RL optimization method popularized in DeepSeekMath and now widely used in open LLM RL pipelines because it can be more memory-efficient than PPO-style setups in this context. Unsloth is typically used as the efficiency layer to make fine-tuning and RL training faster and more affordable on limited hardware. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))
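The glue between these pieces is the reward function: in TRL's GRPO trainer, it is a plain Python callable that receives the sampled completions (plus dataset columns as keyword arguments) and returns one float per completion. A hedged sketch of that contract, assuming standard string completions and an illustrative format check:

```python
def format_reward(completions, **kwargs):
    """Reward callable in the shape TRL's GRPOTrainer expects:
    a list of completions in, a list of floats out (one per completion).
    The check itself (require an <answer>...</answer> tag) is illustrative.
    """
    return [1.0 if "<answer>" in c and "</answer>" in c else 0.0
            for c in completions]

# The trainer would then be wired up roughly as (not run here):
#   from trl import GRPOConfig, GRPOTrainer
#   trainer = GRPOTrainer(model=..., reward_funcs=[format_reward],
#                         args=GRPOConfig(...), train_dataset=...)
print(format_reward(["<answer>4</answer>", "no tags"]))  # [1.0, 0.0]
```

Because reward functions are ordinary callables, you can unit-test them in isolation, which pays off heavily when debugging later pipeline stages.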

### **26) Why do rewards matter so much?**

Because the reward is the task definition as far as optimization is concerned. If your reward captures the real objective, RL can improve useful behavior. If your reward is incomplete, noisy, or hackable, the model will optimize the proxy instead of the real task. DeepMind's write-up on specification gaming makes this point very clearly: the agent's ingenuity is helpful only when the specification is correct. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))

### **27) What is reward engineering?**

Reward engineering is the design of the reward function, the verifier, the shaping terms, the penalties, and the monitoring strategy. In LLM RL, this includes deciding what counts as success, how partial progress is rewarded, what shortcuts are forbidden, and how to detect reward hacking. OpenEnv's reward-design guide explicitly warns about reward hacking, sparse rewards, and conflicting signals as common pitfalls. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))

### **28) What is reward hacking?**

Reward hacking happens when a model finds a way to maximize the reward without actually doing the intended task. DeepMind describes this as specification gaming: the system satisfies the literal reward but not the real goal. Classic causes include poorly designed shaping rewards, missing constraints in the success condition, and simulator or verifier loopholes. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))

### **29) Why is sparse reward a common problem?**

If successful trajectories are too rare, the model may never get enough positive signal to improve. OpenEnv's docs explicitly call sparse rewards a common pitfall because the agent may never find positive signal. RLVE work similarly notes that overly difficult tasks can yield consistently poor rewards and stall gradient-based learning. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))

### **30) Why can dense rewards also be dangerous?**

Dense rewards can speed up learning, but they can also create local optima and incentive misalignment. OpenEnv recommends starting simple and shaping carefully, because intermediate rewards can steer the model toward proxy behaviors. DeepMind gives the broader warning that poorly designed shaping can change the optimal policy itself rather than just helping the model reach the intended outcome faster. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))

---

## **Common Pitfalls in Building RL Environments**

### **31) What is the most common mistake when designing an RL environment?**

Making the environment easy to verify but not faithful to the real task. A verifier that checks only the final string, a regex, or a narrow success pattern may be convenient, but it often misses equivalent correct answers or allows degenerate shortcuts. Recent verifier analysis on mathematical RL found that rule-based verifiers often reject correct but differently formatted answers, while model-based verifiers can be exploited to produce false positives during RL. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **32) What goes wrong with weak verifiers?**

Two opposite failure modes are common. Rule-based verifiers can be too brittle and produce false negatives when the answer is correct but phrased differently. Model-based verifiers can be too permissive and produce false positives that the policy learns to exploit. The verifier study on mathematical reasoning reports both problems and shows that stronger policies make verifier weaknesses more obvious. ([arXiv](https://arxiv.org/html/2505.22203v1))
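The brittle-rule-based failure mode is often reducible with an answer-normalization layer that maps equivalent spellings to one canonical form before comparison. A minimal sketch; it handles only a few common cases (whitespace, trailing periods, `$` wrappers, fractions vs. decimals), whereas production graders go much further:

```python
from fractions import Fraction

def normalize(ans: str):
    """Map equivalent answer spellings to one canonical form."""
    s = ans.strip().strip("$").rstrip(".").replace(" ", "")
    try:
        # "0.5" and "1/2" both become Fraction(1, 2).
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return s.lower()

def verify(completion: str, gold: str) -> bool:
    return normalize(completion) == normalize(gold)

print(verify(" $1/2$ ", "0.5"))  # True
print(verify("0.50", "1/2"))     # True
print(verify("3/4", "0.5"))      # False
```

The design choice here is to spend effort on the false-negative side (rule brittleness) without opening the false-positive side: every accepted pairing still has to be an exact mathematical equality.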

### **33) Why is "just use an LLM as judge" often risky?**

Because the judge becomes part of the optimization target. If the policy can find surface patterns that fool the judge, training can inflate reward without improving real task quality. That is exactly why model-based verifiers, despite better static accuracy, can be vulnerable during RL training. Use them carefully, stress-test them, and combine them with hard checks whenever possible. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **34) What is a common environment-design pitfall for tool-using agents?**

Not modeling realistic failure modes. Real APIs fail because of permissions, invalid formats, missing fields, timezones, or bad parameters. Hugging Face's OpenEnv blog highlights examples like missing OAuth scopes and bad RFC3339 datetime formatting. If the environment hides these realities, the resulting policy will be overfit to a toy setup and brittle in deployment. ([Hugging Face](https://huggingface.co/blog/openenv-turing))

### **35) Why is static task difficulty a problem?**

Because the learning signal collapses at both extremes. Tasks that are too easy stop teaching the model anything useful. Tasks that are too hard yield near-zero reward and also stop teaching. RLVE was proposed largely to solve this problem by dynamically adjusting task difficulty as the policy improves. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **36) What is a common pitfall in environment diversity?**

Training on too few task types. Recent RLVE results argue that scaling the number of environments improves generalizable reasoning capability, and Reasoning Gym was built around procedurally generated tasks across many domains for exactly this reason. A narrow environment set often produces narrow competence and fragile transfer. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **37) Why do many RL environments fail to transfer to real-world performance?**

Because they optimize the wrong abstraction level. If the environment is too toy-like, omits realistic constraints, or over-simplifies tool feedback, the model may become good at the benchmark but not at the actual workflow. This is a practical version of specification gaming: the benchmark is solved, the real job is not. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))

---

## **Common Pitfalls in Reward Engineering**

### **38) What is the biggest reward-engineering mistake?**

Using a proxy metric as if it were the goal. Goodhart-style failures are everywhere in RL: token count, response format, test count, or intermediate progress can all become targets the model exploits. DeepMind's examples of shaping mistakes and reward misspecification are the canonical warning here. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))

### **39) Should I start with a complicated reward function?**

Usually no. OpenEnv explicitly recommends starting simple, often with sparse success/failure reward, before layering in shaping terms. This makes debugging easier and reduces the chance that the model learns the wrong intermediate incentives before it learns the actual task. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))

### **40) What happens when reward components conflict?**

Learning becomes unstable or confused. OpenEnv lists conflicting signals as a common pitfall: if one term rewards brevity, another rewards verbosity, a third rewards format, and a fourth rewards exploration, the policy may oscillate or learn brittle shortcuts instead of coherent behavior. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))

### **41) Why is binary reward often appealing?**

Because it is easy to reason about and harder to game superficially. Label Studio's RLVR overview notes that verifiable rewards are often binary and directly tied to correctness criteria, which makes evaluation simple and scalable. Binary reward is not always sufficient, but it is often a good starting point for precision-critical tasks like code and math. ([Label Studio](https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/))

### **42) Why is binary reward sometimes not enough?**

Because it can be too sparse, especially for long-horizon tasks. If success only happens at the very end, the model may not learn at all. That is where carefully designed shaping, step-level evaluation, or adaptive curriculum can help, but only if you can add them without creating easy-to-game shortcuts. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))

### **43) How do I know whether my reward is being hacked?**

Watch for rising reward without corresponding task-quality gains. Typical signs are strange formatting habits, repetitive surface patterns, degenerate short solutions, suspiciously high judge scores, or solutions that pass weak checks but fail stronger ones. The verifier case study is a strong reminder that static verification accuracy is not enough; you must observe what happens under optimization pressure. ([arXiv](https://arxiv.org/html/2505.22203v1))
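One cheap automated backstop is to track training reward alongside a held-out evaluator and flag when the two diverge. A minimal sketch; the windowing and thresholds are illustrative assumptions:

```python
def divergence_alert(train_rewards, holdout_scores, window=3, tol=0.0):
    """Flag when training reward is rising while held-out quality is not.

    Compares the mean of the last `window` values against the previous
    window for each series.
    """
    def trend(xs):
        recent = sum(xs[-window:]) / window
        prior = sum(xs[-2 * window:-window]) / window
        return recent - prior

    return trend(train_rewards) > tol and trend(holdout_scores) <= tol

train = [0.2, 0.3, 0.4, 0.6, 0.7, 0.8]    # training reward keeps climbing
holdout = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]  # real task quality is flat
print(divergence_alert(train, holdout))   # True
```

An alert like this does not prove hacking, but it tells you exactly when to go read sampled rollouts by hand.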

### **44) What is a safe pattern for reward engineering?**

Use layered verification. Start with hard outcome checks. Add anti-cheat constraints. Then add minimal shaping only where the sparse reward is too weak. Keep a holdout evaluator separate from the training reward when possible. This matches both OpenEnv's "start simple, shape carefully" guidance and DeepMind's warning about shaping altering the true objective. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
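The three layers can be composed in a single reward function. A sketch for a code-repair task; the banned-pattern list, the shaping check, and the weights are all illustrative assumptions:

```python
def layered_reward(completion: str, passed_tests: bool) -> float:
    """Layered reward: hard outcome check, then anti-cheat constraints,
    then a small shaping bonus that stays dominated by the outcome term.
    """
    # Layer 1: hard outcome check dominates everything.
    if not passed_tests:
        return 0.0
    # Layer 2: anti-cheat constraints can veto a "passing" solution.
    banned = ("import os", "exit(0)", "pytest.skip")
    if any(b in completion for b in banned):
        return 0.0
    # Layer 3: minimal shaping, kept much smaller than the outcome term.
    shaping = 0.1 if "def " in completion else 0.0
    return 1.0 + shaping

print(layered_reward("def f(): return 1", passed_tests=True))  # 1.1
print(layered_reward("pytest.skip", passed_tests=True))        # 0.0
```

The ordering matters: shaping is only ever added on top of a verified, non-cheating success, so it cannot create a gradient toward shortcuts on its own.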

---

## **Common Pitfalls in RL Post-Training Pipelines with RLVR / RLVE / GRPO**

### **45) What is a common mistake in GRPO training runs?**

Using RL before the base model is ready. GRPO is powerful, but it is a post-training method, not a substitute for capability. TRL's own GRPO examples start from instruct models and task datasets rather than from weak base checkpoints. If the model almost never produces a correct rollout, the reward signal is too sparse for productive RL. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))

### **46) Why does RL post-training plateau?**

Because the model saturates the available prompt distribution or the reward signal no longer differentiates useful improvements. RLVE explicitly frames static data saturation as a problem and shows that adaptive environments can keep learning going after conventional RLVR pipelines flatten out. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **47) Why can "more RL" make a model worse?**

Because optimization pressure amplifies whatever the reward favors, including undesirable shortcuts. If the verifier is noisy, if the environment is unrealistic, or if the reward overvalues superficial structure, more training can push the model deeper into those artifacts rather than improving real competence. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **48) What is a common pitfall in RLVR datasets?**

Finite, static datasets get stale. Once the model has mastered or overfit their distribution, additional RL yields little signal. RLVE work argues that procedurally generated environments with adjustable difficulty are one way around this limitation. Reasoning Gym makes a similar case for unlimited data generation with controllable complexity. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **49) Why do identical-looking GRPO runs produce different outcomes?**

Because RL is highly sensitive to rollout quality, verifier behavior, reward scaling, task mix, generation parameters, and environment bugs. Even if the trainer code is the same, small differences in reward computation or environment behavior can change optimization dynamics substantially. The verifier study is a good reminder that the reward pipeline itself is part of the model. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **50) What is a common pitfall when mixing many environments?**

Using an unbalanced mixture. If some environments are much easier, much denser in reward, or much shorter in trajectory length, they can dominate training and starve harder but more important environments. RLVE's adaptive-difficulty framing exists partly to keep the training distribution informative instead of letting it collapse into easy tasks. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **51) Why are long-horizon tasks especially hard in RL post-training?**

Because reward arrives late and useful trajectories are rare. Long tasks need either decomposition, better intermediate signals, stronger initialization, or curriculum. Otherwise, the rollout cost is high and the success rate stays near zero. This is one reason why adaptive environments and procedural curricula are getting attention. ([arXiv](https://arxiv.org/html/2511.07317v1))

### **52) What monitoring mistake do teams make most often?**

They monitor the training reward but not actual behavior. Reward alone is not enough because the reward channel can be flawed. You need sampled rollout audits, stronger offline evaluation, and held-out environments or benchmarks. The verifier case study shows why this matters: reward can rise while real quality does not. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **53) What is the safest way to structure an RL post-training pipeline?**

A good pattern is to start from a strong instruct or SFT checkpoint, use a task with a strong verifier, begin with simple reward, validate the environment thoroughly, run small-scale debug experiments, audit rollouts manually, then scale training and only later add curriculum or more shaping. This is consistent with TRL's practical GRPO examples, OpenEnv's reward guidance, and the lessons from verifier-failure studies. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))

---

## **Practical "What should we do in a hackathon?" FAQs**

### **54) What kind of project is most likely to succeed in a hackathon?**

Pick a task with:

* a clear success condition,
* a verifier you trust,
* short to medium trajectory length,
* few external dependencies,
* and adjustable difficulty.

Good examples are code repair with tests, structured extraction with schema validation, grid or puzzle games, tool-using workflows with exact state checks, and browser tasks with explicit completion criteria. These are the sweet spot for RLVR and lightweight RLVE prototypes. ([Label Studio](https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/))

### **55) What should we avoid building?**

Avoid tasks that are subjective, hard to verify, require massive infrastructure, or depend heavily on an LLM judge without hard backstops. Also avoid environments whose failure cases you do not understand. If you cannot explain how the reward could be hacked, you are not ready to optimize it yet. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **56) What is the best debugging order?**

1. First debug the environment manually.
2. Then debug the verifier.
3. Then run scripted baseline policies.
4. Then run a frozen model.
5. Then run a tiny RL experiment.
6. Only then scale.

This order isolates bugs early and prevents you from blaming the optimizer for what is really an environment or reward bug. It follows directly from the fact that verifier reliability is foundational in RLVR. ([arXiv](https://arxiv.org/html/2505.22203v1))

### **57) What is one rule the team should remember?**

Do not optimize a reward you have not tried to break yourself first. The easiest way to avoid reward hacking is to adversarially test your environment and reward design before the model does. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))

---

## **58) Strong references for deeper learning**

For GRPO and TRL:

* TRL GRPO Trainer docs. ([Hugging Face](https://huggingface.co/docs/trl/grpo_trainer))
* Hugging Face GRPO cookbook. ([Hugging Face](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl))

For RL environments and reward design:

* OpenEnv reward design guide. ([Meta-PyTorch](https://meta-pytorch.org/OpenEnv/guides/rewards.html))
* OpenEnv tool-using environment examples. ([Hugging Face](https://huggingface.co/blog/openenv-turing))

For pitfalls and failure modes:

* DeepMind on specification gaming. ([Google DeepMind](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/))
* Pitfalls of rule-based and model-based verifiers. ([arXiv](https://arxiv.org/html/2505.22203v1))

For scalable environment-based training:

* RLVE paper on adaptive verifiable environments. ([arXiv](https://arxiv.org/html/2511.07317v1))
* Reasoning Gym. ([OpenReview](https://openreview.net/forum?id=GqYSunGmp7&referrer=%5Bthe+profile+of+Oliver+Stanley%5D%28%2Fprofile%3Fid%3D~Oliver_Stanley1%29))

Here are solid Unsloth RL post-training recipes worth checking out, with a bias toward official or close-to-official examples.

### **59) Core Unsloth GRPO recipes**

**Qwen2.5 (3B) GRPO notebook**
A straightforward starter recipe for GRPO with Unsloth. It covers data prep, training, inference, and saving, so it is a good baseline if you want the least opinionated end-to-end example. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Qwen2.5_%283B%29-GRPO.ipynb?utm_source=chatgpt.com))

**Llama 3.1 (8B) GRPO notebook**
Same general pattern, but on a larger model family. Useful if you want a more realistic "reasoning/capability uplift" recipe without jumping straight to very large models. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Llama3.1_%288B%29-GRPO.ipynb?utm_source=chatgpt.com))

**Gemma 3 (1B) GRPO notebook**
A smaller-scale recipe that is easier to run and debug. Good for iterating on reward functions and rollout settings before spending more compute on larger checkpoints. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Gemma3_%281B%29-GRPO.ipynb?utm_source=chatgpt.com))

### **59.1) Advanced Unsloth GRPO recipes**

**Advanced Qwen3 (4B) GRPO notebook**
This is one of the more interesting recipes because it adds more than the bare trainer loop. Unsloth's June 2025 discussion explicitly calls out: proximity scoring for more nuanced rewards, OpenR1 dataset support, advanced templates, and "prefinetuning to skip GRPO format learning." That makes it a better recipe when you care about reward shaping and format bootstrapping, not just getting GRPO to run. ([GitHub](https://github.com/unslothai/unsloth/discussions/2810?utm_source=chatgpt.com))

**HF LLM Course: Practical Exercise - GRPO with Unsloth**
Not an Unsloth-maintained notebook repo entry, but it is a structured learning recipe that uses Unsloth specifically to fine-tune a model with GRPO for reasoning. It is a good companion when you want a didactic walkthrough instead of just notebook cells. ([Hugging Face](https://huggingface.co/learn/llm-course/chapter12/6?utm_source=chatgpt.com))

### **59.2) Environment / agent-style RL recipes**

**GPT-OSS 20B + 2048 game RL notebook**
This is closer to "RL with an environment" than plain static-prompt RLVR. The notebook goal is explicitly to make GPT-OSS play 2048 with reinforcement learning / GRPO, which makes it a useful recipe if you want to move beyond math/code answer verification into interactive environment training. ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/gpt_oss_%2820B%29_Reinforcement_Learning_2048_Game_BF16.ipynb?utm_source=chatgpt.com))

### **59.3) Broader recipe collection**

**Unsloth notebooks repository**
The main repo currently advertises "250+ Fine-tuning & RL Notebooks," including GRPO and reinforcement learning notebooks. If you want the widest set of recipes in one place, this is the best starting point. ([GitHub](https://github.com/unslothai/notebooks?utm_source=chatgpt.com))

### **59.4) Useful adjacent recipes and examples**

**Scheduler GRPO example using Unsloth**
A community example that trains a scheduling model with GRPO using Unsloth and QLoRA. It is useful because it shows a non-math, non-code structured-output task where rewards are tied to output format and schedule correctness. ([Hugging Face](https://huggingface.co/blog/anakin87/qwen-scheduler-grpo?utm_source=chatgpt.com))

**SFT → GRPO pipeline example**
There is a community "show and tell" example for a full SFT-then-GRPO pipeline. Treat it as inspiration rather than an official recipe, but it is valuable if your intended workflow is "teach format first, then do RL." ([GitHub](https://github.com/unslothai/unsloth/discussions/3407?utm_source=chatgpt.com))

### **59.5) What these recipes collectively cover**

Across these examples, the main recipe patterns are:

* plain GRPO on reasoning-style tasks,
* GRPO with better reward shaping like proximity scoring,
* pre-SFT or preformatting before RL,
* QLoRA-based memory-efficient RL fine-tuning,
* and environment-style RL with game interaction. ([GitHub](https://github.com/unslothai/unsloth/discussions/2810?utm_source=chatgpt.com))

### **59.6) Two gaps to keep in mind**

One gap is **multi-turn GRPO with stepwise rewards**. There is a feature request asking for reward on each step plus a final reward, which suggests this is not yet a mature first-class recipe in Unsloth. ([GitHub](https://github.com/unslothai/unsloth/issues/3615?utm_source=chatgpt.com))

Another gap is **notebook stability across versions/hardware**. Several issue threads mention breakage or edge cases in GRPO notebooks, including fast-inference assumptions, VRAM growth, and vision-GRPO issues. That does not make the recipes unusable, but it does mean you should pin versions and test on a small run first. ([GitHub](https://github.com/unslothai/unsloth/issues/2730?utm_source=chatgpt.com))

### **59.7) Best recipes by use case**

If you want the simplest starting point:

* Qwen2.5 (3B) GRPO
* Gemma 3 (1B) GRPO ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/Qwen2.5_%283B%29-GRPO.ipynb?utm_source=chatgpt.com))

If you care about reward engineering:

* Advanced Qwen3 (4B) GRPO ([GitHub](https://github.com/unslothai/unsloth/discussions/2810?utm_source=chatgpt.com))

If you care about environment-style RL:

* GPT-OSS 20B 2048 notebook ([GitHub](https://github.com/unslothai/notebooks/blob/main/nb/gpt_oss_%2820B%29_Reinforcement_Learning_2048_Game_BF16.ipynb?utm_source=chatgpt.com))

If you want the most guided learning path:

* HF practical exercise with Unsloth + GRPO ([Hugging Face](https://huggingface.co/learn/llm-course/chapter12/6?utm_source=chatgpt.com))

## Additional Resources

* OpenEnv Core (an interface library for RL post-training with environments)
  * [https://github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
* OpenEnv PyTorch docs
  * [https://meta-pytorch.org/OpenEnv/](https://meta-pytorch.org/OpenEnv/)
* Hugging Face OpenEnv Environments Hub
  * [https://huggingface.co/openenv](https://huggingface.co/openenv)
  * [https://huggingface.co/openenv/spaces](https://huggingface.co/openenv/spaces)
* Tutorials to build, run, and train RL environments and training pipelines
  * [https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial](https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial)
  * RL training examples: [https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples](https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples)
  * RL environment examples: [https://github.com/meta-pytorch/OpenEnv/tree/main/envs](https://github.com/meta-pytorch/OpenEnv/tree/main/envs)
* A few additional YouTube videos on building RL environments:
  * [https://www.youtube.com/watch?v=0airz7BhBiA](https://www.youtube.com/watch?v=0airz7BhBiA)
  * [https://www.youtube.com/watch?v=ap4q4sAK4OY](https://www.youtube.com/watch?v=ap4q4sAK4OY)
  * [https://www.youtube.com/watch?v=Jew4lhAiqnw](https://www.youtube.com/watch?v=Jew4lhAiqnw)
* [https://openenv-india-apr-2026.lovable.app/](https://openenv-india-apr-2026.lovable.app/) **(Recommended: chaptered lectures)**
|
docs/references/hackathon_checklist.md
ADDED
@@ -0,0 +1,153 @@
# Hackathon Checklist – April 25–26, Bangalore
**Solo participant: Akhil Soni**

---

## What Judges Want to See
(From the Help Guide and Discord FAQ – judges are Meta/HuggingFace practitioners)

1. **Working environment** – reset/step runs cleanly, rewards are sensible
2. **Multiple independent reward functions** – not a single score
3. **Evidence the model improved** – reward curve going up, before/after comparison
4. **Anti-hacking measures** – the agent can't exploit the environment
5. **Reproducible deployment** – an HF Space that anyone can hit
6. **Sharp demo** – baseline attempt → reward output → trained attempt → measurable improvement

---

## Before the Venue (April 24 – today)

### Environment
- [x] Round 2 hidden variables designed (HV1 Circadian, HV2 Energy Cliff, HV3 Meltdown)
- [x] Gradio UI running locally (http://localhost:7862)
- [ ] **Implement HV1, HV2, HV3 in `server/rhythm_environment.py`**
- [ ] Add `PersonProfile` enum and `task_type` to `models.py`
- [ ] Verify the grader score still works correctly after the HV changes
- [ ] Add anti-hacking guard: cap consecutive breaks, penalize action spam

### Reward Functions (multi-layer – for GRPO)
- [ ] `reward_format_valid` – did the LLM output a parseable action?
- [ ] `reward_action_legal` – is the chosen action valid given the current state?
- [ ] `reward_env_step` – the actual `obs.reward` from `env.step(action)`

### Training Setup
- [ ] Write `training/dataset.py` – generate episode observation prompts
- [ ] Write `training/train.py` – GRPO trainer config (use the template in `unsloth_grpo_training_template.md`)
- [ ] Write `training/inference_eval.py` – baseline run + trained run comparison

---

## Day 1 at Venue (April 25 – morning priority)

### 1. Deploy to HF Space FIRST (before training)
Judges expect a running Space. Do this before anything else.

```bash
# Push environment as HF Space
openenv push  # or manual push to HuggingFace
```

- [ ] Environment runs on HF Space
- [ ] `reset()` and `step()` work remotely
- [ ] Space URL noted and shared with mentors

### 2. Verify the RL Loop End-to-End (locally first)
```
prompt → LLM → action → env.step() → reward → GRPO update
```
- [ ] Full loop runs without crashing
- [ ] Reward goes to console/log
- [ ] At least one successful episode (non-zero reward)

### 3. Run Baseline (before training)
- [ ] Run 10–20 episodes with the untrained model
- [ ] Log the average grader score
- [ ] Save the baseline reward curve
- [ ] Screenshot or record the Gradio UI showing baseline behavior

---

## Day 1 at Venue (afternoon)

### 4. Training – Start Small
- [ ] Train on the `easy` scenario first (100–200 steps)
- [ ] Confirm reward is going up (not flat or crashing)
- [ ] Check generated actions – look for reward-hacking patterns
- [ ] If reward is flat: simplify the prompt, check reward functions individually

### 5. Anti-Hacking Checks
The model may learn to spam TAKE_BREAK (low stress = less penalty).
Guards are already partially in the code – verify that these work:
- [ ] `consecutive_breaks > MAX_FREE_BREAKS` → penalty applies
- [ ] `IDLE_PENALTY` fires when no task is active
- [ ] The model can't "know" hidden-variable thresholds (they're not in the obs)
- [ ] Test with a greedy exploit agent manually
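The first two guards can be sketched as a small reward adjustment applied inside `step()`. This is a minimal illustration only: the constants and parameter names below are placeholders, not the environment's actual code (only `MAX_FREE_BREAKS` and `IDLE_PENALTY` appear in the checklist above).

```python
# Sketch of the anti-hacking guards; constants and names are illustrative.
MAX_FREE_BREAKS = 2        # breaks beyond this start costing reward
BREAK_SPAM_PENALTY = -0.5  # assumed magnitude
IDLE_PENALTY = -0.2        # assumed magnitude

def apply_guards(reward, action_type, consecutive_breaks, current_task):
    """Adjust a raw step reward to close the obvious exploits."""
    if action_type == "TAKE_BREAK":
        consecutive_breaks += 1
        if consecutive_breaks > MAX_FREE_BREAKS:
            # Spamming breaks to dodge stress penalties is punished.
            reward += BREAK_SPAM_PENALTY
    else:
        consecutive_breaks = 0
    if current_task is None and action_type != "TAKE_BREAK":
        # No active task and not resting: idle penalty fires.
        reward += IDLE_PENALTY
    return reward, consecutive_breaks
```

A greedy break-spamming agent now sees its third consecutive break penalized, which is exactly the manual exploit test in the last checkbox.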

---

## Day 2 at Venue (April 26)

### 6. Full Training Run
- [ ] Train on `easy` → `medium` → `hard` (curriculum)
- [ ] 500–1000 total GRPO steps
- [ ] Monitor: `reward/mean`, `reward/std`, KL divergence, per-reward-function scores
- [ ] Save a checkpoint every 100 steps

### 7. Save Model Correctly
**Warning:** Do NOT upcast the 4-bit model to 16-bit and merge LoRA naively – it damages quality.
```python
# Correct save
model.save_pretrained_merged("outputs/rhythmenv_trained", tokenizer, save_method="merged_16bit")
# Or keep adapters separate
model.save_pretrained("outputs/adapters")
tokenizer.save_pretrained("outputs/adapters")
```
- [ ] Model saved correctly
- [ ] Post-training inference tested immediately after saving

### 8. Build the Demo
Format: **baseline → trained → measurable improvement**

```
1. Show baseline: untrained model playing the easy scenario → grader score ~0.2
2. Show the reward curve: 500 steps, reward trending up
3. Show trained: model playing the same scenario → grader score ~0.6+
4. Explain the hidden variables: why the model had to discover them
5. Show person-profile inference: does the model behave differently for MORNING_PERSON vs NIGHT_OWL?
```

- [ ] Gradio UI shows the before/after comparison
- [ ] Reward-curve screenshot/chart ready
- [ ] 3-minute pitch rehearsed (see `docs/round2/pitch_framing.md`)

---

## Submission Checklist

- [ ] HF Space deployed and running
- [ ] `inference.py` updated for the trained model (correct `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`)
- [ ] README updated with the Round 2 description
- [ ] Reward curves saved as images
- [ ] Model pushed to the HF Hub

---

## Quick Reference – Key Numbers

| Thing | Value |
|---|---|
| Max steps per episode | 20 |
| Scenarios | easy / medium / hard |
| Grader weights | 40% completion, 20% deadline, 15% efficiency, 10% energy, 15% stress |
| GRPO starting lr | 2e-4 |
| GRPO num_generations | 4 (more than the 2048 notebook – hidden vars need exploration) |
| GRPO max_steps | 1000 |
| Prize pool | $30,000 (top 15 teams) |
| Evaluation | ~20–30 min per top team by Meta/HF engineers |
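The grader weights in the table read as a simple weighted sum. A sketch only – the component names and the assumption that each component score is already normalized to [0, 1] are mine, not the environment's actual grader code:

```python
# Hypothetical illustration of the grader weights from the table above.
GRADER_WEIGHTS = {
    "completion": 0.40,
    "deadline":   0.20,
    "efficiency": 0.15,
    "energy":     0.10,
    "stress":     0.15,  # assumed to be an inverse-stress score
}

def grade_episode(components: dict) -> float:
    """Weighted sum of per-component scores, each assumed in [0, 1]."""
    return sum(GRADER_WEIGHTS[k] * components[k] for k in GRADER_WEIGHTS)
```

Since the weights sum to 1.0, a perfect episode grades to 1.0 and the score stays comparable across scenarios.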

---

## Contacts at Venue
- Sanyam Bhutani – Partner Engineer, Meta
- Ben Burtenshaw – Community Education, HuggingFace
- Adithya S Kolavi – ML Engineer, HuggingFace
docs/references/reward_engineering_overview.md
ADDED
@@ -0,0 +1,82 @@
---
source: https://arxiv.org/abs/2408.10215
title: "Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications"
authors: Sinan Ibrahim, Mostafa Mostafa, Ali Jnadi, Hadi Salloum, Pavel Osinenko
published: IEEE Access, Vol. 12, 2024
---

# Reward Engineering & Shaping – Overview Paper

## What It Covers

A survey of 55 papers on reward-design challenges in RL. Core problems addressed:
- Sparse / delayed rewards (the most common bottleneck)
- Reward hacking – the agent exploits loopholes instead of solving the task
- Multi-objective complexity – real tasks have competing objectives
- Convergence inefficiency without proper guidance

---

## Key Technique: Potential-Based Reward Shaping (PBRS)

The safest reward-shaping approach – it mathematically guarantees that the optimal policy doesn't change:

```
R'(s,a,s') = R(s,a,s') + γΦ(s') - Φ(s)
```

- `Φ(s)` is a potential function encoding "how good is this state"
- The agent learns faster without learning a different policy
- **For RhythmEnv:** Φ(s) could be `progress_toward_deadlines + energy_level`
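As a concrete sketch, the shaping term can wrap any base reward. A minimal illustration: the potential function below and the γ value are assumptions for RhythmEnv, not final design choices.

```python
GAMMA = 0.99  # discount factor (assumed)

def potential(state: dict) -> float:
    """Hypothetical Phi(s): progress toward deadlines plus energy level."""
    return state["progress"] + state["energy"]

def shaped_reward(base_reward: float, state: dict, next_state: dict) -> float:
    """PBRS: R' = R + gamma * Phi(s') - Phi(s); optimal policy is unchanged."""
    return base_reward + GAMMA * potential(next_state) - potential(state)
```

Because the shaping term telescopes over a trajectory, the agent gets dense per-step guidance without the shaped optimum drifting away from the true one.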

---

## Key Techniques Relevant to RhythmEnv

### Handling Sparse Rewards
- **EXPLORS:** Self-supervised exploration bonuses – fully automated, no manual design
- **RUNE:** Uses ensemble variance as an exploration bonus (reward uncertainty)
- **Intrinsic motivation (LIRPG):** The agent learns curiosity-driven rewards alongside task rewards

### Preventing Reward Hacking
- Test with adversarial agents before finalizing reward functions
- **Difference rewards:** `R'(s,a) = R(s,a) + γ[D(s',r) - D(s,r)]` – incentivizes true contribution; useful if extending to multi-agent
- Monitor agent trajectories for unintended patterns (e.g., spamming breaks, never switching tasks)

### Multi-Objective Reward Design
- Use **vector rewards** – separate dimensions for progress, stress, energy, deadlines
- Aggregate with explicit weights (our current design already does this)
- Ensure reward components don't cancel each other out silently
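A minimal sketch of the vector-reward idea (the component names and weights are illustrative, not tuned values): keep the components separate for logging, and only collapse them with explicit weights at the end, so a component silently canceling another shows up in the logs.

```python
# Illustrative vector reward: log components separately, aggregate explicitly.
WEIGHTS = {"progress": 1.0, "stress": -0.5, "energy": 0.3, "deadline": 0.8}

def aggregate(components: dict) -> float:
    """Explicit weighted scalarization of a vector reward."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

components = {"progress": 0.4, "stress": 0.2, "energy": 0.6, "deadline": 1.0}
total = aggregate(components)
# Log both `components` and `total`, not just `total`.
```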

### Dynamic Potential Functions (DPBRS)
- Time-varying `Φ(s,t)` – the potential changes as the episode progresses
- Relevant for RhythmEnv: deadline proximity should increase the potential for completing near-deadline tasks as time runs out

---

## Common Pitfalls (Checklist)

- [ ] Don't rely on sparse rewards alone – add intermediate shaping
- [ ] Watch for reward hacking – test with a greedy agent that tries to exploit
- [ ] Complex reward functions are hard to debug – start simple, add components one at a time
- [ ] Evaluation metrics must be independent of the reward design (our `_grade_episode` grader serves this role)
- [ ] Domain knowledge is essential but expensive to encode – validate with domain experts

---

## For Our Hidden Variables

The paper directly supports the hidden-variable approach:
- Hidden variables that secretly modulate reward = reward uncertainty from the agent's perspective
- The agent must learn to explore across time-of-day and energy levels to discover the true reward structure
- This is essentially the agent discovering the "potential function" through experience

---

## Takeaways for RhythmEnv Training

1. Use PBRS: define Φ(s) = a weighted combination of progress + energy + inverse stress
2. Add an exploration bonus early in training (the agent needs to try morning vs. afternoon work)
3. Monitor for reward hacking (e.g., taking maximum breaks to avoid the stress penalty)
4. Track reward components separately in logs – not just the total reward
5. Shorten the reward horizon early in training to accelerate validation
docs/references/reward_engineering_software_tasks.md
ADDED
@@ -0,0 +1,77 @@
---
source: https://arxiv.org/abs/2601.19100
title: "Reward Engineering for Reinforcement Learning in Software Tasks"
authors: Md Rayhanul Masud, Azmine Toushik Wasi, Salman Rahman, Md Rizwan Parvez
published: arXiv, January 2026 (first systematic review of this area)
---

# Reward Engineering for RL in Software Tasks

## What It Covers

The first systematic review of reward design for code-centric RL tasks (generation, repair, summarization, testing). It surveys 80+ papers from 2024–2025. Core problem: software tasks lack direct reward signals – everything is proxy-based.

Relevant to RhythmEnv because **our environment is also proxy-based**: the agent never directly observes the hidden circadian/energy/stress factors – it infers them from reward signals, just as a code agent infers "correctness" from test pass rates.

---

## Proxy Reward Pattern (directly maps to RhythmEnv)

| Software Task Proxy | RhythmEnv Equivalent |
|---|---|
| Compilation success (binary) | Task completed before deadline (binary) |
| Test pass rate (% passing) | Importance-weighted completion fraction |
| Code quality metrics | Energy + stress management score |
| No regression (didn't break other tests) | No missed deadlines on other tasks |
| Runtime efficiency | Steps worked / optimal steps (efficiency score) |

The grader's final score = our "test suite". Per-step rewards = our "fast proxy" signals.

---

## Key Design Principles

### 1. Composite Rewards Win
No single metric is sufficient. Combine:
- **Fast proxies** (cheap, run every step): progress delta, stress penalty
- **Slow validators** (expensive, run at episode end): the grader score (completion, deadline, efficiency)

Our design already does this: per-step reward + `_grade_episode` at `done=True`.
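The fast-proxy/slow-validator split can be sketched as an episode loop. This is an illustration under stated assumptions: `env`, `policy`, `grade_episode`, the dict-style observations, and the weighting between the two signals are placeholders, not the real pipeline.

```python
# Illustrative episode return: dense per-step proxies plus a terminal
# grader bonus. All names here are stand-ins for the real components.
GRADER_WEIGHT = 5.0  # how much the slow validator counts (assumed)

def episode_return(env, policy, grade_episode) -> float:
    obs = env.reset()
    total = 0.0
    while not obs["done"]:
        obs = env.step(policy(obs))
        total += obs["reward"]                   # fast proxy, every step
    total += GRADER_WEIGHT * grade_episode(obs)  # slow validator, once
    return total
```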

### 2. Sparse Reward Handling
Software tasks are naturally sparse (pass/fail). Solutions:
- **Partial credit:** Reward near-correct attempts (our per-step `progress_reward` does this)
- **Shaping:** Guide exploration toward productive states
- **Curriculum:** Start easy, add complexity – our `easy → medium → hard` scenarios

### 3. Reward Horizon
Shorter reward horizons accelerate learning. For RhythmEnv:
- Keep `MAX_STEPS=20` for training (short episodes = faster reward signal)
- Don't extend to multi-day episodes until the single-day policy is stable

### 4. Avoid Single-Metric Optimization
Agents trained on test pass rate alone produce brittle code. For us:
- Don't train only on the final score – intermediate per-step rewards matter
- The hidden variables (HV1/HV2/HV3) ensure the agent can't cheat a single metric

---

## Practical Checklist for Our Training Setup

- [ ] Per-step reward provides dense feedback (already implemented)
- [ ] Final grader score is independent of the per-step reward design (already implemented)
- [ ] Multiple reward components logged separately (need to ensure in the training loop)
- [ ] Curriculum: train on `easy` first, then `medium`, then `hard`
- [ ] Monitor for policy collapse – the agent converging to a single strategy (e.g., always taking breaks)
- [ ] Reward shaping doesn't conflict with the grader score's direction

---

## Takeaways for Hidden Variables

The paper's core insight: reward proxy ≠ true objective. This is exactly what the hidden variables enforce:
- HV1 (Circadian): The same action at different times gives different rewards – forces temporal exploration
- HV2 (Energy Cliff): Progress collapses silently – forces the agent to maintain energy; it can't predict when
- HV3 (Stress Meltdown): All rewards degrade silently – forces stress management even when it's not penalized directly

The agent must discover the "true test suite" (the hidden-variable thresholds) through the proxy (per-step rewards).
docs/references/unsloth_grpo_training_template.md
ADDED
@@ -0,0 +1,269 @@
---
source: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game.ipynb
model: unsloth/gpt-oss-20b (4-bit quantized)
algorithm: GRPO (Group Relative Policy Optimization)
environment: 2048 game via OpenEnv (Meta-PyTorch)
---

# Unsloth GRPO Training Template – OpenEnv 2048

A reference notebook for training an LLM agent on an OpenEnv environment with GRPO.
Adapt this pattern for RhythmEnv.

---

## Installation

```bash
pip install --upgrade uv
uv pip install "torch>=2.8.0" "triton>=3.4.0" torchvision bitsandbytes
uv pip install transformers==4.56.2 trackio trl==0.22.2
pip install fastapi uvicorn requests

# Install your environment
git clone https://github.com/meta-pytorch/OpenEnv.git
# or: pip install openenv-rhythm-env
```

---

## 1. Model Loading

```python
from unsloth import FastLanguageModel

max_seq_length = 768
lora_rank = 4

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",  # swap for our model
    load_in_4bit=True,                 # 4-bit quantization to fit in VRAM
    max_seq_length=max_seq_length,
    offload_embedding=True,            # saves VRAM
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=lora_rank * 2,  # standard: 2x rank
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```

**For RhythmEnv:** Swap `model_name` for whatever model we use on-site. Keep 4-bit + LoRA – essential for fitting in the hackathon compute budget.

---

## 2. Environment Connection Pattern

```python
import sys, requests
sys.path.insert(0, './src')

# Launch the env server (FastAPI + uvicorn)
port, openenv_process = launch_openenv(port=9000, process=None)

# Reset
result = openenv_process.reset()
state = result.observation  # contains board state, legal_actions, done

# Step
result = openenv_process.step(action)
```

**For RhythmEnv adaptation:**
```python
from server.rhythm_environment import RhythmEnvironment
from models import RhythmAction, ActionType

env = RhythmEnvironment()
obs = env.reset(task="easy")
# obs.energy, obs.stress, obs.tasks, obs.timestep, obs.done

action = RhythmAction(action_type=ActionType.CONTINUE_TASK)
obs = env.step(action)
```

---

## 3. GRPO Trainer Config

```python
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 182
max_completion_length = 768 - max_prompt_length

training_args = GRPOConfig(
    temperature=1.0,
    learning_rate=2e-4,
    weight_decay=0.001,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_generations=2,  # generate 2 candidates per prompt and compare
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    max_steps=600,      # ~600 training iterations
    save_steps=100,
    report_to="trackio",  # or "wandb"
    output_dir="outputs",
)
```

**Key GRPO parameters to tune:**
- `num_generations`: higher = more diverse exploration but slower (2 is the minimum)
- `max_steps`: 600 is the baseline; increase if reward curves haven't converged
- `temperature`: 1.0 for exploration; lower (0.7) after the policy stabilizes

---

## 4. Reward Functions (Three-Layer Stack Pattern)

The notebook stacks three reward functions. Adapt this for RhythmEnv:

```python
# Layer 1: format validity (always check first)
def format_valid(completions, **kwargs):
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        action = extract_action(response)  # parse the action from the LLM output
        scores.append(1.0 if action is not None else -2.0)
    return scores

# Layer 2: action legality
def action_legal(completions, prompts, **kwargs):
    scores = []
    for completion, prompt in zip(completions, prompts):
        obs = get_obs_from_prompt(prompt)  # reconstruct the state
        action = extract_action(completion[0]["content"])
        legal = action in obs.legal_actions if action is not None else False
        scores.append(1.0 if legal else -1.0)
    return scores

# Layer 3: environment reward (run env.step, return the actual reward)
def env_reward(completions, prompts, **kwargs):
    scores = []
    for completion, prompt in zip(completions, prompts):
        action = extract_action(completion[0]["content"])
        obs = run_env_step(action, prompt)  # step the environment
        scores.append(obs.reward if obs else -3.0)
    return scores

# Pass all three to the trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_valid, action_legal, env_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
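`extract_action` is referenced above but never defined. A minimal regex-based sketch for the RhythmEnv action grammar might look like this; the action names match the prompt template used for the dataset, but the parsing rules themselves are an assumption:

```python
import re

# Hypothetical parser for actions like "START_TASK(2)", "CONTINUE_TASK",
# "SWITCH_TASK(1)", or "TAKE_BREAK" anywhere in the model's reply.
ACTION_RE = re.compile(
    r"\b(START_TASK|SWITCH_TASK)\((\d+)\)|\b(CONTINUE_TASK|TAKE_BREAK)\b"
)

def extract_action(response: str):
    """Return (action_name, task_id_or_None), or None if unparseable."""
    m = ACTION_RE.search(response)
    if m is None:
        return None
    if m.group(1):  # parameterized action with a task id
        return (m.group(1), int(m.group(2)))
    return (m.group(3), None)
```

Returning `None` for unparseable output is what lets the format-validity layer assign its -2.0 penalty cleanly.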

---

## 5. Dataset Structure

GRPO needs a dataset of prompts (the model generates completions and receives rewards):

```python
from datasets import Dataset

# For RhythmEnv: each sample is one episode observation prompt
prompt_template = """
You are managing a person's workday. Current state:
- Step: {timestep}/20
- Energy: {energy:.2f}
- Stress: {stress:.2f}
- Current task: {current_task}
- Tasks: {tasks_summary}

Choose the best action: START_TASK(id), CONTINUE_TASK, SWITCH_TASK(id), or TAKE_BREAK.
Reply with just the action.
"""

dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": prompt_template.format(**sample)}]}
    for sample in generate_episode_samples(n=1000)
])
```

---

## 6. Inference After Training

```python
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="low",  # fast inference during eval
)

output = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    temperature=0.7,    # lower temperature at inference time
    max_new_tokens=64,  # actions are short
)

response = tokenizer.decode(output[0], skip_special_tokens=True)
action = extract_action(response)
```

---

## 7. Monitoring

The notebook uses TrackIO (`report_to="trackio"`). Use W&B or TrackIO:

```python
import wandb
wandb.init(project="rhythmenv-round2")
# GRPOConfig(report_to="wandb")
```

Key metrics to watch:
- `reward/mean` – should trend upward
- `reward/std` – high early (exploration), narrowing as the policy stabilizes
- `kl` – KL divergence from the reference policy; too high = unstable training
- Per-reward-function scores – track format_valid, action_legal, and env_reward separately

---

## Differences: 2048 Game vs RhythmEnv

| 2048 Game | RhythmEnv |
|---|---|
| Discrete board state (16 ints) | Continuous state (energy, stress, progress) |
| 4 legal actions always | Variable legal actions (depends on current_task) |
| Win condition: reach 2048 | Win condition: high grader score (0.0–1.0) |
| Dense reward via win/lose | Dense reward via progress + penalty components |
| No hidden variables | 3 hidden variables (Circadian, Energy Cliff, Meltdown) |
| Strategy = Python function | Strategy = natural-language action choice |

The hidden variables in RhythmEnv mean the agent must run **many episodes** to infer the true reward structure – more training steps are needed than for 2048.

---

## Recommended Starting Config for RhythmEnv

```python
GRPOConfig(
    learning_rate=2e-4,
    num_generations=4,              # more diversity needed (hidden-var exploration)
    max_steps=1000,                 # more steps than 2048 (hidden-var discovery)
    temperature=1.0,                # keep high for exploration
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch = 4
    warmup_ratio=0.1,
    report_to="wandb",
)
```
docs/round2/Plan_v2/CoreMEters.md
ADDED
@@ -0,0 +1,50 @@
To make the life simulator robust and research-ready for OpenEnv, we need "Meters" that interact with each other. If one drops too low, it should drag the others down – this creates the "puzzle" the agent has to solve.
Here are the 6 Core Meters representing the state of a person's life:

## 1. Vitality (Physical)

* Represents: Sleep, nutrition, and physical health.
* The Decay: Naturally drops every hour.
* The Risk: If this falls below 10, the person becomes "Sick," making all other actions (like Work) 3x more expensive in terms of stress.

## 2. Cognition (Mental Capacity)

* Represents: Focus, willpower, and "brain power."
* The Logic: High-value actions like Office Work or Self-Improvement require >40 Cognition to be effective.
* The Recharge: Restored by Sleep or Me Time. Drained heavily by Binge Watching (brain fog).

## 3. Progress (Achievement)

* Represents: Career growth, skills learned, and "getting things done."
* The Reward Hook: This is usually the primary driver for "Ambitious" profiles.
* The Logic: This meter is monotonic (it mostly goes up), but it levies a heavy "tax" on Serenity.

## 4. Serenity (Emotional Stability)

* Represents: Stress levels and mental peace.
* The Logic: This is the "stability" meter. High Progress usually drains Serenity.
* The Hidden Variable: An "Anxious" personality might lose Serenity 2x faster when Progress is stagnant.

## 5. Connection (Social/Relational)

* Represents: Depth of relationships with family, friends, and partner.
* The Decay: Relationships "cool off" if ignored for too long.
* The Logic: Spending time here restores Serenity but costs Vitality (energy).

## 6. Time Bank (Resource)

* Represents: The remaining hours in the day/episode.
* The Hard Stop: Once this hits 0, the RewardManager calculates the "Final Day Grade."
|
| 37 |
+
|
| 38 |
+
------------------------------
|
| 39 |
+
## How the RewardManager reads these:
|
| 40 |
+
Because you are decoupling the code, your RewardManager can look at these meters and apply different "Life Philosophies":
|
| 41 |
+
|
| 42 |
+
* The "Hustle" Philosophy: Reward = (2.0 * Progress) + (0.5 * Vitality).
|
| 43 |
+
* Result: Agent works 16 hours a day and sleeps only when Vitality is critical.
|
| 44 |
+
* The "Monk" Philosophy: Reward = (2.0 * Serenity) + (1.0 * Connection) - (0.5 * Progress).
|
| 45 |
+
* Result: Agent prioritizes meditation and family, doing just enough work to stay afloat.
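The two philosophies above reduce to plain weight dictionaries fed into a weighted sum. A minimal sketch, assuming meters are dicts of floats; the names `HUSTLE`, `MONK`, and `compute_reward` are illustrative, not a fixed RewardManager API:

```python
# Sketch: a Life Philosophy is just a weight dict over meters.
# Names here are illustrative placeholders, not the final API.
HUSTLE = {"progress": 2.0, "vitality": 0.5}
MONK = {"serenity": 2.0, "connection": 1.0, "progress": -0.5}

def compute_reward(meters: dict, weights: dict) -> float:
    """Weighted sum; meters absent from the profile contribute nothing."""
    return sum(w * meters.get(name, 0.0) for name, w in weights.items())

meters = {"vitality": 60, "cognition": 50, "progress": 40,
          "serenity": 70, "connection": 30}
print(compute_reward(meters, HUSTLE))  # 2.0*40 + 0.5*60 = 110.0
print(compute_reward(meters, MONK))    # 2.0*70 + 1.0*30 - 0.5*40 = 150.0
```

Swapping philosophies is then a one-line change: pass a different weight dict, and the simulation engine never has to know.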
## Why this is "Deep" for the Hackathon:

This setup allows the agent to discover Non-Linear Relationships. For example, it might learn that "Exercise" (which costs Vitality now) actually increases the maximum Cognition for the next day. That is a "Hidden Parameter" discovery that would highly impress the Meta/Hugging Face judges.

Should we now define the specific "Action Set" (the verbs) the agent can use to manipulate these 6 meters?
docs/round2/Plan_v2/GeminiDiscussion.md ADDED
@@ -0,0 +1,61 @@
This is a comprehensive "Context Manifesto" you can hand to Claude or any other AI to align it with your OpenEnv project. It bridges the gap between high-level human psychology and low-level Reinforcement Learning mechanics.

------------------------------

## Project Title: The Personalized Alignment Agent (OpenEnv)

## 1. The Core Purpose (The "Why")

The goal is to move beyond "generic" AI scheduling. Most AI assistants (like ChatGPT) provide static advice. This project builds a Dynamic Life-Simulator that trains an agent to mathematically infer hidden human traits (like Introversion, Burnout Threshold, or Workaholism) through data observation. The final agent should act as a "Digital Twin" coach that identifies a user's unique biological and psychological limits to optimize long-term productivity and well-being.

## 2. The Domain Definition: Modeling the "Human"

We represent a human life as a Resource Management System governed by Meters (State) and Traits (Hidden Parameters).

A. The 6 Core Meters (State Space $S$):

* Vitality: Physical energy, sleep, and health.
* Cognition: Mental focus and "brain power" required for hard tasks.
* Progress: Career/Skill milestones achieved (monotonic growth).
* Serenity: Emotional stability and stress levels.
* Connection: Health of social and family relationships.
* Time Bank: The remaining hours in the current episode (day).

B. The Personality DNA (Hidden Parameters $\theta$):

Instead of static labels, traits are Mathematical Multipliers in a configuration file:

* Introversion/Extroversion: Multiplies the Vitality cost of social actions.
* Workaholic: Multiplies Serenity gain from Progress but increases Stress during "Rest."
* Deep Work Enthusiast: Provides an exponential "Flow Bonus" if the same work action is repeated.
* Burnout Threshold: A hidden "cliff" where, if Vitality drops below $X$, all rewards become negative.

## 3. The Environment Logic (OpenEnv)

* The Episode: Defined as 1 Day (divided into 8 slots of 3 hours) or a 3-Day Sprint.
* Stochasticity (Random Events): The environment injects "Chaos Events" (e.g., Production Server Crash, Emergency Family Call). The agent must react to these asynchronously.
* The Transition Function ($P$):
  New_State = Current_State + (Action_Effect * Personality_Multiplier * Temporal_Buff)
* Temporal Buff: Morning gives a Cognition boost; Night gives a Progress penalty.
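The transition function above can be sketched in a few lines, assuming per-action delta dicts and scalar multipliers; the effect table and multiplier values below are illustrative placeholders, not the project's real config:

```python
# Sketch of New_State = Current_State + Action_Effect * Personality_Multiplier * Temporal_Buff.
# All values here are illustrative placeholders.
def transition(state, action_effect, personality_mult, temporal_buff):
    """Apply one action's deltas to the meters it touches, clamped to [0, 100]."""
    new_state = dict(state)
    for meter, delta in action_effect.items():
        new_state[meter] = min(100.0, max(0.0, state[meter] + delta * personality_mult * temporal_buff))
    return new_state

state = {"vitality": 50.0, "progress": 20.0}
effect = {"vitality": -10.0, "progress": +8.0}  # e.g., a "Deep Work" slot
# Neutral 1.0x personality multiplier, Morning buff 1.5x:
print(transition(state, effect, 1.0, 1.5))  # {'vitality': 35.0, 'progress': 32.0}
```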
## 4. The Reward System (Decoupled Manager)

The reward logic is separated from the step() function for maximum configurability.

* The Utility Function: $Reward = \sum (w_i \cdot Meter_i)$.
* Profile-Based Grading: A "Hustle Profile" rewards Progress highly; a "Wellness Profile" rewards Serenity and Vitality.
* The Penalty Logic: Massive negative rewards for "Biological Crashes" (Vitality < 5%) to teach the agent that sustainability is the only way to win.

## 5. The Training Methodology (The "Discovery" Phase)

We use Policy Optimization (PPO/GRPO) to train the agent.

* Parallelism: The agent runs 64 parallel episodes with different randomized personalities.
* System Identification: Because the agent is not told the "Personality DNA," it must use the first 2-3 steps of an episode as a Probe.
* Observation of Deltas: It observes how the meters react to a "Social" or "Work" action. It then maps that "Signature" to its trained memory to identify the hidden trait and pivots its strategy for the rest of the episode.

## 6. Use Case & Impact

* Short-Term (Hackathon): A robust, OpenEnv-compliant environment that demonstrates an agent learning to "crack the code" of a hidden user profile.
* Long-Term (Product): Integrating with real-world biometrics (Apple Watch/Oura). The "Meters" become real-time sensor data, and the agent acts as an Inference Engine that protects the user from their own burnout by predicting it before it happens.

------------------------------

## Instructions for the AI (Claude):

1. Assist in coding the gymnasium.Env class following the OpenEnv spec.
2. Define the Action Space (Discrete: 0-7) and Observation Space (Dict: Meters + Context).
3. Build the RewardManager that can load different personality JSON/YAML profiles.
4. Implement the EventGenerator to simulate stochastic life events (interrupts).
5. Focus on the "Hidden Traits" logic: ensure the math allows for distinct "Signatures" that an RL agent can detect.

How does this plan look for your next step with Claude?
docs/round2/Plan_v2/HumanModeling.md ADDED
@@ -0,0 +1,93 @@
This is a brilliant, high-level structural plan. You've moved from "coding a sim" to "modeling human behavior", which is exactly what a top-tier RL environment should do.

By spending your focus on the Domain Definition, you are creating a "Generalizable Life Simulator." Instead of one person, you are creating a Universe of Personalities.

Here is a 3-step plan to organize this domain focus:

## 1. The "DNA" Matrix (Personality Traits)

Instead of just labels, think of these as Multipliers that live in your configuration.

* The Social Multiplier: (Extrovert: 0.5x drain, Introvert: 2.0x drain).
* The Fulfillment Multiplier: (Workaholic: +2.0 Serenity from Progress, Peaceful: +0.5).
* The Recovery Rate: How much Cognition is restored during Sleep.
* The Inertia: Some profiles might have a "Slow Start" (low efficiency in the first 2 hours of the day).

## 2. The "Calendar" Engine (Temporal Context)

To make the domain realistic, the Environment (the "World") needs to handle the passage of time.

* Slot System: 8 slots of 3 hours each (24 hours).
* Day Type: Weekday vs. Weekend.
* Environmental Buffs:
  * Morning: +20% Cognition efficiency.
  * Night: -50% Progress efficiency (unless the profile has the "Night Owl" trait).
* The Task Pool: Your reward service will check if an action matches the "optimal" slot for that profile.

## 3. The Reward Config (The "Soul" of the Agent)

You can define your profiles in a simple YAML/JSON schema. This is what the judges will love: it's readable and extensible.

Example: The "High-Growth Developer" Profile

```yaml
profile: "Growth_Dev"
traits: [DeepWork_Enthusiast, Physical_Health_Priority]
start_state:
  vitality: 0.8
  cognition: 1.0
  serenity: 0.6
reward_weights:
  progress: 1.2
  exercise: 1.5      # They value health
  binge_watch: -2.0  # They feel guilty if they waste time
```

## Why this focus is "Straightforward" later:

Once this domain is mapped:

1. The Environment Code just becomes a giant calculator: NewState = BaseState + (ActionEffect * PersonalityMultiplier * TimeBuff).
2. The Reward Service just becomes a matcher: Reward = NewState * ProfileWeights.
3. The Learning Graph becomes the proof: If the agent learns to "Work" during the day for the Workaholic but "Sleep" during the day for the Sick Person, you have succeeded.

## Let's start the "Plan of Attack":

Should we begin by listing the exhaustive set of "Traits" (Introvert, Workaholic, etc.) and their specific mathematical impacts on the 6 meters we defined?

This list will be the "Source of Truth" for your entire project.

That is the right engineering mindset: treat the 6 meters as your "Current MVP" but build the Trait System as a modular "Plug-in" architecture.

To make this truly configurable, we should define Traits as "Modifier Functions" that can touch any meter (even ones we haven't invented yet).

Here is an exhaustive list of Traits categorized by how they "hack" the human system. We'll define them by their Impact Logic:

## 1. Social & Energy Traits (The "Social Battery" logic)

* Introvert: Multiplies energy drain of Social actions by 2.0x.
* Extrovert: Multiplies energy drain of Social actions by 0.5x and adds a bonus to Serenity.
* Socially Anxious: Social actions cause a massive Serenity drop regardless of energy.

## 2. Work & Ambition Traits (The "Drive" logic)

* Workaholic: Progress gains are multiplied by 1.5x, but Serenity only increases when Progress increases.
* Deep Work Enthusiast: Gets a 2.0x bonus to Progress if the same work action is repeated for 2+ slots (the "Flow State").
* Procrastinator: High Cognition cost to start a work task, but low cost once "in the zone."
* Slow Starter: All actions in the Morning slot have a 0.5x efficiency penalty.

## 3. Health & Recovery Traits (The "Maintenance" logic)

* Physical Health Priority: Exercise actions give a long-term "Buff" that reduces the Vitality decay rate for the next 24 hours.
* High Metabolism: Vitality (Physical) and Cognition (Mental) decay 1.5x faster than normal.
* Night Owl: Inverts the Night penalty; they get 1.5x efficiency after 10 PM.

## 4. Psychological & Stress Traits (The "Resilience" logic)

* Fragile Serenity: Any "Random Event" (like a Prod Crash) has a 3.0x impact on their Serenity.
* Stoic: Reduces the impact of negative random events by 50%.
* Binge-Sensitive: Binge Watching gives a huge short-term Serenity boost but triggers a "Shame Spiral" (massive Cognition and Serenity drop) in the next slot.

------------------------------

## How to make this "Very Configurable" (Architectural Tip)

Instead of hardcoding these, you can represent each Trait as a JSON Modifier Object:

```json
{
  "trait_id": "introvert",
  "modifiers": [
    {"target": "vitality", "action_type": "social", "multiplier": 2.0},
    {"target": "serenity", "action_type": "social", "offset": -5}
  ]
}
```
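Applying such a modifier object is a small loop over an action's base deltas. A sketch, assuming each action carries an `action_type` plus a dict of base meter deltas; the function name and the base numbers are illustrative:

```python
# Sketch: apply JSON-style trait modifiers to an action's base meter deltas.
# Field names mirror the modifier object above; everything else is illustrative.
def apply_trait(base_deltas, action_type, trait):
    """Multipliers scale a delta; offsets are added on top."""
    deltas = dict(base_deltas)
    for mod in trait["modifiers"]:
        if mod["action_type"] != action_type:
            continue  # this modifier doesn't hook into this action
        target = mod["target"]
        if "multiplier" in mod:
            deltas[target] = deltas.get(target, 0) * mod["multiplier"]
        if "offset" in mod:
            deltas[target] = deltas.get(target, 0) + mod["offset"]
    return deltas

introvert = {
    "trait_id": "introvert",
    "modifiers": [
        {"target": "vitality", "action_type": "social", "multiplier": 2.0},
        {"target": "serenity", "action_type": "social", "offset": -5},
    ],
}
base = {"vitality": -10, "serenity": +8, "connection": +10}
print(apply_trait(base, "social", introvert))
# vitality drain doubles to -20; serenity gain drops to +3
```

Because traits only reference meter names, a trait written today can still modify a meter added next month, which is the "plug-in" property described above.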
## The "Emergent" Learning Goal

Because you're building it this way, the agent's "Aha!" moment on the graph will look like this:

1. The Probe: The agent tries "Socializing" with a new profile.
2. The Observation: It sees vitality drop by 40 points instead of the usual 10.
3. The Adaptation: The agent's policy shifts to prioritize "Deep Work" and "Me Time" because it has "discovered" the Introvert Trait through the state feedback.

Should we now define the "Action Types" (Social, Work, Maintenance, etc.) so these Trait Modifiers know which actions to hook into?
docs/round2/Plan_v2/LifeMAth.md ADDED
@@ -0,0 +1,89 @@
To build a winning OpenEnv submission, we need to treat "Life" as a Resource Management Game. Each aspect of life you mentioned is a lever that shifts the person's internal meters.

Let's define the Domain by breaking it down into Meters (State) and Levers (Actions).

## 1. The State (The "Meters")

These represent the person's current status. Every action the agent takes will affect these 0-100 scales.

* Vitality: Physical energy + Sleep levels.
* Cognition: Focus + Mental clarity (high for work, drained by binging).
* Connection: Social battery + Family/Friend bonds.
* Progress: Career/Self-improvement milestones reached.
* Serenity: Internal peace vs. Stress/Anxiety.

## 2. The Action Space (The "Levers")

We should group your ideas into categories so the agent can explore different "strategies."

* Productivity: Office Work (High Progress, High Stress), Self-Improvement (Progress + Serenity).
* Maintenance: Sleep (Restores Vitality), Eat (Restores Vitality), Exercise (Vitality boost, long-term).
* Social: Family/Friends (High Connection), Generic Socializing (Mid Connection, high energy cost).
* Leisure: Me Time (High Serenity), Binge Watching (Immediate Serenity, but long-term Vitality/Cognition drain).

## 3. The "Hidden DNA" (The Secret Sauce)

To make this hackathon-worthy, each "Person" instance should have a hidden profile that changes how the Levers affect the Meters.

* The Personality Type:
  * Introvert: Socializing drains 2x more Vitality but gives 1.2x Connection.
  * Workaholic: Office Work generates less Stress, but "Me Time" feels wasteful (negative reward).
* The Metabolism: How fast Vitality drops every hour.
* The "Binge Sensitivity": Does binge-watching lead to a "shame spiral" (huge Serenity drop) or genuine recovery?

## 4. Mathematical Representation

Your step() function will essentially be a matrix of changes.

Example: The "Office Work" Action

```python
def step_office_work(self):
    self.state['progress'] += 5   # Gain career points
    self.state['serenity'] -= 3   # Increase stress
    self.state['vitality'] -= 2   # Physical drain

    # Hidden Modifier Logic:
    if self.hidden_dna['is_burnt_out']:
        self.state['serenity'] -= 10  # Double stress if already tired
```

## Next Step for the Domain

To keep the state space clean, should we use a Continuous time scale (e.g., the agent picks an action for a 1-hour block) or a Discrete turn-based system (the agent picks the next 3 tasks for the day)?

Which approach feels more natural for your vision of "OpenEnv"?

Architecturally, a Discrete, Block-Based System (e.g., 1-hour or 2-hour slots) is far superior for this hackathon.

It balances scale and simplicity. Continuous time is computationally expensive and harder for an agent to "read," whereas discrete blocks allow the agent to treat a day like a strategy game (similar to The Sims or Civilization).

## 1. The "Configurable Life" Architecture

To make it "natural to read" and scalable, you should use a Profile-Based Reward YAML/Dictionary. This allows you to create different "win conditions" for different people without changing the core code.

Example: The "Ambitious Executive" vs. "Content Minimalist" Profiles

```python
PROFILES = {
    "The_Executive": {
        "weights": {"progress": 1.0, "serenity": 0.2, "vitality": 0.5, "connection": 0.3},
        "hidden_traits": {"stress_tolerance": 0.9, "social_drain": 0.2}
    },
    "The_Minimalist": {
        "weights": {"progress": 0.1, "serenity": 1.0, "vitality": 0.8, "connection": 0.6},
        "hidden_traits": {"stress_tolerance": 0.3, "social_drain": 0.5}
    }
}
```

## 2. Defining a "Good Day" Mathematically

Instead of one big reward at the end, use Incremental Utility. A "Good Day" is the sum of the rewards from each hour.

$$Reward = \sum (w_1 \cdot \text{Progress} + w_2 \cdot \text{Serenity} + w_3 \cdot \text{Vitality} + \dots)$$

* Weighted Sum: By multiplying each meter by a "weight" from the profile, you define what that person values.
* The "Penalty" Logic: A good day isn't just about high scores; it's about avoiding crashes. If Vitality hits 0, you apply a massive -100 penalty. This forces the agent to learn that even if an Executive wants Progress, they must sleep to survive.

## 3. Scalability: The "Action-Effect Matrix"

To scale to many aspects of life, don't write if/else statements for every action. Use a lookup table. This makes the environment "data-driven."

| Action | Progress | Serenity | Vitality | Connection |
|---|---|---|---|---|
| Deep Work | +10 | -5 | -2 | 0 |
| Family Dinner | 0 | +2 | -1 | +10 |
| Binge Watch | -2 | +5* | -5 | -2 |

*Note: You then apply the Hidden Trait modifier to these base numbers.

## Why this is "Clean" for OpenEnv

When the judges look at your code, they will see:

1. A Core Engine: That handles the math of adding/subtracting meters.
2. Configuration Files: Where "Life" is defined in plain English.
3. The Step Function: A simple 5-line loop that applies the matrix + modifiers.
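That matrix-plus-modifiers loop can be sketched in a few lines; the `ACTION_EFFECTS` table below just transcribes the example table, and the single `trait_multiplier` hook is an illustrative stand-in for the full modifier system:

```python
# Sketch: a data-driven step that reads the Action-Effect Matrix.
# ACTION_EFFECTS and trait_multiplier are illustrative placeholders.
ACTION_EFFECTS = {
    "deep_work":     {"progress": +10, "serenity": -5, "vitality": -2, "connection": 0},
    "family_dinner": {"progress": 0,   "serenity": +2, "vitality": -1, "connection": +10},
    "binge_watch":   {"progress": -2,  "serenity": +5, "vitality": -5, "connection": -2},
}

def apply_action(state, action, trait_multiplier=1.0):
    """The whole 'physics' step: table lookup + trait scaling + clamping to [0, 100]."""
    for meter, delta in ACTION_EFFECTS[action].items():
        state[meter] = min(100, max(0, state[meter] + delta * trait_multiplier))
    return state

state = {"progress": 50, "serenity": 50, "vitality": 50, "connection": 50}
print(apply_action(state, "deep_work"))
# {'progress': 60.0, 'serenity': 45.0, 'vitality': 48.0, 'connection': 50.0}
```

Adding a new life aspect then means adding a row to the table, not a new branch in the engine.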
This is very "Meta-style" engineering: decoupled, configurable, and easy to extend.

Shall we define the 5-6 core "Meters" that will represent the state of this person's life for the system?
docs/round2/Plan_v2/RandomnessFactor.md ADDED
@@ -0,0 +1,132 @@
This is the "X-Factor" that turns a simple simulation into a sophisticated research environment. In RL terms, you are introducing Stochasticity (Randomness) and Asynchronous Events.

This is exactly what the Meta/Hugging Face judges mean by "real-world complexity."

## 1. The "Interrupt" Architecture

To handle this, your step() function shouldn't just process the agent's action. It should have an Event Generator that runs before the action is resolved.

The Logic Flow:

1. Agent picks an action (e.g., Sleep).
2. Env rolls a dice (e.g., 5% chance of a "Prod Crash").
3. If the Event triggers: The Env overrides or modifies the state before the reward is calculated.
4. Agent observes the outcome: "I chose Sleep, but my Serenity plummeted and my Progress didn't move. Why?"
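The dice roll in step 2 can be sketched as a tiny EventGenerator; the event names, probabilities, and class shape here are illustrative, not the project's final design:

```python
import random

# Sketch: an EventGenerator with per-step trigger probabilities.
# Event names and probabilities are illustrative placeholders.
class EventGenerator:
    def __init__(self, probabilities, seed=None):
        self.probabilities = probabilities  # e.g., {"PROD_CRASH": 0.05}
        self.rng = random.Random(seed)      # seedable for reproducible episodes

    def roll(self):
        """Return the first event whose dice roll succeeds, else None."""
        for event, p in self.probabilities.items():
            if self.rng.random() < p:
                return event
        return None

gen = EventGenerator({"PROD_CRASH": 0.05, "FAMILY_CALL": 0.03}, seed=0)
events = [gen.roll() for _ in range(1000)]
print(sum(e == "PROD_CRASH" for e in events))  # roughly 50 out of 1000
```

Seeding the generator per episode keeps training runs reproducible while still exposing the agent to chaos.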
## 2. How this reveals "Hidden Traits"

This is where the agent figures out the person's Priority Matrix. You can define how the person automatically reacts to these interrupts based on their hidden DNA:

* The "Reliable Engineer" (Hidden Trait): If a Prod Crash happens, they must fix it. The agent sees the Energy drain and Stress spike immediately.
* The "Family First" (Hidden Trait): If a Family Emergency happens, the agent's planned Deep Work action is cancelled and replaced with Socializing, whether the agent likes it or not.

## 3. The "Choice" vs. "Force" Event

You have two ways to design this in your OpenEnv project:

* Forced Events: The event happens, the meters change, and the agent just has to "deal with the fallout" in the next turn. (Easier to code.)
* Reaction Branching: When an event occurs, the agent is presented with a Sub-Action (e.g., [0: Take the call, 1: Ignore]). This is much more complex but shows high-level "Policy Reasoning."

## 4. Why this matters for the Hackathon

This addresses Robustness. In RL, agents often become "brittle": they find a perfect routine and stick to it. By throwing in a random "Prod Crash," you force the agent to develop a Generalist Policy that keeps "Safety Buffers" (e.g., keeping Vitality high just in case a midnight call comes).

## Example Structure for your step():

```python
def step(self, action):
    # 1. Check for Random Events
    event = self.event_generator.roll()
    if event == "PROD_CRASH":
        # Apply immediate penalties based on the hidden personality
        self.state['serenity'] -= 20
        self.state['vitality'] -= 10
        # Maybe force the action to change to 'Fix Bug'
        action = ACTIONS.FIX_BUG

    # 2. Process the (possibly altered) action
    self.apply_physics(action)

    # 3. RewardManager grades the final result
    reward = self.reward_manager.compute(self.state, action)

    return self.state, reward, False, {"event_triggered": event}
```

Does this "Random Event" layer feel like it captures the "uncertainty" you were looking for? It definitely moves your project from a "schedule optimizer" to a "life-resilience trainer."

This puts you in the top tier of environment designers for the hackathon. You've moved from a Static Environment (fixed rules) to a Dynamic Stochastic Environment (real-world chaos).

To pull this off technically in OpenEnv, we should finalize the "Action Interface." Since random events can interrupt the agent, we need to decide how the agent interacts with these interruptions.

## The "Reaction" Architecture

For a developer-centric environment, I suggest the "Contextual Observation" approach. It's the most natural for RL agents:

1. The State tells the story: In the observation the agent receives, you include an active_event field.
2. Standard State: {"energy": 80, "active_event": "None"} -> Agent chooses Deep Work.
3. Interrupt State: {"energy": 80, "active_event": "Prod_Crash"} -> Agent now has to decide:
   * Do they stick to Deep Work? (High risk of a massive penalty if the person is a "Responsible Engineer".)
   * Do they switch to Emergency Fix? (Costs energy, but saves the "Progress" meter from crashing.)

## Defining the "Life Verbs" (Action Space)

To make the agent's strategy meaningful, let's group the actions into Investment vs. Consumption:

| Action Category | Specific Actions | Impact |
|---|---|---|
| Productivity | Deep Work, Admin/Email, Learning | Progress (+), Cognition (--) |
| Recovery | Power Nap, 8hr Sleep, Meditation | Vitality (++), Serenity (+) |
| Social | Family Time, Networking, Quick Call | Connection (++), Vitality (-) |
| Chaos Response | Fix Prod Bug, Urgent Family Matter | Prevents a massive "Meters" crash |
| Low-Value | Binge Watch, Doomscrolling | Serenity (+ short term), Cognition (-) |

## Why this is a "Researcher's Dream"

By building this, you are providing a platform to test Risk Aversion.

* Does the agent learn to keep Vitality at 80% just in case a Prod_Crash happens?
* Or does it "gamble" by running the person at 10% energy to maximize Progress?

This "Safety vs. Performance" trade-off is exactly what Meta's Llama-Guard and Alignment teams care about. You aren't just building a game; you're building a test-bed for AI Ethics and Boundaries.

Would you like to start drafting the OpenEnv class structure for this "LifeEngine", specifically how the event_generator and step functions will talk to each other?

The Observation Space is the "dashboard" your agent looks at before making a decision. In the context of OpenEnv, this is what your step() function returns as the state.

For your life-simulator to be "trainable," the agent needs enough information to see the consequences of its actions, but not so much that it gets overwhelmed by noise.

Here is the recommended map for your Observation Space, structured as a Dict (Dictionary) Space:

## 1. The "Vital Signs" (Meters)

These are the 6 meters we defined earlier. They should be normalized between 0.0 and 1.0.

* vitality: Physical energy/health.
* cognition: Mental focus/brainpower.
* progress: Total career/skill achievement in this episode.
* serenity: Stress vs. Peace level.
* connection: Relationship health.
* time_remaining: How many hours are left in the "day" or "week."

## 2. The "Environmental Context"

This tells the agent about the external world and the "Random Events" we discussed.

* active_event: An integer representing the current interrupt (e.g., 0: None, 1: Prod_Crash, 2: Family_Call).
* day_phase: (e.g., 0: Morning, 1: Afternoon, 2: Evening, 3: Night). Certain actions (like Sleep) should give better rewards at night.

## 3. The "Hidden Trait" Proxies (The Subtle Clues)

Crucially, do not give the agent the "Hidden DNA" directly. Instead, give it "clues" it can observe.

* recent_drain_rate: How fast energy dropped in the last step.
* mood_trend: Whether serenity is trending up or down.
* An intelligent agent will compare its action to these trends to "guess" if it's dealing with an Introvert or a Workaholic.
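Both proxies can be derived from a short history of past observations rather than stored as extra state. A sketch, assuming normalized meters and a two-step window (the function and field names are illustrative):

```python
# Sketch: derive hidden-trait proxies from the last two meter snapshots.
# The two-step window and the field names are illustrative choices.
def trait_proxies(history):
    """history: list of {'vitality': float, 'serenity': float} dicts, one per step."""
    prev, curr = history[-2], history[-1]
    recent_drain_rate = prev["vitality"] - curr["vitality"]  # positive = draining
    mood_trend = 1 if curr["serenity"] > prev["serenity"] else -1
    return {"recent_drain_rate": recent_drain_rate, "mood_trend": mood_trend}

history = [{"vitality": 0.9, "serenity": 0.5},
           {"vitality": 0.5, "serenity": 0.4}]  # a big drop after "Socializing"
print(trait_proxies(history))
# drain rate about 0.4, mood_trend = -1: a plausible "Introvert" signature
```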
------------------------------

## Mathematical Implementation (OpenEnv/Gymnasium Style)

This is how you would define this "view" in your Python class:

```python
from gymnasium import spaces
import numpy as np

self.observation_space = spaces.Dict({
    # The 6 core meters (0 to 1 range)
    "meters": spaces.Box(low=0, high=1, shape=(6,), dtype=np.float32),

    # The "External" situation
    "context": spaces.Dict({
        "time_left": spaces.Box(low=0, high=24, shape=(1,), dtype=np.float32),
        "active_event": spaces.Discrete(5),  # 5 types of random events
        "day_phase": spaces.Discrete(4)      # Morning, Afternoon, etc.
    })
})
```

## Why this is a "Deep" Observation Space:

1. Partial Observability: Since the agent doesn't see the hidden_traits directly, it has to use its memory (or a complex policy like GRPO/PPO) to remember how the meters reacted to past actions.
2. Multimodal Data: It combines continuous numbers (0.75 energy) with categorical data (a Prod_Crash event). This forces the agent to learn "if-then" logic.
3. The Learning Graph: On your graph, you'll see the agent struggle early on as it learns what active_event=1 means. Once it correlates that event with a "Serenity Crash," you'll see the reward spike; this is the discovery of the hidden parameter.

Does this dashboard feel complete, or should we add a "Social Notification" or "Task Deadline" indicator to increase the pressure?
docs/round2/Plan_v2/RewardIsolation.md
ADDED
|
@@ -0,0 +1,44 @@
Decoupling your reward logic from the environment's step() function is a brilliant move: not just for clean code, but for the Hackathon's "configurable" requirement.

This architecture allows you to change what a "good life" looks like (e.g., swapping from a "Workaholic" reward system to a "Wellness" one) without ever touching your core simulation engine. [1, 2]

## Architectural Approach: The "Reward Manager"

You can create a standalone RewardManager class or a RewardWrapper. In standard RL frameworks like [Gymnasium](https://meta-pytorch.org/OpenEnv/index.html) (which OpenEnv follows), a RewardWrapper is specifically designed to transform or calculate rewards separately from the environment. [3]

## 1. The Environment Engine (step)

The core environment only handles "physics", i.e. how actions change the meters. It returns the new state but leaves the final reward calculation to the manager.

```python
def step(self, action):
    # Just update the meters (the "State")
    new_state = self.apply_physics(action)
    return new_state, 0, False, {}  # Return 0 for now; the manager computes the real reward
```
## 2. The Configurable Reward Manager

Create a separate class that takes the state and action and calculates a score based on a profile.

```python
class LifeRewardManager:
    def __init__(self, profile_weights):
        self.weights = profile_weights  # e.g., {'vitality': 0.8, 'progress': 0.2}

    def compute(self, state, action, info):
        # Calculate utility based on weights
        reward = (state['vitality'] * self.weights['vitality'] +
                  state['progress'] * self.weights['progress'])

        # Add "Hidden Trait" logic here (e.g., penalty for burnout)
        if state['vitality'] < 10:
            reward -= 50
        return reward
```
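To see the payoff, here is a self-contained usage sketch (it repeats the manager so it runs standalone; the weight dicts and meter values are made up). Swapping the weights dict regrades the same state without touching the engine:

```python
# Self-contained sketch: the same engine state graded under two profiles.
class LifeRewardManager:
    def __init__(self, profile_weights):
        self.weights = profile_weights

    def compute(self, state, action, info):
        reward = (state['vitality'] * self.weights['vitality'] +
                  state['progress'] * self.weights['progress'])
        if state['vitality'] < 10:
            reward -= 50  # burnout penalty
        return reward

# Two "personalities": only the weights differ.
workaholic = LifeRewardManager({'vitality': 0.2, 'progress': 0.8})
wellness = LifeRewardManager({'vitality': 0.8, 'progress': 0.2})

state = {'vitality': 40, 'progress': 90}  # illustrative post-step meters
r_work = workaholic.compute(state, 'deep_work', {})
r_well = wellness.compute(state, 'deep_work', {})
```

The identical state scores differently under each manager, which is exactly the "configurable" behaviour the hackathon asks for.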
## Why this is a "Hackathon Winning" Architecture:

1. Scalability: You can add a "Binge Watching Penalty" or an "Exercise Bonus" just by editing a dictionary, rather than digging through complex nested if statements in your step function.
2. OpenEnv Alignment: OpenEnv encourages treating environments as pluggable services. By decoupling rewards, you could use the same life-sim engine to train different types of agents (e.g., a "Productivity Bot" vs. a "Mental Health Bot") just by swapping the reward module.
3. Stability: It lets you implement reward clipping or normalization in one place to keep training stable. [2, 3, 4, 5]

## Designing the "Good Day"

With the reward logic separated out, you can define a "utility function" for a good day. For the hackathon, a good day could be defined as "maintaining all meters above 50% for 24 hours" rather than maximizing a single meter like money or work.

Shall we define the 5-6 core "Meters" (Vitality, Progress, etc.) that the reward manager will use to grade the agent?

[1] https://huggingface.co/docs/trl/openenv
[2] https://meta-pytorch.org/OpenEnv/guides/rewards.html
[3] https://www.gymlibrary.dev/api/wrappers/
[4] https://docs.nvidia.com/learning/physical-ai/getting-started-with-isaac-lab/latest/train-your-second-robot-with-isaac-lab/06-custom-reward-functions-and-hyperparameters.html
[5] https://huggingface.co/docs/trl/openenv
docs/round2/Plan_v2/Todo.md
ADDED
|
@@ -0,0 +1,14 @@
In the context of the Meta OpenEnv Hackathon, your role is the Architect, not the Player. Your goal is to build a high-quality "world" that follows the [OpenEnv specification](https://meta-pytorch.org/OpenEnv/index.html).

## Your "To-Do" List for the Hackathon:

1. State Space: Design the variables the agent sees (Energy, Time, Task List).
2. Action Space: Define what the agent can do (Work, Sleep, Socialize).
3. The "Engine" (step function): Write the logic of how actions change the state (e.g., "Working increases Stress but decreases Tasks").
4. The Reward System: This is your "grading" logic. You decide what a "good life" looks like mathematically.
5. Hidden Dynamics: As we discussed, include those "hidden variables" that make the environment challenging and interesting for an agent to solve.

## Why you don't need to worry about PPO/GRPO:

The judges will evaluate your project based on how "trainable" it is. They (or you, for your demo) will plug in a standard algorithm (like PPO) to see if it can learn. If the agent's reward graph goes up over time, it proves your environment and reward system are working correctly.

The Win Condition: A successful submission is an environment where a standard agent starts off "clueless" but eventually figures out your hidden person's traits and achieves a high score.

Should we focus on mapping out your action space (the specific things your agent can "do" to the person) to make sure it is diverse enough for the hackathon?
docs/round2/[External] Apr β26 OpenEnv Hackathon Themes & Judging Criteria.md
ADDED
|
@@ -0,0 +1,177 @@
# **Theme #1 - Multi-Agent Interactions**

Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings. This drives theory-of-mind reasoning and emergent strategic behavior.

**Expected Outcome**: an environment that can be used to train multi-agent task handling in an LLM

**Example environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.

# **Theme #2 - (Super) Long-Horizon Planning & Instruction Following**

You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. The goal is to enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes. The aim is to push beyond shallow next-token reasoning toward structured planning and durable internal representations.

**Expected Outcome**: an environment that can capture and improve LLM behaviour on challenging long-horizon tasks that need long-running sessions beyond context memory limits.

**Example environments:** (Think of OpenClaw workflows with multi-turn tasks.) Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).
# **Theme #3 - World Modeling**

## #3.1 Professional Tasks

Here you will develop environments that require real interaction with tools, APIs, or dynamic systems, where the model is expected to do real, hard work instead of exploiting shortcuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.

**Expected Outcome**: an environment capturing the nuances of a defined partially observable world and improving LLM interaction with it

**Example environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.

## #3.2 Personalized Tasks

Here we will develop an environment that offers real personalized task handling: imagine replying to personal messages, resolving dinner conflicts caused by work, or answering tough emails. Think of any personal-assistant task.

**Expected Outcome:** An environment that gives the model a realistic simulation of handling personal tasks and conflicts, and managing them as delegations

**Example environments:** Executive assistant meeting planner, dinner and drive planning, email and message replying, shopping, etc.
# **Theme #4 - Self-Improvement**

The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.

**Expected Outcome**: an environment for improving self-play of an LLM over a defined set of tasks

**Example environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.

## **Theme #5: Wild Card - Impress Us!**

We do not want to limit your focus. If your idea doesn't fit the boxes above, we want and WILL reward out-of-the-box tasks. Please be creative, but remember to make a submission that meaningfully adds value to LLM training on a certain task.
# **Guidelines for Problem Statement**

* It is **NOT** mandatory to choose the same problem statement as Round 1. Only choose the same problem statement if it aligns with the Hackathon themes above.
* You can start working on your problem statement once you have finalized it. Post-training can be done onsite on the 25th and 26th, when you receive HuggingFace compute credits.
* Before the onsite, we suggest you work on building the environment, agent behaviours, and reward model, and evaluate whether your work aligns with the [judging criteria](#bookmark=id.m45eoo902jo0) given below.
# **Judging Criteria**

**Minimum requirements**:

* Usage of OpenEnv (latest release)
* Show a minimal training script for your environment using Unsloth or HF TRL in Colab
* Write a mini-blog on HuggingFace or a mini-video on YouTube (under 2 minutes) talking about your submission
* Your OpenEnv-compliant environment should be hosted on Hugging Face Spaces.

**Judging Overview**

* **Evaluation:** Teams will be scored based on the following criteria:
  1. **Environment Innovation (40%):** Is the environment novel, creative, or challenging? Does it meaningfully test the agent's behavior?
  2. **Storytelling (30%):** Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
  3. **Showing Improvement in Rewards (20%):** Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
  4. **Reward and Training Script/Pipeline Setup (10%):** Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent's inference (how it acts in the environment)?
**OpenEnv Hackathon - What Judges Look For**

This guide tells you what makes a strong submission for the OpenEnv Hackathon (India 2026).
Read it before you start building, and again before you submit.

For the list of themes and example problems, refer to the top sections.

**NOTE:** Please remember: only one submission per team. If you have multiple ideas, pick the best one and go for it. Please make sure that the URL of your environment is submitted, as judges will pull the environment from the URL to evaluate it. Changes or commits after the submission deadline will not be considered.

**TL;DR**

Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story.

A messy but ambitious environment with real training evidence beats a polished but boring one.
Pick a problem that excites you (that energy comes through in the pitch).
**Judging Criteria**

| Criterion | Weight | What it means |
| :---- | :---- | :---- |
| Environment Innovation | 40% | Is the environment novel, creative, or genuinely challenging? Does it meaningfully test agent behavior in a way that hasn't been done before? |
| Storytelling & Presentation | 30% | Can you clearly explain the problem, the environment, and what the agent learned? Is the demo engaging and easy to follow for a non-technical audience? |
| Showing Improvement in Rewards | 20% | Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline -- anything that proves the agent learned something. |
| Reward & Training Pipeline | 10% | Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained agent's behavior? |
**Minimum Submission Requirements**

**NOTE:** These are **non-negotiable**. Submissions missing any of these are at a serious disadvantage.

- [ ] **Use OpenEnv** (latest release). Build on top of the framework; don't reinvent the wheel.
- [ ] **A working training script** using **Unsloth or Hugging Face TRL**, ideally as a Colab notebook so judges can re-run it.
- [ ] **Evidence that you actually trained**; at minimum, loss and reward plots from a real run.
- [ ] **A short writeup**: a mini-blog on Hugging Face, a < 2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck. Please make sure that all materials are linked from your README file so that judges can access them easily.
- [ ] **Push your environment to a Hugging Face Space** so it's discoverable and runnable.
- [ ] **A README** that motivates the problem, explains how the env works, and shows results.
- [ ] The README should link to the environment in the Hugging Face Space, along with all additional references to other materials (e.g. videos, blog posts, slides, presentations) that you want to include.
- [ ] Please do not include big video files in your env submission on HF Hub, as we would like to keep each env small (use URLs as reference links to additional materials).
**What Makes a Submission Stand Out**

***Pick an ambitious, original problem***
The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation, you need a genuinely fresh angle. Some questions to ask yourself:

* Does this environment exist to teach an LLM something it currently can't do well?
* Is the domain underexplored in RL/LLM training?
* Could a researcher write a paper about training on this?

***Design a reward signal that actually teaches***
A great environment has a reward function that:

* Provides a **rich, informative signal** (not just 0/1 at the end)
* Captures something **hard to measure** in a clever way
* Uses OpenEnv's **Rubric system** thoughtfully (composable rubrics > monolithic scoring)
* Is **hard to game**; an agent that exploits the reward without solving the task should not get high scores
***Show real training, end to end***
The bar isn't "training script exists." The bar is "training script runs against the environment, the agent learns, and you can show it." Concretely:

* Your training loop should connect to **your** environment (not a static dataset)
* Train long enough that the curves mean something
* Compare a **trained agent vs. a random/untrained baseline**; quantitative and/or qualitative
* Include the plots and numbers in your README and writeup

***Make your plots readable***
Reviewers spend seconds, not minutes, on each plot. Help them out:

* **Label both axes** (e.g. "training step" / "episode" on x, "reward" / "loss" on y) and include units where they apply
* Save plots as *.png* or *.jpg* and **commit them to the repo** (don't leave them only in a Colab cell or a deleted Wandb run; if you ran via Wandb, include the link to that specific run)
* **Embed the key plots in your README** with a one-line caption explaining what each one shows
* If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
***Tell a story, not an API doc***
Your README, blog, and pitch should answer:

1. **Problem:** what capability gap or interesting domain are you targeting?
2. **Environment:** what does the agent see, do, and get rewarded for?
3. **Results:** what changed after training? Show it.
4. **Why it matters:** who would care, and why?

*A reviewer should be able to read your README in 3-5 minutes and want to try your environment.*

**NOTE:** If you have a video, HF post, or anything else interesting, please make sure that it's linked from your README.

***Engineer it cleanly (table stakes)***
Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:

* Use OpenEnv's Environment / MCPEnvironment base classes properly
* Respect the **client / server separation** (clients should never import server internals)
* Follow the standard Gym-style API (reset, step, state)
* Have a valid openenv.yaml manifest
* Don't use reserved tool names (reset, step, state, close) for MCP tools

**Final Note**

Judges are looking for environments that push the frontier of what we can train LLMs to do. Be ambitious. Pick a problem you find genuinely interesting; that almost always produces better work than chasing what you think judges want. Good luck.
|
docs/{round2_confirmation.md → round2/confirmation.md}
RENAMED
|
File without changes
|
docs/{round2_design_notes.md → round2/design_notes.md}
RENAMED
|
File without changes
|
docs/round2/entity_definitions.md
ADDED
|
@@ -0,0 +1,220 @@
# RhythmEnv – Entity Definitions

## Episode Structure

```
1 episode = 1 week (7 days)
1 step = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots = 1 day
28 steps = 1 full week
```
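The `day` and `slot` fields in the observation are pure functions of the flat step index; a small sketch of the mapping (the function name is illustrative):

```python
# Derive the calendar position from the flat step index (0-27).
def calendar(timestep: int) -> tuple[int, int]:
    day = timestep // 4   # 0 = Monday ... 6 = Sunday
    slot = timestep % 4   # 0 = Morning, 1 = Afternoon, 2 = Evening, 3 = Night
    return day, slot

day, slot = calendar(27)  # last step of the week: Sunday Night
```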
---

## Observable State

What the agent sees in every observation. No hidden information here.

| Variable | Type | Range | Description |
|---|---|---|---|
| `timestep` | int | 0–27 | Current step (0 = Monday Morning) |
| `day` | int | 0–6 | Day of week (0 = Monday, 6 = Sunday) |
| `slot` | int | 0–3 | Time of day (0=Morning, 1=Afternoon, 2=Evening, 3=Night) |
| `vitality` | float | 0–1 | Physical energy and sleep quality |
| `cognition` | float | 0–1 | Mental clarity and focus |
| `progress` | float | 0–1 | Career and skill advancement made this week |
| `serenity` | float | 0–1 | Inner peace and stress management |
| `connection` | float | 0–1 | Relationship health |
| `active_event` | str\|null | – | Random event this step (null if none) |
| `remaining_steps` | int | 0–28 | Steps left in episode |
| `reward` | float | – | Reward received this step |
| `done` | bool | – | True on the final step |
| `reward_breakdown` | dict | – | Per-meter deltas; `final_score` when done |
---

## Actions

10 actions, always legal regardless of state.

| Action | Category | Primary Effect |
|---|---|---|
| `DEEP_WORK` | Productivity | +Progress (large), −Vitality, −Cognition |
| `ADMIN_WORK` | Productivity | +Progress (small), light drain |
| `LEARN` | Productivity | +Progress, slight +Serenity |
| `SLEEP` | Recovery | +Vitality (large), +Cognition |
| `EXERCISE` | Recovery | +Vitality, +Serenity |
| `MEDITATE` | Recovery | +Serenity (large), +Cognition |
| `FAMILY_TIME` | Social | +Connection (large), +Serenity |
| `SOCIALIZE` | Social | +Connection |
| `ME_TIME` | Leisure | +Serenity, +Vitality (small) |
| `BINGE_WATCH` | Leisure | +Serenity (small), −Cognition |
### Action Effect Matrix

Base deltas per action on each meter, **before** any profile modifiers or time-of-day multipliers.

| Action | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| deep_work | −0.12 | −0.10 | +0.18 | −0.05 | 0.00 |
| admin_work | −0.06 | −0.05 | +0.08 | −0.03 | 0.00 |
| learn | −0.08 | −0.08 | +0.12 | +0.02 | 0.00 |
| sleep | +0.20 | +0.10 | 0.00 | +0.05 | 0.00 |
| exercise | +0.12 | +0.05 | 0.00 | +0.08 | 0.00 |
| meditate | +0.03 | +0.08 | 0.00 | +0.15 | 0.00 |
| family_time | −0.04 | −0.02 | 0.00 | +0.06 | +0.15 |
| socialize | −0.06 | −0.03 | 0.00 | +0.04 | +0.12 |
| me_time | +0.05 | +0.03 | 0.00 | +0.10 | −0.02 |
| binge_watch | +0.02 | −0.05 | −0.02 | +0.06 | −0.03 |
---

## Hidden Personality Profiles

The person's identity. **Hidden from the agent.** Controls both reward weights and how actions affect meters. The agent must infer the active profile from reward patterns across episodes.

### Profile 1 – `introvert_morning`

**Reward weights:** Serenity 60%, Progress 20%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:
- Social vitality drain ×3.0 – socialising is exhausting, not neutral
- Morning (slot 0): cognition and progress gains ×2.0 – peak productivity window
- Solo time (me_time, meditate): serenity +0.10 bonus – recharges alone
- Binge watch triggers a shame spiral: serenity −0.15, cognition −0.06
- Connection passive decay: −0.01/step

**Agent discovers:** Mornings are sacred; social activities are costly; alone time heals.
---

### Profile 2 – `extrovert_night_owl`

**Reward weights:** Connection 75%, Progress 10%, Vitality 5%, Cognition 5%, Serenity 5%

Hidden modifiers:
- Social vitality drain ×0.2 – socialising energises, barely drains
- Morning (slot 0): cognition and progress gains ×0.4 penalty – groggy zone
- Evening/Night (slots 2–3): cognition and progress gains ×1.8 – peak zone
- Social actions: connection ×2.0 (double connection gain)
- Social actions: serenity +0.06 bonus – people lift mood
- Connection passive decay: −0.01/step

**Agent discovers:** Avoid cognitive work in the morning; socialise to charge up; do deep work in the evening.
---

### Profile 3 – `workaholic_stoic`

**Reward weights:** Progress 70%, Serenity 10%, Connection 10%, Vitality 5%, Cognition 5%

Hidden modifiers:
- Productive work (deep_work, learn, admin_work): vitality +0.06 recovery – energised by output
- Productive work: serenity +0.10 bonus – meaning comes from progress
- Idle actions (me_time, binge_watch, sleep when optional): serenity −0.10 – idle guilt
- Extra vitality passive decay: −0.04/step – burnout risk
- Random event negative impact ×0.5 – stoic resilience
- Connection passive decay: −0.02/step – faster relational drift

**Agent discovers:** Keep working; rest only when vitality is critical; connection decays as the price of neglect.

---
## Time-of-Day Multipliers

Applied to all non-sleep actions based on the current slot.

| Slot | Cognition Gain Multiplier | Vitality Drain Multiplier |
|---|---|---|
| 0 – Morning | ×1.2 | ×0.8 |
| 1 – Afternoon | ×1.0 | ×1.0 |
| 2 – Evening | ×0.8 | ×1.1 |
| 3 – Night | ×0.6 | ×1.3 |

These are global. Profile-specific time bonuses (HV1) layer on top.

---
## Passive Decays (every step, before action effects)

| Profile | Meter | Decay |
|---|---|---|
| All | Connection | −0.01/step |
| workaholic_stoic | Connection | −0.02/step (replaces the above) |
| workaholic_stoic | Vitality | −0.04/step |

---
## Random Events

Roll probability: 8% per step.

| Event | Vitality | Cognition | Progress | Serenity | Connection |
|---|---|---|---|---|---|
| prod_crash | −0.08 | −0.10 | −0.10 | −0.15 | 0.00 |
| family_emergency | −0.05 | −0.08 | 0.00 | −0.12 | −0.10 |
| illness | −0.20 | −0.10 | 0.00 | −0.05 | 0.00 |
| good_news | +0.05 | +0.03 | 0.00 | +0.10 | +0.05 |

Negative effects are reduced by `event_impact_multiplier` per profile (workaholic_stoic = 0.5; others = 1.0 or 0.8).

---
## Reward Computation

### Per-step reward

```
reward = sum(meter_delta × profile_weight for each meter) × 15.0
```
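A minimal sketch of this weighted sum. The weights are the `introvert_morning` profile from above and the deltas are the base `deep_work` row, applied *before* any modifiers, so the resulting number is illustrative only and will not match the post-modifier examples:

```python
REWARD_SCALE = 15.0  # the 15.0 scale constant from the formula above

def step_reward(deltas: dict, weights: dict) -> float:
    # Weighted sum of per-meter changes, scaled.
    return sum(deltas[m] * weights[m] for m in deltas) * REWARD_SCALE

# introvert_morning reward weights (from the profile section)
weights = {"vitality": 0.05, "cognition": 0.05, "progress": 0.20,
           "serenity": 0.60, "connection": 0.10}

# Base deep_work deltas (Action Effect Matrix row, no modifiers applied)
deltas = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
          "serenity": -0.05, "connection": 0.00}

r = step_reward(deltas, weights)
```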
Profile reward weights are **hidden**. Same action, different profile → very different reward.

Example – DEEP_WORK, step 1, same initial state:
```
workaholic_stoic: +1.57 (progress weight = 70%)
introvert_morning: +0.32 (serenity weight = 60%; deep work slightly drains serenity)
extrovert_night_owl: −0.39 (connection weight = 75%; deep work gives 0 connection)
```
### Modifiers applied during step (in order)

1. Roll and apply a random event (if any)
2. Get base action effects (ACTION_EFFECTS matrix)
3. Apply repetition dampening (same action 3× in a row → 25% / 50% / 75% effect reduction)
4. Apply time-of-day multipliers (cognition gain, vitality drain)
5. Apply profile-specific modifiers (HV1/HV2/HV3)
6. Apply the global vitality factor (`0.5 + 0.5 × vitality`): low vitality reduces positive effects
7. Apply passive decays (connection, workaholic vitality)
8. Clamp all meters to [0.0, 1.0]
9. Compute reward as weighted sum of deltas × REWARD_SCALE (15.0)
10. Apply critical floor penalty: any meter < 0.10 → −0.30
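Steps 6, 8, and 10 are purely numeric and can be sketched directly (the helper names and sample values are illustrative, not the actual server code):

```python
# Illustrative helpers for steps 6, 8 and 10 of the ordering above.
def vitality_factor(vitality: float) -> float:
    # Step 6: low vitality shrinks positive effects.
    return 0.5 + 0.5 * vitality

def clamp(meters: dict) -> dict:
    # Step 8: keep every meter inside [0, 1].
    return {k: min(1.0, max(0.0, v)) for k, v in meters.items()}

def floor_penalty(meters: dict) -> float:
    # Step 10: flat -0.30 if any meter is critically low.
    return -0.30 if any(v < 0.10 for v in meters.values()) else 0.0

meters = clamp({"vitality": 1.07, "serenity": -0.02, "progress": 0.4})
penalty = floor_penalty(meters)  # serenity clamps to 0.0, which is critical
```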

### Final grade (returned in `reward_breakdown["final_score"]` when `done=True`)

Score in [0.0, 1.0]:

```
score = 0.30 × balance_score      (mean of meters - variance)
      + 0.25 × crash_free_ratio   (1 - crash_count / total_possible_crashes)
      + 0.20 × progress           (final progress meter value)
      + 0.15 × connection         (final connection meter value)
      + 0.10 × efficiency_score   (avg step reward normalised to [0, 1])
```
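The grade formula translates directly into code; the function name and argument names are illustrative:

```python
# Final grade as a weighted sum of the five episode-level components above.
def final_score(balance, crash_free_ratio, progress, connection, efficiency):
    return (0.30 * balance
            + 0.25 * crash_free_ratio
            + 0.20 * progress
            + 0.15 * connection
            + 0.10 * efficiency)

# A perfect episode on every component grades (up to float noise) 1.0:
print(round(final_score(1.0, 1.0, 1.0, 1.0, 1.0), 6))
```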

---

## Internal Tracking Variables

Not in the observation. Used by the environment to compute rewards and the final grade.

| Variable | Description |
|---|---|
| `_profile` | Active profile dict (hidden from agent) |
| `_rng` | Seeded random instance for event rolls and profile selection |
| `_crash_count` | Steps where any meter fell below 0.10 |
| `_total_reward` | Running sum of step rewards for efficiency score |
| `_recent_actions` | Last N action names for repetition dampening |
| `_timestep` | Current step index (0–27) |
docs/round2/environment_design.md
ADDED
@@ -0,0 +1,209 @@
# Round 2 – Environment Design: RhythmEnv Life Simulator

## What We Built

A **Life Simulator** – a holistic resource-management RL environment where an agent learns
a specific person's hidden patterns through experience, not configuration.

**Core premise:** Personal AI assistants give generic advice. They don't learn *you*.
RhythmEnv is the training ground for an agent that must discover hidden personality dynamics
through reward signals alone – the same way a great personal assistant adapts over time.

---

## Why Life Simulator (Not Workday Scheduler)

Round 1 was a workday task scheduler (energy/stress, 20 steps, 4 actions, task deadlines).
Round 2 was rebuilt as a Life Simulator for a stronger learning signal and a clearer discovery challenge:

| | Workday Scheduler (Round 1) | Life Simulator (Round 2) |
|---|---|---|
| Episode | 1 day, 20 steps | 1 week, 28 steps |
| State | Energy, stress, task queue | 5 life meters |
| Actions | 4 (task management) | 10 (life activities) |
| Hidden mechanism | Circadian multiplier on tasks | Profile-specific reward weights + action modifiers |
| Learning signal | How to sequence tasks | Which actions serve *this specific person* |
| Pitch story | "Schedule better" | "Learn who you are" |

The Life Simulator creates a fundamentally **non-promptable discovery problem**: the agent
cannot know the person's profile from the prompt text – it must be discovered through reward
patterns across episodes. This is structurally different from a task that better prompting solves.

---

## The Discovery Challenge

Three hidden mechanism layers, each requiring different signal accumulation to discover:

### Layer 1 – Reward Weights (What Matters to This Person)

Same action, same starting state → wildly different rewards depending on the hidden profile:

```
DEEP_WORK, step 1, same initial state:
  workaholic_stoic:    +1.57  (progress weight = 70% – work = meaning)
  introvert_morning:   +0.32  (serenity weight = 60% – mild net gain)
  extrovert_night_owl: -0.39  (connection weight = 75% – work gives 0 connection)
```

An agent that doesn't adapt to the profile plateaus at ~0.60 final score.
One that discovers the profile's targets can push above 0.80.
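Because the profile is only visible through rewards, one agent-side probe is to compare the return of a single diagnostic action against the example signatures above. The thresholds here are illustrative assumptions, not environment constants:

```python
# Illustrative sketch: classify the hidden profile from the reward of one
# diagnostic DEEP_WORK probe, using the example signatures quoted above.
def guess_profile(deep_work_reward: float) -> str:
    if deep_work_reward > 1.0:
        return "workaholic_stoic"      # progress-weighted: strongly positive
    if deep_work_reward > 0.0:
        return "introvert_morning"     # mild net gain
    return "extrovert_night_owl"       # negative: work gives no connection

print(guess_profile(1.57))   # -> workaholic_stoic
print(guess_profile(-0.39))  # -> extrovert_night_owl
```

In practice an agent would accumulate this signal over several steps and episodes rather than one probe, since random events add noise to any single reward.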

### Layer 2 – Action Modifiers (How Actions Actually Affect This Person)

The base effect matrix is modified invisibly per profile:

| Profile | Hidden modifier | Observable signal |
|---|---|---|
| introvert_morning | Social drain ×3.0 | SOCIALIZE drains vitality 3× more than expected |
| introvert_morning | Morning deep work ×2.0 | Same action gives 2× progress at slot 0 |
| extrovert_night_owl | Morning penalty ×0.4 | DEEP_WORK in the morning gives 40% of expected progress |
| extrovert_night_owl | Evening/night bonus ×1.8 | Same action gives 1.8× progress at slots 2–3 |
| extrovert_night_owl | Social connection ×2.0 | SOCIALIZE gives 2× connection gain |
| workaholic_stoic | Work recovers vitality +0.06 | DEEP_WORK raises vitality instead of draining it |
| workaholic_stoic | Idle drains serenity -0.10 | ME_TIME/BINGE_WATCH lower serenity |

The agent sees the same meters and actions every episode.
The profile changes what actions *mean* in this episode.
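A sketch of how these per-profile modifiers could be layered onto base action effects. The numbers mirror the table above (extrovert morning penalty ×0.4, evening/night bonus ×1.8); the actual values and structure are hidden in `server/rhythm_environment.py`:

```python
# Base effects and profile modifiers; structure is illustrative.
BASE_EFFECTS = {"deep_work": {"progress": 0.15, "vitality": -0.05}}

# profile -> action -> {meter: slot -> multiplier}
PROFILE_MODIFIERS = {
    "extrovert_night_owl": {
        "deep_work": {
            "progress": lambda slot: 0.4 if slot == 0 else (1.8 if slot >= 2 else 1.0),
        },
    },
}

def effective_effects(profile: str, action: str, slot: int) -> dict:
    effects = dict(BASE_EFFECTS[action])
    for meter, mult in PROFILE_MODIFIERS.get(profile, {}).get(action, {}).items():
        effects[meter] = effects[meter] * mult(slot)
    return effects

# Morning deep work yields only 40% of base progress for the extrovert:
print(round(effective_effects("extrovert_night_owl", "deep_work", 0)["progress"], 3))
```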

### Layer 3 – Stress Spiral (Amplification Mechanics)

When serenity drops below the profile's stress tolerance, all negative effects amplify ×1.3.
The introvert's tolerance is highest (0.30), the extrovert's is mid (0.20), the stoic's is lowest (0.15).

This creates a compounding dynamic: wrong actions → serenity drops → worse outcomes → harder
recovery. The agent must learn to protect serenity proactively, not reactively.
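The amplification rule is small enough to sketch directly, using the tolerances quoted above:

```python
# Stress-spiral amplification: once serenity falls below the profile's
# tolerance, negative deltas are scaled by 1.3.
STRESS_TOLERANCE = {
    "introvert_morning": 0.30,
    "extrovert_night_owl": 0.20,
    "workaholic_stoic": 0.15,
}

def amplify(delta: float, serenity: float, profile: str) -> float:
    if delta < 0 and serenity < STRESS_TOLERANCE[profile]:
        return delta * 1.3   # the spiral: low serenity makes bad things worse
    return delta
```

At serenity 0.25 the introvert (tolerance 0.30) is already inside the spiral while the stoic (tolerance 0.15) is not, which is why the same bad week hurts different profiles differently.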

---

## Episode Structure

```
1 episode = 1 week
1 step    = 1 time slot (Morning / Afternoon / Evening / Night)
4 slots/day × 7 days = 28 steps total

Slot 0 – Morning   (HV1: cognition ×1.2, vitality drain ×0.8)
Slot 1 – Afternoon (HV1: neutral)
Slot 2 – Evening   (HV1: cognition ×0.8, vitality drain ×1.1)
Slot 3 – Night     (HV1: cognition ×0.6, vitality drain ×1.3)
```

Each `reset(seed, profile)` deterministically initialises state:
- Explicit `profile` kwarg → use that profile
- No profile → `seed % 3` selects the profile (the agent doesn't know which)
- The full episode is reproducible from the seed alone (random events included)
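The deterministic bookkeeping above is simple to sketch: seed-based profile selection and mapping a step index to a (day, slot) pair. The ordering of `PROFILES` is an assumption for illustration:

```python
# Seed -> profile and timestep -> (day, slot) mapping for a 28-step week.
PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
SLOTS = ["Morning", "Afternoon", "Evening", "Night"]

def select_profile(seed: int, profile=None) -> str:
    return profile if profile is not None else PROFILES[seed % 3]

def day_and_slot(timestep: int):
    return timestep // 4, SLOTS[timestep % 4]

print(day_and_slot(27))  # last step of the week -> (6, 'Night')
```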

---

## Observable vs Hidden

| Observable (agent sees every step) | Hidden (must discover from reward patterns) |
|---|---|
| All 5 meter values (0.0–1.0) | Which of the 3 profiles is active |
| Day of week (0–6) | Profile reward weights |
| Time slot (0–3) | Per-action modifiers for this profile |
| Active random event name (if any) | Stress tolerance threshold |
| Remaining steps | Connection decay rate |
| Per-meter reward deltas | Event impact multiplier |

---

## The Training Story

```
Untrained agent (random baseline):
  - No pattern to action selection
  - Misses optimal timing windows (morning for introverts, evening for extroverts)
  - Doesn't protect the serenity floor
  → final_score ≈ 0.60–0.70

Heuristic agent (rule-based, profile-blind):
  - Follows observable rules: sleep when vitality is low, meditate when serenity is low
  - Cannot differentiate the workaholic strategy from the introvert strategy
  - Misses profile-specific timing bonuses
  → final_score ≈ 0.75–0.82

GRPO-trained agent (after 500–1000 steps):
  - Discovers DEEP_WORK in the morning gives 2× progress for introvert profiles
  - Learns SOCIALIZE has opposite vitality effects for extrovert vs introvert
  - Adapts its overall strategy to the person's hidden reward structure
  → Target: final_score > 0.82, beating the heuristic on 2+ of 3 profiles
```

The training plot should show:
1. Mean reward increasing across GRPO steps
2. Trained-agent bar chart > heuristic bar chart for at least 2 profiles
3. Per-profile breakdown showing a differentiated learned strategy

---

## Anti-Reward-Hacking Measures

Three independent reward layers prevent gaming any single signal:

| Layer | Signal | Penalty for failure |
|---|---|---|
| `format_valid` | Output is a parseable action name | -2.0 |
| `action_legal` | Output is one of the 10 valid actions | -1.0 |
| `env_reward` | Real environment reward via episode replay | -3.0 |

Additional safeguards:
- **Repetition dampening**: Same action 3× in a row → 25%/50%/75% effect reduction (prevents spam)
- **Critical floor penalty**: Any meter < 0.10 → -0.30 per step (prevents neglect farming)
- **Random events** (8%/step): Prevents overfitting to deterministic trajectories
- **Seed-based replay**: `env_reward` replays the exact episode state via seed + action_history, so the reward cannot be fabricated
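The two cheap layers can be sketched as standalone reward functions; the parsing here is deliberately simplified, and the real versions live in `training/reward_functions.py`:

```python
# Simplified sketch of the format_valid and action_legal reward layers.
VALID_ACTIONS = {
    "deep_work", "admin_work", "learn", "exercise", "meditate",
    "socialize", "family_time", "me_time", "binge_watch", "sleep",
}

def format_valid(completion: str) -> float:
    # Penalise unparseable output (here: anything that isn't a single token).
    return 0.0 if completion.strip() and len(completion.split()) == 1 else -2.0

def action_legal(completion: str) -> float:
    return 0.0 if completion.strip().lower() in VALID_ACTIONS else -1.0

print(format_valid("deep_work"), action_legal("deep_work"))  # 0.0 0.0
print(format_valid("do stuff"), action_legal("fly"))         # -2.0 -1.0
```

The third layer, `env_reward`, cannot be sketched this compactly because it replays the real environment from the seed and action history, which is exactly what makes it hard to hack.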

---

## Alignment with Hackathon Themes

**Primary: Theme 3.2 – World Modeling: Personalized Tasks**

The environment models real personal assistant behaviour:
- The agent manages a person's week across competing life priorities
- Hidden profile = real individual differences in what matters and how actions affect a person
- Discovery through reward = how a good PA adapts over their first weeks on the job

**Secondary: Theme 2 – Long-Horizon Planning**

28 steps with delayed, compounding consequences:
- Neglecting connection decays it slowly, but each step makes recovery harder
- Progress must be built steadily – the final grader rewards sustained output
- Serenity meltdown is triggered by accumulated bad decisions, not a single step

---

## Implementation Reference

| Component | File | Lines |
|---|---|---|
| Environment | `server/rhythm_environment.py` | 577 |
| Data models | `models.py` | 89 |
| Training orchestrator | `training/train.py` | 202 |
| Dataset generator | `training/dataset.py` | 181 |
| Reward functions | `training/reward_functions.py` | 215 |
| Baseline evaluation | `training/inference_eval.py` | 227 |
| Colab notebook | `training/RhythmEnv_GRPO_Training.ipynb` | – |
| Gradio UI | `ui/app.py` | – |
| FastAPI server | `server/app.py` | 74 |

Key API:
```python
env = RhythmEnvironment()
obs = env.reset(seed=42, profile="introvert_morning")  # profile optional
obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
# obs.reward, obs.done, obs.reward_breakdown, obs.vitality, ...
```

---

## Open Questions – Decided

| Question | Decision |
|---|---|
| All 3 hidden variables or start with 2? | All 3 fully implemented |
| Do profiles change every episode? | Seed-based: same seed → same profile |
| Does profile affect which tasks appear? | No tasks in the Life Simulator; the profile affects action effects + reward weights |
| Add BUNDLE_TASKS action? | Skipped – the Life Simulator action space is complete at 10 |
| 7-day vs 1-day episodes? | 7-day (28 steps) – the long horizon is the point |
docs/{Hackathon Themes.md → round2/hackathon_themes.md}
RENAMED
|
File without changes
|
docs/round2/pitch_framing.md
ADDED
@@ -0,0 +1,57 @@
# Round 2 – Pitch Framing

## Why This Exists

Personal AI assistants give generic advice. They don't know you.
RhythmEnv is an environment where an agent learns YOUR specific patterns through experience – not configuration.

---

## The Product Vision

```
User installs the app
Agent runs episodes in the background
Over time → learns energy patterns, task preferences, peak hours
Result → a scheduler that actually knows YOU
```

No setup. No personality quiz. The agent figures you out.

---

## Why Simulation is a Valid Proxy

| Hackathon env | Real product |
|---|---|
| Simulated tasks | Real calendar + Notion + email |
| Simulated energy | Biometric data or self-report |
| Fixed scenarios | Dynamic, unpredictable days |

The mechanics are the same. The simulation is a controlled version of the real problem – which is exactly what RL training environments are for.

---

## What Makes This Hard for an LLM

Without hidden variables → the LLM already knows how to schedule by deadline. Nothing to learn.

With hidden variables → the LLM must discover YOUR specific rules:

```
YOUR energy cliff     (performance drops sharply below a threshold, not gradually)
YOUR peak hours       (certain tasks score better at certain times of day)
YOUR recovery curve   (consecutive breaks compound in ways that aren't obvious)
```

These aren't in the state. The agent discovers them through the reward signal across episodes.
That's the training story.

---

## The Pitch (3 minutes)

1. **Problem** – AI assistants are generic. They don't learn you.
2. **Environment** – A simulated workday with hidden personal patterns the agent must discover.
3. **Results** – Show reward curves improving as the agent learns the hidden variables.
4. **Why it matters** – This is the training ground for truly personalized AI.
docs/{round2_problem_statement.md → round2/problem_statement.md}
RENAMED
|
File without changes
|
eval_results.json
ADDED
@@ -0,0 +1,758 @@
[
  {
    "profile": "introvert_morning",
    "strategy": "heuristic",
    "seed": 0,
    "final_score": 0.7696,
    "total_reward": 5.84,
    "vitality": 0.76,
    "cognition": 0.41,
    "progress": 1.0,
    "serenity": 1.0,
    "connection": 0.43,
    "actions": [
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep"
    ]
  },
  {
    "profile": "introvert_morning",
    "strategy": "heuristic",
    "seed": 1,
    "final_score": 0.7526,
    "total_reward": 6.88,
    "vitality": 0.64,
    "cognition": 0.33,
    "progress": 1.0,
    "serenity": 0.87,
    "connection": 0.43,
    "actions": [
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep"
    ]
  },
  {
    "profile": "introvert_morning",
    "strategy": "heuristic",
    "seed": 2,
    "final_score": 0.7723,
    "total_reward": 8.97,
    "vitality": 0.84,
    "cognition": 0.44,
    "progress": 1.0,
    "serenity": 1.0,
    "connection": 0.39,
    "actions": [
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "exercise", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "exercise", "admin_work", "meditate", "sleep"
    ]
  },
  {
    "profile": "extrovert_night_owl",
    "strategy": "heuristic",
    "seed": 0,
    "final_score": 0.8197,
    "total_reward": 4.21,
    "vitality": 1.0,
    "cognition": 1.0,
    "progress": 1.0,
    "serenity": 1.0,
    "connection": 0.43,
    "actions": [
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep"
    ]
  },
  {
    "profile": "extrovert_night_owl",
    "strategy": "heuristic",
    "seed": 1,
    "final_score": 0.8209,
    "total_reward": 4.59,
    "vitality": 1.0,
    "cognition": 1.0,
    "progress": 0.97,
    "serenity": 1.0,
    "connection": 0.46,
    "actions": [
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep"
    ]
  },
  {
    "profile": "extrovert_night_owl",
    "strategy": "heuristic",
    "seed": 2,
    "final_score": 0.8164,
    "total_reward": 5.85,
    "vitality": 0.97,
    "cognition": 0.82,
    "progress": 0.93,
    "serenity": 1.0,
    "connection": 0.53,
    "actions": [
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "learn", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep"
    ]
  },
  {
    "profile": "workaholic_stoic",
    "strategy": "heuristic",
    "seed": 0,
    "final_score": 0.7461,
    "total_reward": 11.98,
    "vitality": 0.59,
    "cognition": 0.23,
    "progress": 1.0,
    "serenity": 0.95,
    "connection": 0.41,
    "actions": [
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "exercise", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep"
    ]
  },
  {
    "profile": "workaholic_stoic",
    "strategy": "heuristic",
    "seed": 1,
    "final_score": 0.7585,
    "total_reward": 12.67,
    "vitality": 0.72,
    "cognition": 0.32,
    "progress": 1.0,
    "serenity": 0.95,
    "connection": 0.38,
    "actions": [
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "exercise", "admin_work", "socialize", "sleep"
    ]
  },
  {
    "profile": "workaholic_stoic",
    "strategy": "heuristic",
    "seed": 2,
    "final_score": 0.7782,
    "total_reward": 13.16,
    "vitality": 0.74,
    "cognition": 0.34,
    "progress": 1.0,
    "serenity": 0.95,
    "connection": 0.44,
    "actions": [
      "deep_work", "learn", "meditate", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "deep_work", "admin_work", "socialize", "sleep",
      "exercise", "admin_work", "socialize", "sleep",
      "exercise", "admin_work", "socialize", "sleep",
      "exercise", "admin_work", "socialize", "sleep"
    ]
  },
  {
    "profile": "introvert_morning",
    "strategy": "random",
    "seed": 0,
    "final_score": 0.7141,
    "total_reward": 1.82,
    "vitality": 0.45,
    "cognition": 0.56,
    "progress": 0.66,
    "serenity": 1.0,
    "connection": 0.71,
    "actions": [
      "socialize", "me_time", "binge_watch", "socialize",
      "exercise", "family_time", "sleep", "admin_work",
      "meditate", "binge_watch", "admin_work", "deep_work",
      "sleep", "meditate", "sleep", "family_time",
      "exercise", "deep_work", "admin_work", "meditate",
      "socialize", "binge_watch", "exercise", "meditate",
      "learn", "socialize", "admin_work", "sleep"
    ]
  },
  {
    "profile": "introvert_morning",
    "strategy": "random",
    "seed": 1,
    "final_score": 0.6924,
    "total_reward": 4.02,
    "vitality": 0.08,
    "cognition": 0.5,
    "progress": 0.76,
    "serenity": 0.97,
    "connection": 0.68,
    "actions": [
      "exercise", "meditate", "sleep", "meditate",
      "meditate", "me_time", "meditate", "learn",
      "meditate", "meditate", "socialize", "socialize",
      "deep_work", "meditate", "socialize", "deep_work",
      "meditate", "sleep", "learn", "socialize",
      "deep_work", "socialize", "learn", "sleep",
      "family_time", "meditate", "meditate", "admin_work"
    ]
  },
  {
    "profile": "introvert_morning",
    "strategy": "random",
    "seed": 2,
    "final_score": 0.6715,
    "total_reward": 6.12,
    "vitality": 0.61,
    "cognition": 0.22,
    "progress": 1.0,
    "serenity": 0.86,
    "connection": 0.17,
    "actions": [
      "me_time", "meditate", "learn", "meditate",
      "learn", "family_time", "deep_work", "family_time",
      "me_time", "admin_work", "sleep", "meditate",
      "sleep", "admin_work", "meditate", "me_time",
      "sleep", "sleep", "binge_watch", "admin_work",
      "deep_work", "admin_work", "admin_work", "binge_watch",
      "learn", "sleep", "me_time", "deep_work"
    ]
  },
  {
    "profile": "extrovert_night_owl",
    "strategy": "random",
    "seed": 0,
    "final_score": 0.9368,
    "total_reward": 8.5,
    "vitality": 1.0,
    "cognition": 0.83,
    "progress": 1.0,
    "serenity": 1.0,
    "connection": 0.98,
    "actions": [
      "socialize", "me_time", "binge_watch", "socialize",
      "exercise", "family_time", "sleep",
+
"admin_work",
|
| 526 |
+
"meditate",
|
| 527 |
+
"binge_watch",
|
| 528 |
+
"admin_work",
|
| 529 |
+
"deep_work",
|
| 530 |
+
"sleep",
|
| 531 |
+
"meditate",
|
| 532 |
+
"sleep",
|
| 533 |
+
"family_time",
|
| 534 |
+
"exercise",
|
| 535 |
+
"deep_work",
|
| 536 |
+
"admin_work",
|
| 537 |
+
"meditate",
|
| 538 |
+
"socialize",
|
| 539 |
+
"binge_watch",
|
| 540 |
+
"exercise",
|
| 541 |
+
"meditate",
|
| 542 |
+
"learn",
|
| 543 |
+
"socialize",
|
| 544 |
+
"admin_work",
|
| 545 |
+
"sleep"
|
| 546 |
+
]
|
| 547 |
+
},
|
| 548 |
+
{
|
| 549 |
+
"profile": "extrovert_night_owl",
|
| 550 |
+
"strategy": "random",
|
| 551 |
+
"seed": 1,
|
| 552 |
+
"final_score": 0.9054,
|
| 553 |
+
"total_reward": 8.2,
|
| 554 |
+
"vitality": 0.75,
|
| 555 |
+
"cognition": 0.69,
|
| 556 |
+
"progress": 1.0,
|
| 557 |
+
"serenity": 0.97,
|
| 558 |
+
"connection": 0.97,
|
| 559 |
+
"actions": [
|
| 560 |
+
"exercise",
|
| 561 |
+
"meditate",
|
| 562 |
+
"sleep",
|
| 563 |
+
"meditate",
|
| 564 |
+
"meditate",
|
| 565 |
+
"me_time",
|
| 566 |
+
"meditate",
|
| 567 |
+
"learn",
|
| 568 |
+
"meditate",
|
| 569 |
+
"meditate",
|
| 570 |
+
"socialize",
|
| 571 |
+
"socialize",
|
| 572 |
+
"deep_work",
|
| 573 |
+
"meditate",
|
| 574 |
+
"socialize",
|
| 575 |
+
"deep_work",
|
| 576 |
+
"meditate",
|
| 577 |
+
"sleep",
|
| 578 |
+
"learn",
|
| 579 |
+
"socialize",
|
| 580 |
+
"deep_work",
|
| 581 |
+
"socialize",
|
| 582 |
+
"learn",
|
| 583 |
+
"sleep",
|
| 584 |
+
"family_time",
|
| 585 |
+
"meditate",
|
| 586 |
+
"meditate",
|
| 587 |
+
"admin_work"
|
| 588 |
+
]
|
| 589 |
+
},
|
| 590 |
+
{
|
| 591 |
+
"profile": "extrovert_night_owl",
|
| 592 |
+
"strategy": "random",
|
| 593 |
+
"seed": 2,
|
| 594 |
+
"final_score": 0.7462,
|
| 595 |
+
"total_reward": 4.12,
|
| 596 |
+
"vitality": 0.75,
|
| 597 |
+
"cognition": 0.32,
|
| 598 |
+
"progress": 1.0,
|
| 599 |
+
"serenity": 0.95,
|
| 600 |
+
"connection": 0.4,
|
| 601 |
+
"actions": [
|
| 602 |
+
"me_time",
|
| 603 |
+
"meditate",
|
| 604 |
+
"learn",
|
| 605 |
+
"meditate",
|
| 606 |
+
"learn",
|
| 607 |
+
"family_time",
|
| 608 |
+
"deep_work",
|
| 609 |
+
"family_time",
|
| 610 |
+
"me_time",
|
| 611 |
+
"admin_work",
|
| 612 |
+
"sleep",
|
| 613 |
+
"meditate",
|
| 614 |
+
"sleep",
|
| 615 |
+
"admin_work",
|
| 616 |
+
"meditate",
|
| 617 |
+
"me_time",
|
| 618 |
+
"sleep",
|
| 619 |
+
"sleep",
|
| 620 |
+
"binge_watch",
|
| 621 |
+
"admin_work",
|
| 622 |
+
"deep_work",
|
| 623 |
+
"admin_work",
|
| 624 |
+
"admin_work",
|
| 625 |
+
"binge_watch",
|
| 626 |
+
"learn",
|
| 627 |
+
"sleep",
|
| 628 |
+
"me_time",
|
| 629 |
+
"deep_work"
|
| 630 |
+
]
|
| 631 |
+
},
|
| 632 |
+
{
|
| 633 |
+
"profile": "workaholic_stoic",
|
| 634 |
+
"strategy": "random",
|
| 635 |
+
"seed": 0,
|
| 636 |
+
"final_score": 0.6185,
|
| 637 |
+
"total_reward": 4.07,
|
| 638 |
+
"vitality": 0.4,
|
| 639 |
+
"cognition": 0.52,
|
| 640 |
+
"progress": 0.55,
|
| 641 |
+
"serenity": 0.95,
|
| 642 |
+
"connection": 0.41,
|
| 643 |
+
"actions": [
|
| 644 |
+
"socialize",
|
| 645 |
+
"me_time",
|
| 646 |
+
"binge_watch",
|
| 647 |
+
"socialize",
|
| 648 |
+
"exercise",
|
| 649 |
+
"family_time",
|
| 650 |
+
"sleep",
|
| 651 |
+
"admin_work",
|
| 652 |
+
"meditate",
|
| 653 |
+
"binge_watch",
|
| 654 |
+
"admin_work",
|
| 655 |
+
"deep_work",
|
| 656 |
+
"sleep",
|
| 657 |
+
"meditate",
|
| 658 |
+
"sleep",
|
| 659 |
+
"family_time",
|
| 660 |
+
"exercise",
|
| 661 |
+
"deep_work",
|
| 662 |
+
"admin_work",
|
| 663 |
+
"meditate",
|
| 664 |
+
"socialize",
|
| 665 |
+
"binge_watch",
|
| 666 |
+
"exercise",
|
| 667 |
+
"meditate",
|
| 668 |
+
"learn",
|
| 669 |
+
"socialize",
|
| 670 |
+
"admin_work",
|
| 671 |
+
"sleep"
|
| 672 |
+
]
|
| 673 |
+
},
|
| 674 |
+
{
|
| 675 |
+
"profile": "workaholic_stoic",
|
| 676 |
+
"strategy": "random",
|
| 677 |
+
"seed": 1,
|
| 678 |
+
"final_score": 0.6094,
|
| 679 |
+
"total_reward": 5.39,
|
| 680 |
+
"vitality": 0.04,
|
| 681 |
+
"cognition": 0.55,
|
| 682 |
+
"progress": 0.6,
|
| 683 |
+
"serenity": 1.0,
|
| 684 |
+
"connection": 0.44,
|
| 685 |
+
"actions": [
|
| 686 |
+
"exercise",
|
| 687 |
+
"meditate",
|
| 688 |
+
"sleep",
|
| 689 |
+
"meditate",
|
| 690 |
+
"meditate",
|
| 691 |
+
"me_time",
|
| 692 |
+
"meditate",
|
| 693 |
+
"learn",
|
| 694 |
+
"meditate",
|
| 695 |
+
"meditate",
|
| 696 |
+
"socialize",
|
| 697 |
+
"socialize",
|
| 698 |
+
"deep_work",
|
| 699 |
+
"meditate",
|
| 700 |
+
"socialize",
|
| 701 |
+
"deep_work",
|
| 702 |
+
"meditate",
|
| 703 |
+
"sleep",
|
| 704 |
+
"learn",
|
| 705 |
+
"socialize",
|
| 706 |
+
"deep_work",
|
| 707 |
+
"socialize",
|
| 708 |
+
"learn",
|
| 709 |
+
"sleep",
|
| 710 |
+
"family_time",
|
| 711 |
+
"meditate",
|
| 712 |
+
"meditate",
|
| 713 |
+
"admin_work"
|
| 714 |
+
]
|
| 715 |
+
},
|
| 716 |
+
{
|
| 717 |
+
"profile": "workaholic_stoic",
|
| 718 |
+
"strategy": "random",
|
| 719 |
+
"seed": 2,
|
| 720 |
+
"final_score": 0.5782,
|
| 721 |
+
"total_reward": 7.33,
|
| 722 |
+
"vitality": 0.4,
|
| 723 |
+
"cognition": 0.23,
|
| 724 |
+
"progress": 0.88,
|
| 725 |
+
"serenity": 0.99,
|
| 726 |
+
"connection": 0.0,
|
| 727 |
+
"actions": [
|
| 728 |
+
"me_time",
|
| 729 |
+
"meditate",
|
| 730 |
+
"learn",
|
| 731 |
+
"meditate",
|
| 732 |
+
"learn",
|
| 733 |
+
"family_time",
|
| 734 |
+
"deep_work",
|
| 735 |
+
"family_time",
|
| 736 |
+
"me_time",
|
| 737 |
+
"admin_work",
|
| 738 |
+
"sleep",
|
| 739 |
+
"meditate",
|
| 740 |
+
"sleep",
|
| 741 |
+
"admin_work",
|
| 742 |
+
"meditate",
|
| 743 |
+
"me_time",
|
| 744 |
+
"sleep",
|
| 745 |
+
"sleep",
|
| 746 |
+
"binge_watch",
|
| 747 |
+
"admin_work",
|
| 748 |
+
"deep_work",
|
| 749 |
+
"admin_work",
|
| 750 |
+
"admin_work",
|
| 751 |
+
"binge_watch",
|
| 752 |
+
"learn",
|
| 753 |
+
"sleep",
|
| 754 |
+
"me_time",
|
| 755 |
+
"deep_work"
|
| 756 |
+
]
|
| 757 |
+
}
|
| 758 |
+
]
|
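The random-baseline records above can be summarized into per-profile averages to compare how each hidden profile scores under the same random action sequences. A minimal sketch (only profile names and `final_score` values are copied from the records above; `mean_score_by_profile` is a hypothetical helper, not part of the repo):

```python
from collections import defaultdict

# Subset of the episode records above: (profile, seed) -> final_score.
records = [
    {"profile": "introvert_morning", "seed": 0, "final_score": 0.7141},
    {"profile": "introvert_morning", "seed": 1, "final_score": 0.6924},
    {"profile": "introvert_morning", "seed": 2, "final_score": 0.6715},
    {"profile": "extrovert_night_owl", "seed": 0, "final_score": 0.9368},
    {"profile": "extrovert_night_owl", "seed": 1, "final_score": 0.9054},
    {"profile": "extrovert_night_owl", "seed": 2, "final_score": 0.7462},
    {"profile": "workaholic_stoic", "seed": 0, "final_score": 0.6185},
    {"profile": "workaholic_stoic", "seed": 1, "final_score": 0.6094},
    {"profile": "workaholic_stoic", "seed": 2, "final_score": 0.5782},
]

def mean_score_by_profile(records):
    """Average final_score per hidden profile, rounded to 4 decimals."""
    scores = defaultdict(list)
    for rec in records:
        scores[rec["profile"]].append(rec["final_score"])
    return {p: round(sum(v) / len(v), 4) for p, v in scores.items()}

print(mean_score_by_profile(records))
```

Even under identical random policies, the three profiles separate cleanly (extrovert_night_owl scores highest on average), which is what the GRPO reward signal has to exploit.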
inference.py
CHANGED

```diff
@@ -5,7 +5,7 @@
 # LICENSE file in the root directory of this source tree.

 """
-RhythmEnv Inference Script
+RhythmEnv Life Simulator - Inference Script
 ===================================
 MANDATORY
 - Before submitting, ensure the following variables are defined in your environment configuration:
@@ -64,29 +64,40 @@ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
 MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
 BASE_URL = os.getenv("RHYTHM_ENV_URL", "https://InosLihka-rhythm-env.hf.space")
 BENCHMARK = "rhythm_env"
+# Tasks map to seed values: seed 0 = introvert_morning, 1 = extrovert_night_owl, 2 = workaholic_stoic
+TASKS = ["profile_0", "profile_1", "profile_2"]
+TASK_SEEDS = {"profile_0": 0, "profile_1": 1, "profile_2": 2}
+MAX_STEPS = 28
 SCORE_THRESHOLD = 0.1

+SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
+DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
+
 SYSTEM_PROMPT = textwrap.dedent("""\
-    SWITCH_TASK <task_id>
-    TAKE_BREAK
+    You are a life management agent. You balance 5 life meters over a 7-day week.
+    Each day has 4 slots: Morning (0), Afternoon (1), Evening (2), Night (3).
+    Total: 28 steps.
+
+    Meters (0.0 to 1.0):
+    - Vitality: physical energy, sleep quality
+    - Cognition: mental clarity, focus
+    - Progress: career/skill advancement
+    - Serenity: inner peace, stress management
+    - Connection: relationship health
+
+    Available actions (respond with EXACTLY the action name):
+    DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
+    FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH
+
+    Key rules:
+    - Low Vitality reduces effectiveness of all positive actions.
+    - If ANY meter drops below 0.1, you get a heavy penalty.
+    - Connection decays passively if you don't socialize.
+    - Different episodes may have different hidden dynamics - adapt!
+    - Time of day matters: cognition is generally best in the morning.
+    - Balance all meters - neglecting any one will hurt your score.
+
+    Respond with ONLY the action name, nothing else.""")

 # ---------------------------------------------------------------------------
@@ -115,63 +126,78 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No

 # ---------------------------------------------------------------------------
 # Heuristic action selection
 # ---------------------------------------------------------------------------

 def choose_action_heuristic(obs) -> RhythmAction:
+    """Priority-based heuristic: critical recovery -> time-appropriate -> balance."""
+    slot = obs.slot
+    vitality = obs.vitality
+    cognition = obs.cognition
+    serenity = obs.serenity
+    connection = obs.connection
+    progress = obs.progress
+
+    # Critical recovery: prevent any meter from crashing
+    if vitality < 0.15:
+        return RhythmAction(action_type=ActionType.SLEEP)
+    if serenity < 0.15:
+        return RhythmAction(action_type=ActionType.MEDITATE)
+    if connection < 0.15:
+        return RhythmAction(action_type=ActionType.FAMILY_TIME)
+
+    # Night slot: prioritize sleep unless critical
+    if slot == 3:
+        if vitality < 0.5:
+            return RhythmAction(action_type=ActionType.SLEEP)
+        if connection < 0.3:
+            return RhythmAction(action_type=ActionType.FAMILY_TIME)
+        return RhythmAction(action_type=ActionType.SLEEP)
+
+    # Morning: productivity if able
+    if slot == 0:
+        if vitality > 0.4 and cognition > 0.3:
+            return RhythmAction(action_type=ActionType.DEEP_WORK)
+        if vitality < 0.4:
+            return RhythmAction(action_type=ActionType.EXERCISE)
+        return RhythmAction(action_type=ActionType.ADMIN_WORK)
+
+    # Afternoon: balanced mix
+    if slot == 1:
+        if connection < 0.3:
+            return RhythmAction(action_type=ActionType.FAMILY_TIME)
+        if progress < 0.3 and vitality > 0.3:
+            return RhythmAction(action_type=ActionType.LEARN)
+        if serenity < 0.4:
+            return RhythmAction(action_type=ActionType.MEDITATE)
+        return RhythmAction(action_type=ActionType.ADMIN_WORK)
+
+    # Evening: social and recovery
+    if connection < 0.4:
+        return RhythmAction(action_type=ActionType.SOCIALIZE)
+    if serenity < 0.5:
+        return RhythmAction(action_type=ActionType.ME_TIME)
+    if vitality < 0.4:
+        return RhythmAction(action_type=ActionType.EXERCISE)
+    return RhythmAction(action_type=ActionType.MEDITATE)


 def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
     """Use LLM to pick an action, fall back to heuristic on failure."""
+    day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
+    slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
+    event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
+
     user_prompt = textwrap.dedent(f"""\
-        Step: {obs.timestep}/{MAX_STEPS}
-        Energy: {obs.energy:.2f}
-        Stress: {obs.stress:.2f}
-        Current task: {obs.current_task_id}
-        Meetings at steps: {obs.meetings}
+        Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
         Remaining steps: {obs.remaining_steps}

+        Meters:
+        Vitality: {obs.vitality:.2f}
+        Cognition: {obs.cognition:.2f}
+        Progress: {obs.progress:.2f}
+        Serenity: {obs.serenity:.2f}
+        Connection: {obs.connection:.2f}{event_str}

         Choose your action:""")

@@ -187,38 +213,27 @@ Choose your action:""")
             stream=False,
         )
         text = (completion.choices[0].message.content or "").strip()
-        return parse_llm_action(text, obs)
+        return parse_llm_action(text)
     except Exception:
         return choose_action_heuristic(obs)


-def parse_llm_action(text: str, obs) -> RhythmAction:
+def parse_llm_action(text: str) -> RhythmAction:
     """Parse LLM response text into a RhythmAction."""
-    text = text.strip().upper()
-
-    if text.startswith("TAKE_BREAK"):
-        return RhythmAction(action_type=ActionType.TAKE_BREAK)
-
-        if text.startswith(prefix):
-            rest = text[len(prefix):].strip()
-            try:
-                task_id = int(rest)
-                if 0 <= task_id < len(obs.tasks):
-                    return RhythmAction(action_type=action_type, task_id=task_id)
-            except ValueError:
-                pass
+    text = text.strip().upper().replace(" ", "_")
+
+    # Try direct match
+    for action_type in ActionType:
+        if action_type.value.upper() == text:
+            return RhythmAction(action_type=action_type)
+
+    # Try partial match
+    for action_type in ActionType:
+        if action_type.value.upper() in text:
+            return RhythmAction(action_type=action_type)
+
+    # Default fallback
+    return RhythmAction(action_type=ActionType.SLEEP)


 # ---------------------------------------------------------------------------
@@ -226,7 +241,9 @@ def parse_llm_action(text: str, obs) -> RhythmAction:
 # ---------------------------------------------------------------------------

 async def run_task(task_name: str, llm_client: OpenAI) -> float:
-    """Run a single task and return the score."""
+    """Run a single task (profile) and return the score."""
+    seed = TASK_SEEDS.get(task_name, 0)
+
     if IMAGE_NAME:
         env = await RhythmEnv.from_docker_image(IMAGE_NAME)
     else:
@@ -241,7 +258,7 @@ async def run_task(task_name: str, llm_client: OpenAI) -> float:

     try:
         async with env:
-            result = await env.reset()
+            result = await env.reset(seed=seed)

             for step in range(1, MAX_STEPS + 1):
                 if result.done:
@@ -254,8 +271,6 @@ async def run_task(task_name: str, llm_client: OpenAI) -> float:
                 action = choose_action_heuristic(result.observation)

                 action_str = action.action_type.value
-                if action.task_id is not None:
-                    action_str += f"({action.task_id})"

                 result = await env.step(action)
```
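The exact-then-partial matching order in `parse_llm_action` can be sketched in isolation. A minimal, self-contained version (a stand-in three-member `ActionType` replaces the project's ten-action enum, and a plain enum member replaces the `RhythmAction` wrapper):

```python
from enum import Enum

class ActionType(str, Enum):
    DEEP_WORK = "deep_work"
    SLEEP = "sleep"
    ME_TIME = "me_time"

def parse(text: str) -> ActionType:
    # Normalize: uppercase, spaces -> underscores
    text = text.strip().upper().replace(" ", "_")
    # 1) exact match against the enum values
    for a in ActionType:
        if a.value.upper() == text:
            return a
    # 2) substring match, so "I choose ME TIME." still resolves
    for a in ActionType:
        if a.value.upper() in text:
            return a
    # 3) safe default when the LLM output is unparseable
    return ActionType.SLEEP

assert parse("deep work") is ActionType.DEEP_WORK      # exact after normalization
assert parse("I choose ME TIME.") is ActionType.ME_TIME  # substring match
assert parse("???") is ActionType.SLEEP                # fallback
```

The substring pass makes the parser tolerant of chatty completions, and the fallback guarantees every episode step produces a legal action.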
models.py
CHANGED

```diff
@@ -5,105 +5,84 @@
 # LICENSE file in the root directory of this source tree.

 """
-Data models for RhythmEnv
+Data models for RhythmEnv Life Simulator.

+A holistic life resource management RL environment where an agent balances
+5 life meters (Vitality, Cognition, Progress, Serenity, Connection) across
+a 7-day week with hidden personality profiles.
 """

 from __future__ import annotations

 from enum import Enum
+from typing import Dict, Optional

 from openenv.core.env_server import Action, Observation, State
+from pydantic import Field


 class ActionType(str, Enum):
+    """Available actions for the life simulator agent."""
+
+    # Productivity
+    DEEP_WORK = "deep_work"
+    ADMIN_WORK = "admin_work"
+    LEARN = "learn"
+    # Recovery
+    SLEEP = "sleep"
+    EXERCISE = "exercise"
+    MEDITATE = "meditate"
+    # Social
+    FAMILY_TIME = "family_time"
+    SOCIALIZE = "socialize"
+    # Leisure
+    ME_TIME = "me_time"
+    BINGE_WATCH = "binge_watch"


 class RhythmAction(Action):
-    """
-    Action for RhythmEnv.
-
-    Attributes:
-        action_type: The type of action to perform.
-        task_id: Task index (required for START_TASK and SWITCH_TASK).
-    """
+    """Action for the Life Simulator. Agent chooses one of 10 activity types."""

     action_type: ActionType
-    task_id: Optional[int] = None
-
-
-class TaskInfo(BaseModel):
-    """
-    Information about a single task visible to the agent.
-
-    Attributes:
-        id: Unique task identifier.
-        name: Human-readable task name.
-        description: Brief description of what the task involves.
-        effort: Total work required (0-1 scale).
-        progress: Work completed so far (0 to effort).
-        deadline: Timestep by which task should be done.
-        importance: How important this task is (0-1).
-    """
-
-    id: int
-    name: str
-    description: str = ""
-    effort: float
-    progress: float
-    deadline: int
-    importance: float


 class RhythmObservation(Observation):
     """
-        energy: Agent energy level (0-1).
-        stress: Agent stress level (0-1).
-        current_task_id: ID of task currently being worked on, or None.
-        tasks: List of all tasks with current progress.
-        meetings: Timesteps blocked by meetings.
-        remaining_steps: Steps left in the episode.
-        reward_breakdown: Component-wise reward details.
+    Observation returned to the agent each step.
+
+    The agent sees all 5 meters and temporal context, but NOT the hidden
+    personality profile or reward weight decomposition.
     """

     timestep: int = 0
+    day: int = 0
+    slot: int = 0
+    vitality: float = 0.8
+    cognition: float = 0.7
+    progress: float = 0.0
+    serenity: float = 0.7
+    connection: float = 0.5
+    active_event: Optional[str] = None
+    remaining_steps: int = 28
+    reward: float = 0.0
+    done: bool = False
     reward_breakdown: Dict[str, float] = Field(default_factory=dict)


 class RhythmState(State):
     """
-        task_name: Name of the current scenario (easy/medium/hard).
-        timestep: Current 30-minute slot.
-        energy: Agent energy level.
-        stress: Agent stress level.
-        current_task_id: ID of task currently being worked on.
+    Internal state for the Life Simulator.
+
+    Includes profile_name which is hidden from the observation.
     """

-    task_name: str = ""
     timestep: int = 0
+    day: int = 0
+    slot: int = 0
+    profile_name: str = ""
+    vitality: float = 0.8
+    cognition: float = 0.7
+    progress: float = 0.0
+    serenity: float = 0.7
+    connection: float = 0.5
+    active_event: Optional[str] = None
```
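The observation/state split above is what keeps `profile_name` hidden: the state carries it, the observation never does. A minimal stdlib sketch of that hiding step (dataclasses stand in for the project's pydantic models, and `to_observation` is a hypothetical helper, not the environment's actual API):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Obs:
    # Fields the agent is allowed to see (defaults mirror the models above)
    timestep: int = 0
    day: int = 0
    slot: int = 0
    vitality: float = 0.8
    cognition: float = 0.7
    progress: float = 0.0
    serenity: float = 0.7
    connection: float = 0.5
    active_event: Optional[str] = None

@dataclass
class StateWithProfile(Obs):
    profile_name: str = ""  # hidden: must be stripped before the agent sees it

def to_observation(state: StateWithProfile) -> Obs:
    fields = asdict(state)
    fields.pop("profile_name")  # drop the hidden field
    return Obs(**fields)

state = StateWithProfile(profile_name="introvert_morning")
obs = to_observation(state)
assert "profile_name" not in asdict(obs)
```

Because the profile only ever lives on the state object, the trained policy must infer the hidden dynamics from meter trajectories rather than reading them directly.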
pyproject.toml
CHANGED

```diff
@@ -10,14 +10,15 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "openenv-rhythm-env"
+version = "0.2.0"
+description = "RhythmEnv - Life Simulator RL Environment for OpenEnv"
 requires-python = ">=3.10"
 dependencies = [
     "openenv-core[core]>=0.2.2",
     "fastapi>=0.115.0",
     "pydantic>=2.0.0",
     "uvicorn>=0.24.0",
+    "gradio>=4.0.0",
 ]

 [project.optional-dependencies]
```
server/requirements.txt
CHANGED

```diff
@@ -2,3 +2,4 @@ openenv-core[core]>=0.2.2
 fastapi>=0.115.0
 uvicorn>=0.24.0
 pydantic>=2.0.0
+gradio>=4.0.0
```
server/rhythm_environment.py
CHANGED

```diff
@@ -5,215 +5,185 @@
 # LICENSE file in the root directory of this source tree.

 """
-RhythmEnv Environment Implementation.
 """

 from uuid import uuid4

 from openenv.core.env_server import Environment
 from openenv.core.env_server.types import EnvironmentMetadata

-# Support both in-repo and standalone imports
 try:
-    from ..models import (
-        ActionType,
-        RhythmAction,
-        RhythmObservation,
-        RhythmState,
-        TaskInfo,
-    )
 except (ImportError, ModuleNotFoundError):
-    from models import (
-        ActionType,
-        RhythmAction,
-        RhythmObservation,
-        RhythmState,
-        TaskInfo,
-    )

 # ---------------------------------------------------------------------------
 # ---------------------------------------------------------------------------

-    "easy": {
-        "tasks": [
-            {
-                "progress": 0.0,
-                "deadline": 10,
-                "importance": 0.9,
-            },
-            {
-                "id": 1,
-                "name": "Client Emails",
-                "description": "Respond to 12 pending client inquiries about pricing updates and contract renewals.",
-                "effort": 0.45,
-                "progress": 0.0,
-                "deadline": 13,
-                "importance": 0.3,
-            },
-            {
-                "id": 2,
-                "name": "Expense Filing",
-                "description": "Submit last month's travel receipts and categorize team expenses in the accounting system.",
-                "effort": 0.35,
-                "progress": 0.0,
-                "deadline": 16,
-                "importance": 0.2,
-            },
-        ],
-        "meetings": [3, 11],
-        "initial_energy": 0.75,
-    },
-    "medium": {
-        "scenario": "You are a product manager with a client pitch tomorrow. "
-                    "The proposal and presentation deck are top priority, but you also need to "
-                    "review a teammate's design doc and prepare meeting notes for leadership.",
-        "tasks": [
-            {
-                "id": 0,
-                "name": "Client Proposal",
-                "description": "Draft a 5-page proposal for the enterprise client including pricing tiers, timeline, and integration plan.",
-                "effort": 0.40,
-                "progress": 0.0,
-                "deadline": 8,
-                "importance": 0.7,
-            },
-            {
-                "id": 1,
-                "name": "Pitch Deck",
-                "description": "Create a 15-slide presentation with product demos, ROI projections, and competitive analysis.",
-                "effort": 0.35,
-                "progress": 0.0,
-                "deadline": 10,
-                "importance": 0.8,
-            },
-            {
-                "id": 2,
-                "name": "Design Review",
-                "description": "Review the UX team's redesign mockups for the dashboard. Provide written feedback on usability and alignment with product goals.",
-                "effort": 0.25,
-                "progress": 0.0,
-                "deadline": 14,
-                "importance": 0.5,
-            },
-            {
-                "id": 3,
-                "name": "Leadership Notes",
-                "description": "Summarize this week's sprint outcomes and blockers for the Monday leadership sync.",
-                "effort": 0.20,
-                "progress": 0.0,
-                "deadline": 18,
-                "importance": 0.4,
-            },
-        ],
-        "meetings": [4, 12],
-        "initial_energy": 0.7,
-    },
-    "hard": {
-        "scenario": "You are a senior engineer on a critical release day. "
-                    "The system architecture redesign is the highest priority, but two production "
-                    "bugs are blocking users, docs need updating, and test coverage is behind.",
-        "tasks": [
-            {
-                "id": 0,
-                "name": "Architecture Redesign",
-                "description": "Refactor the authentication service from monolith to microservice pattern. Requires deep focus: redesign API contracts, update database schema, and write migration scripts.",
-                "effort": 0.80,
-                "progress": 0.0,
-                "deadline": 16,
-                "importance": 0.9,
-            },
-            {
-                "id": 1,
-                "name": "Fix: Login Timeout",
-                "description": "Users on slow connections get a 504 timeout during OAuth handshake. Root cause is likely the retry logic in the auth middleware.",
-                "effort": 0.15,
-                "progress": 0.0,
-                "deadline": 6,
-                "importance": 0.5,
-            },
-            {
-                "id": 2,
-                "name": "Fix: CSV Export",
-                "description": "The data export endpoint crashes on records with Unicode characters in the notes field. Need to fix encoding in the serializer.",
-                "effort": 0.15,
-                "progress": 0.0,
-                "deadline": 10,
-                "importance": 0.4,
-            },
-            {
-                "id": 3,
-                "name": "API Documentation",
-                "description": "Update the REST API docs to reflect the new v3 endpoints. Add request/response examples and deprecation notices for v2.",
-                "effort": 0.20,
-                "progress": 0.0,
-                "deadline": 14,
-                "importance": 0.3,
-            },
-            {
-                "id": 4,
-                "name": "Integration Tests",
-                "description": "Write end-to-end tests for the payment flow covering Stripe webhook handling, refund processing, and receipt generation.",
-                "effort": 0.20,
-                "progress": 0.0,
-                "deadline": 18,
-                "importance": 0.6,
-            },
-        ],
-        "meetings": [6],
```
|
| 179 |
-
"initial_energy": 0.4,
|
| 180 |
-
},
|
| 181 |
}
```diff
 
 # ---------------------------------------------------------------------------
-#
 # ---------------------------------------------------------------------------
```
```diff
 class RhythmEnvironment(Environment):
     """
-    The agent manages
     """
 
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
 
@@ -221,30 +191,29 @@ class RhythmEnvironment(Environment):
     def __init__(self) -> None:
         super().__init__()
         self._state = RhythmState()
-        self._missed_deadlines: Set[int] = set()
-        self._total_energy: float = 0.0
-        self._total_stress: float = 0.0
-        self._steps_working: int = 0
-        self._switch_count: int = 0
         self._timestep: int = 0
 
     def get_metadata(self) -> EnvironmentMetadata:
         return EnvironmentMetadata(
             name="RhythmEnv",
             description=(
             ),
-            version="0.
         )
 
     # ------------------------------------------------------------------
```
```diff
@@ -257,45 +226,51 @@ class RhythmEnvironment(Environment):
         episode_id: Optional[str] = None,
         **kwargs: Any,
     ) -> RhythmObservation:
         else:
-        self._consecutive_breaks = 0
-        self._completed_tasks = set()
-        self._missed_deadlines = set()
-        self._total_energy = 0.0
-        self._total_stress = 0.0
-        self._steps_working = 0
-        self._switch_count = 0
         self._timestep = 0
 
         self._state = RhythmState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
             timestep=0,
         )
 
-        return self._make_observation(reward=0.0, done=False,
 
     # ------------------------------------------------------------------
     # step
```
```diff
@@ -307,181 +282,101 @@ class RhythmEnvironment(Environment):
         timeout_s: Optional[float] = None,
         **kwargs: Any,
     ) -> RhythmObservation:
-            if task["progress"] >= task["effort"] and self._current_task_id not in self._completed_tasks:
-                self._completed_tasks.add(self._current_task_id)
-                completed_this_step.append(self._current_task_id)
-                # Auto-clear: agent becomes idle after finishing a task,
-                # so they can START_TASK a new one without needing to break/switch.
-                self._current_task_id = None
-
-            self._energy = max(0.0, self._energy - ENERGY_WORK_DRAIN)
-            self._steps_working += 1
-
-        # --- Check deadlines ---
-        new_missed: List[int] = []
-        for t in self._tasks:
-            tid = t["id"]
-            if tid not in self._completed_tasks and tid not in self._missed_deadlines:
-                if self._timestep > t["deadline"]:
-                    self._missed_deadlines.add(tid)
-                    new_missed.append(tid)
-                    self._stress = min(1.0, self._stress + STRESS_DEADLINE_MISS)
-
-        # --- Stress from approaching deadlines ---
-        for t in self._tasks:
-            tid = t["id"]
-            if tid not in self._completed_tasks and tid not in self._missed_deadlines:
-                if 0 < t["deadline"] - self._timestep <= APPROACHING_DEADLINE_WINDOW:
-                    self._stress = min(1.0, self._stress + STRESS_APPROACHING)
-
-        # --- Stress relief from completion ---
-        for _ in completed_this_step:
-            self._stress = max(0.0, self._stress - STRESS_COMPLETION_RELIEF)
-
-        # --- Advance timestep ---
         self._timestep += 1
-            reward_breakdown["progress_reward"] = round(r, 4)
-
-        # Completion bonus
-        for tid in completed_this_step:
-            bonus = self._tasks[tid]["importance"] * COMPLETION_BONUS_SCALE
-            reward += bonus
-            reward_breakdown["completion_bonus"] = round(
-                reward_breakdown.get("completion_bonus", 0.0) + bonus, 4
-            )
-
-        # Stress penalty
-        stress_pen = -self._stress * STRESS_PENALTY_RATE
-        reward += stress_pen
-        reward_breakdown["stress_penalty"] = round(stress_pen, 4)
-
-        # Deadline miss penalty
-        if new_missed:
-            dp = -DEADLINE_MISS_PENALTY * len(new_missed)
-            reward += dp
-            reward_breakdown["deadline_penalty"] = round(dp, 4)
-
-        # Switch penalty
-        if switched:
-            reward -= SWITCH_PENALTY
-            reward_breakdown["switch_penalty"] = round(-SWITCH_PENALTY, 4)
-
-        # Idle penalty
-        if not is_meeting and is_idle:
-            reward -= IDLE_PENALTY
-            reward_breakdown["idle_penalty"] = round(-IDLE_PENALTY, 4)
-
-        # Break spam penalty
-        if not is_meeting and action.action_type == ActionType.TAKE_BREAK:
-            spam = -BREAK_SPAM_PENALTY * max(0, self._consecutive_breaks - MAX_FREE_BREAKS)
-            if spam < 0:
-                reward += spam
-                reward_breakdown["break_spam_penalty"] = round(spam, 4)
-
-        # Mode bonus
-        mode = self._compute_mode()
-        mode_bonus = 0.0
-        if mode == "deep_work":
-            mode_bonus = DEEP_WORK_BONUS
-        elif mode == "execution":
-            mode_bonus = EXECUTION_BONUS
-        if mode_bonus > 0:
-            reward += mode_bonus
-            reward_breakdown["mode_bonus"] = round(mode_bonus, 4)
-
-        # Clamp reward (wide enough to preserve completion signal)
-        reward = max(-2.0, min(2.0, round(reward, 4)))
-
-        # --- Done? ---
-        all_tasks_completed = len(self._completed_tasks) == len(self._tasks)
-        done = self._timestep >= MAX_STEPS or all_tasks_completed
-
-        # --- Final grading ---
         if done:
             final_score = self._grade_episode()
             reward_breakdown["final_score"] = round(final_score, 4)
 
-        # --- Update state ---
         self._state.timestep = self._timestep
 
         return self._make_observation(
-            reward=reward,
         )
 
     # ------------------------------------------------------------------
```
```diff
@@ -493,155 +388,166 @@ class RhythmEnvironment(Environment):
         return self._state
 
     # ------------------------------------------------------------------
-    #
     # ------------------------------------------------------------------
```
```diff
     def _grade_episode(self) -> float:
         """Compute final episode score in [0, 1]."""
-        total_importance = sum(t["importance"] for t in self._tasks)
-        completed_importance = sum(
-            t["importance"]
-            for t in self._tasks
-            if t["id"] in self._completed_tasks
-        )
-        completion_score = (
-            completed_importance / total_importance if total_importance > 0 else 0.0
-        )
 
-        deadline_score = met_importance / total_importance if total_importance > 0 else 0.0
-
-        # 3. Efficiency score (using realistic optimal based on average energy)
-        steps_elapsed = max(self._timestep, 1)
-        total_effort = sum(
-            t["effort"]
-            for t in self._tasks
-            if t["id"] in self._completed_tasks
-        )
-        if total_effort > 0 and self._steps_working > 0:
-            avg_energy = self._total_energy / steps_elapsed
-            effective_rate = PROGRESS_RATE * max(avg_energy, 0.3)
-            optimal_steps = total_effort / effective_rate
-            efficiency_score = min(1.0, optimal_steps / self._steps_working)
-        else:
-            # No tasks completed = zero efficiency
-            efficiency_score = 0.0
 
         score = (
-            0.
-            + 0.
-            + 0.
-            + 0.
-            + 0.
-            - switch_penalty
         )
         return max(0.0, min(1.0, score))
```
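The importance-weighted completion term of the removed `_grade_episode` can be exercised on its own; this sketch reproduces the `completed_importance / total_importance` ratio shown in the hunk (the task dicts here are illustrative):

```python
# Standalone sketch of the removed completion-score term: the
# importance-weighted fraction of tasks the agent finished.
def completion_score(tasks, completed_ids):
    """Return completed importance divided by total importance, in [0, 1]."""
    total = sum(t["importance"] for t in tasks)
    done = sum(t["importance"] for t in tasks if t["id"] in completed_ids)
    return done / total if total > 0 else 0.0

tasks = [
    {"id": 0, "importance": 0.9},
    {"id": 1, "importance": 0.5},
    {"id": 2, "importance": 0.4},
]
# Finishing only the most important task still earns half the total weight
score = completion_score(tasks, {0})
```

This makes the shaping explicit: skipping low-importance tasks costs proportionally little, which is why the old grader combined it with deadline and efficiency terms.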
```diff
@@ -649,30 +555,22 @@ class RhythmEnvironment(Environment):
         self,
         reward: float,
         done: bool,
     ) -> RhythmObservation:
-        """Build the observation returned to the agent."""
-        task_infos = [
-            TaskInfo(
-                id=t["id"],
-                name=t["name"],
-                description=t.get("description", ""),
-                effort=round(t["effort"], 4),
-                progress=round(t["progress"], 4),
-                deadline=t["deadline"],
-                importance=t["importance"],
-            )
-            for t in self._tasks
-        ]
         return RhythmObservation(
             timestep=self._timestep,
             remaining_steps=MAX_STEPS - self._timestep,
-            reward_breakdown=reward_breakdown,
             reward=reward,
             done=done,
         )
```
```diff
 # LICENSE file in the root directory of this source tree.
 
 """
+RhythmEnv Life Simulator – Environment Implementation.
 
+A holistic life resource management RL environment. The agent balances
+5 life meters across a 7-day week (28 steps) while hidden personality
+profiles secretly control how actions affect meters and how reward is
+computed. The agent must discover these hidden dynamics through experience.
 
+1 episode = 1 week, 1 step = 1 time slot (4 per day), 28 steps total.
 """
 
+import random
+from copy import deepcopy
+from typing import Any, Dict, List, Optional
 from uuid import uuid4
 
 from openenv.core.env_server import Environment
 from openenv.core.env_server.types import EnvironmentMetadata
 
 try:
+    from ..models import ActionType, RhythmAction, RhythmObservation, RhythmState
 except (ImportError, ModuleNotFoundError):
+    from models import ActionType, RhythmAction, RhythmObservation, RhythmState
 
+# ---------------------------------------------------------------------------
+# Constants
+# ---------------------------------------------------------------------------
+
+MAX_STEPS = 28
+DAYS = 7
+SLOTS_PER_DAY = 4
+SLOT_NAMES = ["morning", "afternoon", "evening", "night"]
+METERS = ["vitality", "cognition", "progress", "serenity", "connection"]
+
+EVENT_PROBABILITY = 0.08
+CRITICAL_THRESHOLD = 0.1
+CRITICAL_PENALTY = -0.3
+REWARD_SCALE = 15.0
 
 # ---------------------------------------------------------------------------
+# Action-Effect Matrix (base deltas per action on each meter)
 # ---------------------------------------------------------------------------
 
+ACTION_EFFECTS: Dict[str, Dict[str, float]] = {
+    "deep_work": {"vitality": -0.12, "cognition": -0.10, "progress": 0.18, "serenity": -0.05, "connection": 0.00},
+    "admin_work": {"vitality": -0.06, "cognition": -0.05, "progress": 0.08, "serenity": -0.03, "connection": 0.00},
+    "learn": {"vitality": -0.08, "cognition": -0.08, "progress": 0.12, "serenity": 0.02, "connection": 0.00},
+    "sleep": {"vitality": 0.20, "cognition": 0.10, "progress": 0.00, "serenity": 0.05, "connection": 0.00},
+    "exercise": {"vitality": 0.12, "cognition": 0.05, "progress": 0.00, "serenity": 0.08, "connection": 0.00},
+    "meditate": {"vitality": 0.03, "cognition": 0.08, "progress": 0.00, "serenity": 0.15, "connection": 0.00},
+    "family_time": {"vitality": -0.04, "cognition": -0.02, "progress": 0.00, "serenity": 0.06, "connection": 0.15},
+    "socialize": {"vitality": -0.06, "cognition": -0.03, "progress": 0.00, "serenity": 0.04, "connection": 0.12},
+    "me_time": {"vitality": 0.05, "cognition": 0.03, "progress": 0.00, "serenity": 0.10, "connection": -0.02},
+    "binge_watch": {"vitality": 0.02, "cognition": -0.05, "progress": -0.02, "serenity": 0.06, "connection": -0.03},
```
```diff
 }
```
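Each ACTION_EFFECTS entry is a vector of base deltas that `step()` adds to the current meter values and clamps into [0, 1] (its step 7). A minimal standalone sketch of that update:

```python
# Minimal sketch of the clamped meter update from RhythmEnvironment.step():
# add each action's base delta, then clamp the result into [0.0, 1.0].
METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

DEEP_WORK = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
             "serenity": -0.05, "connection": 0.00}

def apply_effects(meters: dict, effects: dict) -> dict:
    """Add each delta and clamp the result into [0.0, 1.0]."""
    return {m: max(0.0, min(1.0, meters[m] + effects[m])) for m in METERS}

state = {"vitality": 0.7, "cognition": 0.7, "progress": 0.0,
         "serenity": 0.7, "connection": 0.5}
state = apply_effects(state, DEEP_WORK)
# progress rises while vitality, cognition, and serenity drain
```

Clamping means a meter already at 1.0 wastes further gains, and a meter at 0.0 absorbs no further losses beyond the critical-threshold penalty applied later in `step()`.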
```diff
 
 # ---------------------------------------------------------------------------
+# Time-of-Day Multipliers
+# ---------------------------------------------------------------------------
+
+TIME_MULTIPLIERS: Dict[int, Dict[str, float]] = {
+    0: {"cognition_gain": 1.2, "vitality_drain": 0.8},  # Morning
+    1: {"cognition_gain": 1.0, "vitality_drain": 1.0},  # Afternoon
+    2: {"cognition_gain": 0.8, "vitality_drain": 1.1},  # Evening
+    3: {"cognition_gain": 0.6, "vitality_drain": 1.3},  # Night
+}
+
+# ---------------------------------------------------------------------------
+# Random Events
 # ---------------------------------------------------------------------------
 
+EVENT_EFFECTS: Dict[str, Dict[str, float]] = {
+    "prod_crash": {"vitality": -0.08, "cognition": -0.10, "progress": -0.10, "serenity": -0.15, "connection": 0.00},
+    "family_emergency": {"vitality": -0.05, "cognition": -0.08, "progress": 0.00, "serenity": -0.12, "connection": -0.10},
+    "illness": {"vitality": -0.20, "cognition": -0.10, "progress": 0.00, "serenity": -0.05, "connection": 0.00},
+    "good_news": {"vitality": 0.05, "cognition": 0.03, "progress": 0.00, "serenity": 0.10, "connection": 0.05},
+}
+
+EVENT_NAMES: List[str] = list(EVENT_EFFECTS.keys())
+
+# ---------------------------------------------------------------------------
+# Hidden Personality Profiles
+# ---------------------------------------------------------------------------
+
+PROFILES: List[Dict[str, Any]] = [
+    {
+        "name": "introvert_morning",
+        "social_vitality_multiplier": 3.0,
+        "morning_cognition_bonus": 2.0,
+        "evening_night_cognition_bonus": None,
+        "morning_penalty": None,
+        "binge_shame": True,
+        "progress_serenity_bonus": 0.0,
+        "idle_serenity_decay": 0.0,
+        "vitality_decay_rate": 0.0,
+        "stress_tolerance": 0.3,
+        "event_impact_multiplier": 1.0,
+        "connection_decay_rate": 0.01,
+        "solo_serenity_bonus": 0.10,
+        "social_connection_multiplier": 1.0,
+        "social_serenity_bonus": 0.0,
+        "work_vitality_recovery": 0.0,
+        "initial_meters": {
+            "vitality": 0.7, "cognition": 0.7, "progress": 0.0,
+            "serenity": 0.7, "connection": 0.5,
+        },
+        "reward_weights": {
+            "vitality": 0.05, "cognition": 0.05, "progress": 0.20,
+            "serenity": 0.60, "connection": 0.10,
+        },
+    },
+    {
+        "name": "extrovert_night_owl",
+        "social_vitality_multiplier": 0.2,
+        "morning_cognition_bonus": None,
+        "evening_night_cognition_bonus": 1.8,
+        "morning_penalty": 0.4,
+        "binge_shame": False,
+        "progress_serenity_bonus": 0.0,
+        "idle_serenity_decay": 0.0,
+        "vitality_decay_rate": 0.0,
+        "stress_tolerance": 0.2,
+        "event_impact_multiplier": 0.8,
+        "connection_decay_rate": 0.01,
+        "solo_serenity_bonus": 0.0,
+        "social_connection_multiplier": 2.0,
+        "social_serenity_bonus": 0.06,
+        "work_vitality_recovery": 0.0,
+        "initial_meters": {
+            "vitality": 0.7, "cognition": 0.7, "progress": 0.0,
+            "serenity": 0.7, "connection": 0.5,
+        },
+        "reward_weights": {
+            "vitality": 0.05, "cognition": 0.05, "progress": 0.10,
+            "serenity": 0.05, "connection": 0.75,
+        },
+    },
+    {
+        "name": "workaholic_stoic",
+        "social_vitality_multiplier": 1.0,
+        "morning_cognition_bonus": None,
+        "evening_night_cognition_bonus": None,
+        "morning_penalty": None,
+        "binge_shame": False,
+        "progress_serenity_bonus": 0.10,
+        "idle_serenity_decay": 0.10,
+        "vitality_decay_rate": 0.04,
+        "stress_tolerance": 0.15,
+        "event_impact_multiplier": 0.5,
+        "connection_decay_rate": 0.02,
+        "solo_serenity_bonus": 0.0,
+        "social_connection_multiplier": 1.0,
+        "social_serenity_bonus": 0.0,
+        "work_vitality_recovery": 0.06,
+        "initial_meters": {
+            "vitality": 0.7, "cognition": 0.7, "progress": 0.0,
+            "serenity": 0.7, "connection": 0.5,
+        },
+        "reward_weights": {
+            "vitality": 0.05, "cognition": 0.05, "progress": 0.70,
+            "serenity": 0.10, "connection": 0.10,
+        },
+    },
+]
+
+PROFILE_MAP: Dict[str, Dict[str, Any]] = {p["name"]: p for p in PROFILES}
+
+# Social actions for modifier checks
+SOCIAL_ACTIONS = {"family_time", "socialize"}
+IDLE_ACTIONS = {"me_time", "binge_watch", "sleep"}
```
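Each profile carries `reward_weights` over the five meters; the body of `_compute_reward` is not part of this hunk, so the weighted-sum form below (and the use of `REWARD_SCALE = 15.0`) is an assumption offered only to illustrate how per-profile weights could turn meter deltas into a scalar:

```python
# Hypothetical sketch: _compute_reward is outside this hunk, so treating it
# as a REWARD_SCALE-scaled weighted sum of per-meter deltas is an assumption.
REWARD_SCALE = 15.0

WEIGHTS_WORKAHOLIC = {"vitality": 0.05, "cognition": 0.05, "progress": 0.70,
                      "serenity": 0.10, "connection": 0.10}

def weighted_reward(deltas: dict, weights: dict, scale: float = REWARD_SCALE) -> float:
    """Weighted sum of per-meter deltas, scaled to a step reward."""
    return scale * sum(weights[m] * deltas[m] for m in weights)

# deep_work deltas under the workaholic_stoic weights: progress dominates
deltas = {"vitality": -0.12, "cognition": -0.10, "progress": 0.18,
          "serenity": -0.05, "connection": 0.00}
r = weighted_reward(deltas, WEIGHTS_WORKAHOLIC)
```

Whatever the exact form, the design point stands: the same action earns very different rewards under different hidden profiles, which is what the agent must infer.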
```diff
 
 
 class RhythmEnvironment(Environment):
     """
+    Life Simulator RL Environment.
 
+    The agent manages 5 life meters (Vitality, Cognition, Progress, Serenity,
+    Connection) across a 7-day week. Hidden personality profiles secretly
+    control how actions affect meters and how reward is computed. The agent
+    must discover these hidden dynamics through experience.
     """
 
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
```
```diff
     def __init__(self) -> None:
         super().__init__()
         self._state = RhythmState()
+        self._rng = random.Random(0)
+        self._profile: Dict[str, Any] = PROFILES[0]
+        # Meters
+        self._vitality: float = 0.8
+        self._cognition: float = 0.7
+        self._progress: float = 0.0
+        self._serenity: float = 0.7
+        self._connection: float = 0.5
+        # Tracking
         self._timestep: int = 0
+        self._crash_count: int = 0
+        self._total_reward: float = 0.0
+        self._recent_actions: list = []
 
     def get_metadata(self) -> EnvironmentMetadata:
         return EnvironmentMetadata(
             name="RhythmEnv",
             description=(
+                "Life Simulator – a holistic resource management RL environment "
+                "where an agent balances 5 life meters across a 7-day week "
+                "with hidden personality profiles."
             ),
+            version="0.2.0",
         )
 
     # ------------------------------------------------------------------
```
```diff
         episode_id: Optional[str] = None,
         **kwargs: Any,
     ) -> RhythmObservation:
+        # Determine seed
+        if seed is not None:
+            effective_seed = seed
+        else:
+            effective_seed = hash(episode_id or str(uuid4())) & 0x7FFFFFFF
+
+        self._rng = random.Random(effective_seed)
 
+        # Profile selection: explicit kwarg or seed-based
+        profile_name = kwargs.get("profile")
+        if profile_name and profile_name in PROFILE_MAP:
+            self._profile = deepcopy(PROFILE_MAP[profile_name])
         else:
+            profile_index = effective_seed % len(PROFILES)
+            self._profile = deepcopy(PROFILES[profile_index])
+
+        # Initialize meters from profile defaults
+        initial = self._profile["initial_meters"]
+        self._vitality = initial["vitality"]
+        self._cognition = initial["cognition"]
+        self._progress = initial["progress"]
+        self._serenity = initial["serenity"]
+        self._connection = initial["connection"]
+
+        # Reset tracking
         self._timestep = 0
+        self._crash_count = 0
+        self._total_reward = 0.0
+        self._recent_actions = []
 
         self._state = RhythmState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
+            profile_name=self._profile["name"],
             timestep=0,
+            day=0,
+            slot=0,
+            vitality=self._vitality,
+            cognition=self._cognition,
+            progress=self._progress,
+            serenity=self._serenity,
+            connection=self._connection,
         )
 
+        return self._make_observation(reward=0.0, done=False, active_event=None)
```
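The seed-to-profile mapping in `reset()` can be exercised standalone. One caveat worth knowing: `hash()` on a string is salted per process in Python 3 (unless `PYTHONHASHSEED` is fixed), so the episode-id fallback is only reproducible within a single process; an explicit `seed` is fully deterministic. A sketch, with the profile list reduced to its names:

```python
# Standalone sketch of reset()'s seeding logic: an explicit seed maps
# deterministically onto one of the three hidden profiles via modulo.
# Caveat: hash() on str is salted per process in Python 3, so the
# episode-id fallback is only stable within one process.
import random
from uuid import uuid4

PROFILE_NAMES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]

def select_profile(seed=None, episode_id=None):
    if seed is not None:
        effective_seed = seed
    else:
        effective_seed = hash(episode_id or str(uuid4())) & 0x7FFFFFFF
    rng = random.Random(effective_seed)
    return PROFILE_NAMES[effective_seed % len(PROFILE_NAMES)], rng

name, rng = select_profile(seed=7)  # 7 % 3 == 1
```

For the GRPO pipeline this matters because episode replay in `env_reward` relies on passing the same seed to reproduce the same profile and event stream.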
|
| 274 |
|
| 275 |
# ------------------------------------------------------------------
|
| 276 |
# step
|
|
|
|
| 282 |
timeout_s: Optional[float] = None,
|
| 283 |
**kwargs: Any,
|
| 284 |
) -> RhythmObservation:
|
| 285 |
+
slot = self._timestep % SLOTS_PER_DAY
|
| 286 |
+
day = self._timestep // SLOTS_PER_DAY
|
| 287 |
+
action_name = action.action_type.value
|
| 288 |
+
|
| 289 |
+
# --- 1. Roll and apply event ---
|
| 290 |
+
active_event = self._roll_event()
|
| 291 |
+
if active_event:
|
| 292 |
+
self._apply_event(active_event)
|
| 293 |
+
|
| 294 |
+
# --- 2. Get base action effects ---
|
| 295 |
+
effects = dict(ACTION_EFFECTS[action_name])
|
| 296 |
+
|
| 297 |
+
# --- 2b. Repetition dampening ---
|
| 298 |
+
repeat_count = self._recent_actions[-3:].count(action_name)
|
| 299 |
+
if repeat_count > 0:
|
| 300 |
+
dampening = 1.0 - 0.25 * repeat_count # 0.75, 0.50, 0.25
|
| 301 |
+
for meter in METERS:
|
| 302 |
+
if effects[meter] > 0:
|
| 303 |
+
effects[meter] *= dampening
|
| 304 |
+
|
| 305 |
+
# --- 3. Apply time-of-day multipliers (SLEEP bypasses) ---
|
| 306 |
+
if action_name != "sleep":
|
| 307 |
+
effects = self._apply_time_multipliers(effects, slot)
|
| 308 |
+
|
| 309 |
+
# --- 4. Apply profile modifiers ---
|
| 310 |
+
effects = self._apply_profile_modifiers(effects, action_name, slot)
|
| 311 |
+
|
| 312 |
+
# --- 5. Apply global vitality factor (low vitality reduces positive effects) ---
|
| 313 |
+
vitality_factor = 0.5 + 0.5 * self._vitality
|
| 314 |
+
for meter in METERS:
|
| 315 |
+
if meter != "vitality" and effects[meter] > 0:
|
| 316 |
+
effects[meter] *= vitality_factor
|
| 317 |
+
|
| 318 |
+
# --- 6. Apply passive decays ---
|
| 319 |
+
self._apply_passive_decays()
|
| 320 |
+
|
| 321 |
+
# --- 7. Update meters and track deltas ---
|
| 322 |
+
deltas: Dict[str, float] = {}
|
| 323 |
+
for meter in METERS:
|
| 324 |
+
old_val = getattr(self, f"_{meter}")
|
| 325 |
+
new_val = max(0.0, min(1.0, old_val + effects[meter]))
|
| 326 |
+
deltas[meter] = new_val - old_val
|
| 327 |
+
            setattr(self, f"_{meter}", new_val)

        # --- 8. Compute reward ---
        reward = self._compute_reward(deltas)

        # --- 9. Check critical thresholds ---
        for meter in METERS:
            if getattr(self, f"_{meter}") < CRITICAL_THRESHOLD:
                reward += CRITICAL_PENALTY
                self._crash_count += 1

        # Clamp reward
        reward = max(-3.0, min(3.0, round(reward, 4)))
        self._total_reward += reward

        # --- 10. Advance timestep ---
        self._timestep += 1
        self._recent_actions.append(action_name)
        new_day = self._timestep // SLOTS_PER_DAY
        new_slot = self._timestep % SLOTS_PER_DAY

        # --- 11. Check done ---
        done = self._timestep >= MAX_STEPS

        # --- 12. Build reward breakdown ---
        reward_breakdown: Dict[str, float] = {}
        for meter in METERS:
            reward_breakdown[f"{meter}_delta"] = round(deltas[meter], 4)
        if active_event:
            reward_breakdown["event"] = 1.0

        # --- 13. Grade if done ---
        if done:
            final_score = self._grade_episode()
            reward_breakdown["final_score"] = round(final_score, 4)

        # --- 14. Update state ---
        self._state.step_count = self._timestep
        self._state.timestep = self._timestep
        self._state.day = new_day
        self._state.slot = new_slot
        self._state.vitality = round(self._vitality, 4)
        self._state.cognition = round(self._cognition, 4)
        self._state.progress = round(self._progress, 4)
        self._state.serenity = round(self._serenity, 4)
        self._state.connection = round(self._connection, 4)
        self._state.active_event = active_event

        return self._make_observation(
            reward=reward,
            done=done,
            active_event=active_event,
            reward_breakdown=reward_breakdown,
        )

    # ------------------------------------------------------------------
    @property
    def state(self):
        return self._state

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    def _roll_event(self) -> Optional[str]:
        """Roll for a random event this step."""
        if self._rng.random() < EVENT_PROBABILITY:
            return self._rng.choice(EVENT_NAMES)
        return None

    def _apply_event(self, event_name: str) -> None:
        """Apply event effects to meters, modified by profile."""
        effects = EVENT_EFFECTS[event_name]
        multiplier = self._profile["event_impact_multiplier"]
        for meter in METERS:
            delta = effects[meter]
            # Only apply multiplier to negative effects
            if delta < 0:
                delta *= multiplier
            old_val = getattr(self, f"_{meter}")
            new_val = max(0.0, min(1.0, old_val + delta))
            setattr(self, f"_{meter}", new_val)

    def _apply_time_multipliers(
        self, effects: Dict[str, float], slot: int
    ) -> Dict[str, float]:
        """Apply time-of-day multipliers to action effects."""
        multipliers = TIME_MULTIPLIERS[slot]

        for meter in effects:
            if meter == "cognition" and effects[meter] > 0:
                effects[meter] *= multipliers["cognition_gain"]
            elif meter == "vitality" and effects[meter] < 0:
                effects[meter] *= multipliers["vitality_drain"]

        return effects

    def _apply_profile_modifiers(
        self, effects: Dict[str, float], action_name: str, slot: int
    ) -> Dict[str, float]:
        """Apply hidden profile modifiers to action effects."""
        profile = self._profile

        # Social vitality drain multiplier
        if action_name in SOCIAL_ACTIONS and effects["vitality"] < 0:
            effects["vitality"] *= profile["social_vitality_multiplier"]

        # Introvert morning cognition bonus
        bonus = profile.get("morning_cognition_bonus")
        if bonus and slot == 0:
            if effects["cognition"] > 0:
                effects["cognition"] *= bonus
            if effects["progress"] > 0:
                effects["progress"] *= bonus

        # Extrovert evening/night cognition bonus
        bonus = profile.get("evening_night_cognition_bonus")
        if bonus and slot in (2, 3):
            if effects["cognition"] > 0:
                effects["cognition"] *= bonus
            if effects["progress"] > 0:
                effects["progress"] *= bonus

        # Extrovert morning penalty
        penalty = profile.get("morning_penalty")
        if penalty and slot == 0:
            if effects["cognition"] > 0:
                effects["cognition"] *= penalty
            if effects["progress"] > 0:
                effects["progress"] *= penalty

        # Binge shame spiral
        if profile.get("binge_shame") and action_name == "binge_watch":
            effects["serenity"] -= 0.15
            effects["cognition"] -= 0.06

        # Workaholic: progress-producing actions give serenity bonus
        psb = profile.get("progress_serenity_bonus", 0.0)
        if psb > 0 and effects["progress"] > 0:
            effects["serenity"] += psb

        # Workaholic: idle actions drain serenity
        isd = profile.get("idle_serenity_decay", 0.0)
        if isd > 0 and action_name in IDLE_ACTIONS:
            effects["serenity"] -= isd

        # Solo recharge: introvert gets serenity from alone time
        ssb = profile.get("solo_serenity_bonus", 0.0)
        if ssb > 0 and action_name in ("me_time", "meditate"):
            effects["serenity"] += ssb

        # Social connection multiplier: extrovert gets more connection from socializing
        scm = profile.get("social_connection_multiplier", 1.0)
        if scm != 1.0 and action_name in SOCIAL_ACTIONS and effects["connection"] > 0:
            effects["connection"] *= scm

        # Social serenity bonus: extrovert gets serenity from socializing
        ssrb = profile.get("social_serenity_bonus", 0.0)
        if ssrb > 0 and action_name in SOCIAL_ACTIONS:
            effects["serenity"] += ssrb

        # Work vitality recovery: workaholic gets vitality from productive work
        wvr = profile.get("work_vitality_recovery", 0.0)
        if wvr > 0 and action_name in ("deep_work", "learn", "admin_work"):
            effects["vitality"] += wvr

        # Low serenity amplification (stress spiral)
        if self._serenity < profile.get("stress_tolerance", 0.3):
            for meter in effects:
                if effects[meter] < 0:
                    effects[meter] *= 1.3

        return effects

    def _apply_passive_decays(self) -> None:
        """Apply per-step passive meter decays."""
        # Connection always decays if not actively maintained
        decay = self._profile["connection_decay_rate"]
        self._connection = max(0.0, self._connection - decay)

        # Workaholic extra vitality decay
        vd = self._profile.get("vitality_decay_rate", 0.0)
        if vd > 0:
            self._vitality = max(0.0, self._vitality - vd)

    def _compute_reward(self, deltas: Dict[str, float]) -> float:
        """Compute reward as hidden-weighted sum of meter deltas."""
        weights = self._profile["reward_weights"]
        reward = sum(deltas[m] * weights[m] for m in METERS)
        return reward * REWARD_SCALE
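Because the reward weights live in the hidden profile, the same meter deltas earn different rewards depending on which persona was sampled. A standalone sketch of that weighted sum, using made-up weight tables and a made-up `REWARD_SCALE` (the real values are defined elsewhere in `rhythm_environment.py`):

```python
# Sketch of the hidden-weighted reward: identical meter deltas scored
# under two hypothetical profile weight tables (values are illustrative
# assumptions, not the shipped profile constants).
METERS = ["vitality", "cognition", "progress", "serenity", "connection"]
REWARD_SCALE = 2.0  # assumed scale for this sketch

deltas = {"vitality": -0.10, "cognition": 0.05, "progress": 0.15,
          "serenity": -0.05, "connection": 0.0}

profiles = {
    "workaholic": {"vitality": 0.5, "cognition": 1.0, "progress": 2.0,
                   "serenity": 0.5, "connection": 0.5},
    "extrovert":  {"vitality": 1.0, "cognition": 1.0, "progress": 0.8,
                   "serenity": 1.0, "connection": 2.0},
}

for name, weights in profiles.items():
    reward = sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
    print(name, round(reward, 4))
```

Under these assumed weights a deep-work-style step (progress up, vitality down) scores roughly 0.55 for the workaholic but only about 0.04 for the extrovert, which is the signal a GRPO-trained policy has to infer from.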

    def _grade_episode(self) -> float:
        """Compute final episode score in [0, 1]."""
        meters = {m: getattr(self, f"_{m}") for m in METERS}

        # 1. Meter balance (0.30): high mean, low variance
        values = list(meters.values())
        mean_meter = sum(values) / len(values)
        variance = sum((v - mean_meter) ** 2 for v in values) / len(values)
        balance_score = max(0.0, mean_meter - variance)

        # 2. No crashes (0.25): fraction of steps without critical meters
        steps = max(self._timestep, 1)
        crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))

        # 3. Progress made (0.20)
        progress_score = self._progress

        # 4. Connection maintained (0.15)
        connection_score = self._connection

        # 5. Efficiency (0.10): normalized average reward
        avg_reward = self._total_reward / steps
        efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))

        score = (
            0.30 * balance_score
            + 0.25 * crash_free_ratio
            + 0.20 * progress_score
            + 0.15 * connection_score
            + 0.10 * efficiency_score
        )
        return max(0.0, min(1.0, score))
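As a sanity check, the five-component grade above can be worked by hand. The sketch below mirrors the formula with assumed end-of-episode values (28 steps, no critical crashes, total reward 14.0; none of these numbers come from a real run):

```python
# Worked example of the episode grading formula, standalone.
# All input values below are assumed for illustration.
meters = {"vitality": 0.6, "cognition": 0.7, "progress": 0.5,
          "serenity": 0.6, "connection": 0.4}

values = list(meters.values())
mean_meter = sum(values) / len(values)          # 0.56
variance = sum((v - mean_meter) ** 2 for v in values) / len(values)
balance_score = max(0.0, mean_meter - variance)  # mean penalized by imbalance

steps, crash_count, total_reward = 28, 0, 14.0
crash_free_ratio = 1.0 - crash_count / (steps * len(meters))
efficiency_score = max(0.0, min(1.0, (total_reward / steps + 1.0) / 2.0))

score = (0.30 * balance_score + 0.25 * crash_free_ratio
         + 0.20 * meters["progress"] + 0.15 * meters["connection"]
         + 0.10 * efficiency_score)
score = max(0.0, min(1.0, score))
print(round(score, 4))
```

For these inputs the grade lands just under 0.65; the variance term is what keeps a lopsided meter profile from scoring as well as a balanced one with the same mean.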

    def _make_observation(
        self,
        reward: float,
        done: bool,
        active_event: Optional[str],
        reward_breakdown: Optional[Dict[str, float]] = None,
    ) -> RhythmObservation:
        """Build the observation returned to the agent (hides profile)."""
        return RhythmObservation(
            timestep=self._timestep,
            day=self._timestep // SLOTS_PER_DAY,
            slot=self._timestep % SLOTS_PER_DAY,
            vitality=round(self._vitality, 4),
            cognition=round(self._cognition, 4),
            progress=round(self._progress, 4),
            serenity=round(self._serenity, 4),
            connection=round(self._connection, 4),
            active_event=active_event,
            remaining_steps=MAX_STEPS - self._timestep,
            reward_breakdown=reward_breakdown or {},
            reward=reward,
            done=done,
        )
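Determinism in the environment comes from a per-episode seeded RNG (`self._rng`), which is what `_roll_event` draws from and what the seed-based tests in the test suite below rely on. A minimal standalone sketch of that pattern; `EVENT_PROBABILITY` here is an assumed value, only the event names are taken from the tests:

```python
import random

EVENT_PROBABILITY = 0.15  # assumed for this sketch; the real constant lives in the env
EVENT_NAMES = ["prod_crash", "family_emergency", "illness", "good_news"]

def roll_events(seed: int, steps: int = 28) -> list:
    """Replay the per-step event roll with a per-episode seeded RNG."""
    rng = random.Random(seed)  # one RNG per episode, as in the environment
    events = []
    for _ in range(steps):
        if rng.random() < EVENT_PROBABILITY:
            events.append(rng.choice(EVENT_NAMES))
        else:
            events.append(None)
    return events

# Same seed -> identical event sequence, so episodes are replayable.
assert roll_events(99) == roll_events(99)
```

Keeping the RNG local to the episode (rather than using the module-level `random` functions) is what makes replay-based reward functions, like the GRPO `env_reward` mentioned in the commit message, reproducible.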

tests/test_rhythm_env.py  CHANGED

@@ -4,15 +4,18 @@
  # This source code is licensed under the BSD-style license found in the
  # LICENSE file in the root directory of this source tree.

- """Tests for RhythmEnv"""

  import pytest

  from server.rhythm_environment import (
      MAX_STEPS,
      RhythmEnvironment,
  )
- from models import ActionType, RhythmAction


  @pytest.fixture

@@ -20,287 +23,348 @@ def env():
      return RhythmEnvironment()


  # ---------------------------------------------------------------------------
  # ---------------------------------------------------------------------------

  class TestReset:
          assert obs.timestep == 0
          assert obs.done is False
          assert obs.reward == 0.0

      def test_reset_clears_state(self, env):
          assert obs.timestep == 0
-         assert obs.current_task_id is None
-
-     def test_reset_custom_tasks(self, env):
-         obs = env.reset(
-             task="custom",
-             tasks=[
-                 {"name": "T1", "effort": 0.3, "deadline": 10, "importance": 0.8},
-                 {"name": "T2", "effort": 0.2, "deadline": 15, "importance": 0.5},
-             ],
-             meetings=[5],
-             initial_energy=0.6,
-         )
-         assert len(obs.tasks) == 2
-         assert obs.tasks[0].name == "T1"
-         assert obs.energy == 0.6
-         assert obs.meetings == [5]
-
-     def test_reset_custom_rejects_empty_tasks(self, env):
-         with pytest.raises(ValueError, match="tasks"):
-             env.reset(task="custom", tasks=[])
-
-     def test_reset_custom_clamps_bounds(self, env):
-         obs = env.reset(
-             task="custom",
-             tasks=[{"name": "X", "effort": 99, "deadline": 999, "importance": -5}],
-             initial_energy=50,
-         )
-         t = obs.tasks[0]
-         assert t.effort <= 1.0
-         assert t.deadline <= MAX_STEPS
-         assert t.importance >= 0.1
-         assert obs.energy <= 1.0


  # ---------------------------------------------------------------------------
  # ---------------------------------------------------------------------------

  class TestStep:
          assert obs.timestep == 1

-         assert obs.tasks[0].progress > 0.0
-
-     def test_start_task_invalid_when_already_working(self, env):
-         """START_TASK should fail (idle) when already working on a task."""
-         env.reset(task="easy")
-         env.step(RhythmAction(action_type=ActionType.START_TASK, task_id=0))
-         # Trying START_TASK again while already on task 0 should be invalid
-         obs = env.step(RhythmAction(action_type=ActionType.START_TASK, task_id=1))
-         # Should incur idle penalty since START is invalid when current_task_id is set
-         assert "idle_penalty" in obs.reward_breakdown
-
-     def test_switch_task_requires_current_task(self, env):
-         """SWITCH_TASK should fail when no task is active."""
-         env.reset(task="easy")
-         obs = env.step(RhythmAction(action_type=ActionType.SWITCH_TASK, task_id=0))
-         assert "idle_penalty" in obs.reward_breakdown
-
-     def test_meeting_blocks_action(self, env):
-         """At meeting timesteps, the action should be ignored."""
-         env.reset(task="easy")  # meetings at steps 3 and 11
-         # Advance to step 3 (meeting)
          for _ in range(3):

      def test_episode_ends_at_max_steps(self, env):
          assert obs.done is True

-             initial_energy=1.0,
-         )
-         obs = env.step(RhythmAction(action_type=ActionType.START_TASK, task_id=0))
-         assert obs.done is True
-         assert obs.timestep < MAX_STEPS
-         assert "final_score" in obs.reward_breakdown

-         env.reset(
-             initial_energy=1.0,
-         )
-         # Step 1 (timestep 0→1): work on task B, deadlines approaching → some stress
-         env.step(RhythmAction(action_type=ActionType.START_TASK, task_id=3))
-         # Step 2 (timestep 1→2): continue working, 3 deadlines missed → stress +0.45
-         env.step(RhythmAction(action_type=ActionType.CONTINUE_TASK))
-         # Step 3: continue working on B under high stress
-         obs2 = env.step(RhythmAction(action_type=ActionType.CONTINUE_TASK))
-         # Total progress on B across steps 1-3
-         progress_high_stress_total = obs2.tasks[3].progress
-         # Progress in step 3 alone = total - progress from steps 1+2
-         # But simpler: just compare step-1 progress (no stress) vs later progress.
-         # Since energy decreases each step and stress accumulates, step 3's
-         # marginal progress is less than step 1's.
-         # We can verify stress is significant:
-         assert obs2.stress >= 0.3
-         # And the average per-step progress under stress is lower than the
-         # first step's progress with zero stress and full energy:
-         avg_progress_per_step = progress_high_stress_total / 3
-         assert progress_low_stress > avg_progress_per_step


  # ---------------------------------------------------------------------------
  # ---------------------------------------------------------------------------

          for _ in range(MAX_STEPS):
-         assert 0.0 <= score <= 1.0

          for _ in range(MAX_STEPS):
-         assert score < 0.5

-         """The simple heuristic should beat a do-nothing strategy."""
-         import random
-         random.seed(42)
              break
-         action = RhythmAction(action_type=ActionType.START_TASK, task_id=uncompleted[0].id)
-         obs = env.step(action)
-         heuristic_score = obs.reward_breakdown["final_score"]
-
-         # Do-nothing run
-         obs = env.reset(task="easy")
          for _ in range(MAX_STEPS):

-         assert heuristic_score > idle_score

          for _ in range(MAX_STEPS):
-         return obs.reward_breakdown["final_score"]


  # ---------------------------------------------------------------------------
  # ---------------------------------------------------------------------------

  class TestEdgeCases:
-     def test_reward_breakdown_has_all_components(self, env):
-         env.reset(task="easy")
-         obs = env.step(RhythmAction(action_type=ActionType.START_TASK, task_id=0))
-         # Should have at least progress_reward and stress_penalty
-         assert "progress_reward" in obs.reward_breakdown
-         assert "stress_penalty" in obs.reward_breakdown

# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Tests for the RhythmEnv Life Simulator."""

import pytest

from models import ActionType, RhythmAction, RhythmObservation
from server.rhythm_environment import (
    CRITICAL_THRESHOLD,
    MAX_STEPS,
    METERS,
    PROFILES,
    RhythmEnvironment,
)


@pytest.fixture
def env():
    return RhythmEnvironment()


def make_action(action_type: ActionType) -> RhythmAction:
    return RhythmAction(action_type=action_type)


# ---------------------------------------------------------------------------
# TestReset
# ---------------------------------------------------------------------------


class TestReset:
    def test_returns_valid_observation(self, env):
        obs = env.reset(seed=0)
        assert isinstance(obs, RhythmObservation)
        assert obs.timestep == 0
        assert obs.day == 0
        assert obs.slot == 0
        assert obs.remaining_steps == MAX_STEPS
        assert obs.done is False
        assert obs.reward == 0.0

    def test_meters_initialized(self, env):
        obs = env.reset(seed=0)
        assert 0.0 <= obs.vitality <= 1.0
        assert 0.0 <= obs.cognition <= 1.0
        assert obs.progress == 0.0
        assert 0.0 <= obs.serenity <= 1.0
        assert 0.0 <= obs.connection <= 1.0

    def test_seed_selects_profile(self, env):
        """Different seeds select different profiles."""
        profiles_seen = set()
        for seed in range(3):
            env.reset(seed=seed)
            profiles_seen.add(env.state.profile_name)
        assert len(profiles_seen) == 3

    def test_deterministic_with_same_seed(self, env):
        obs1 = env.reset(seed=42)
        obs2 = env.reset(seed=42)
        assert obs1.vitality == obs2.vitality
        assert obs1.cognition == obs2.cognition
        assert obs1.serenity == obs2.serenity
        assert obs1.connection == obs2.connection

    def test_explicit_profile_selection(self, env):
        env.reset(seed=0, profile="workaholic_stoic")
        assert env.state.profile_name == "workaholic_stoic"

    def test_reset_clears_state(self, env):
        env.reset(seed=0)
        for _ in range(5):
            env.step(make_action(ActionType.DEEP_WORK))
        obs = env.reset(seed=0)
        assert obs.timestep == 0
        assert obs.progress == 0.0


# ---------------------------------------------------------------------------
# TestStep
# ---------------------------------------------------------------------------


class TestStep:
    def test_timestep_advances(self, env):
        env.reset(seed=0)
        obs = env.step(make_action(ActionType.DEEP_WORK))
        assert obs.timestep == 1

    def test_day_and_slot_correct(self, env):
        env.reset(seed=0)
        for _ in range(5):
            obs = env.step(make_action(ActionType.SLEEP))
        assert obs.day == 1
        assert obs.slot == 1

    def test_deep_work_increases_progress(self, env):
        env.reset(seed=0)
        obs = env.step(make_action(ActionType.DEEP_WORK))
        assert obs.progress > 0.0

    def test_deep_work_drains_vitality(self, env):
        env.reset(seed=0)
        initial_vitality = env.state.vitality
        obs = env.step(make_action(ActionType.DEEP_WORK))
        assert obs.vitality < initial_vitality

    def test_sleep_recovers_vitality(self, env):
        env.reset(seed=0)
        for _ in range(3):
            env.step(make_action(ActionType.DEEP_WORK))
        vitality_before_sleep = env.state.vitality
        obs = env.step(make_action(ActionType.SLEEP))
        assert obs.vitality > vitality_before_sleep

    def test_socialize_builds_connection(self, env):
        env.reset(seed=0)
        initial_connection = env.state.connection
        obs = env.step(make_action(ActionType.FAMILY_TIME))
        assert obs.connection > initial_connection - 0.05

    def test_episode_ends_at_max_steps(self, env):
        env.reset(seed=0)
        for i in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.SLEEP))
        assert obs.done is True
        assert obs.timestep == MAX_STEPS

    def test_not_done_before_max_steps(self, env):
        env.reset(seed=0)
        for i in range(MAX_STEPS - 1):
            obs = env.step(make_action(ActionType.SLEEP))
        assert obs.done is False

    def test_meters_stay_in_bounds(self, env):
        """No meter exceeds [0.0, 1.0] regardless of actions."""
        env.reset(seed=0)
        for _ in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.DEEP_WORK))
            for meter in METERS:
                val = getattr(obs, meter)
                assert 0.0 <= val <= 1.0, f"{meter}={val} out of bounds"

    def test_low_vitality_reduces_effectiveness(self, env):
        """Progress gain should be lower when vitality is low."""
        env.reset(seed=0, profile="introvert_morning")
        obs_high = env.step(make_action(ActionType.DEEP_WORK))
        progress_high = obs_high.progress

        env.reset(seed=0, profile="introvert_morning")
        for _ in range(6):
            env.step(make_action(ActionType.DEEP_WORK))
        progress_before = env.state.progress
        env.step(make_action(ActionType.DEEP_WORK))
        progress_gained_low = env.state.progress - progress_before

        assert progress_high > progress_gained_low


# ---------------------------------------------------------------------------
# TestProfiles
# ---------------------------------------------------------------------------


class TestProfiles:
    def test_introvert_social_drains_more(self, env):
        """Introvert loses more vitality from socializing than extrovert."""
        env.reset(seed=0, profile="introvert_morning")
        v_before_intro = env.state.vitality
        env.step(make_action(ActionType.SOCIALIZE))
        intro_drain = v_before_intro - env.state.vitality

        env2 = RhythmEnvironment()
        env2.reset(seed=0, profile="extrovert_night_owl")
        v_before_extro = env2.state.vitality
        env2.step(make_action(ActionType.SOCIALIZE))
        extro_drain = v_before_extro - env2.state.vitality

        assert intro_drain > extro_drain

    def test_workaholic_progress_gives_serenity(self, env):
        """Workaholic has better serenity outcome from deep work than introvert."""
        env.reset(seed=0, profile="workaholic_stoic")
        serenity_before = env.state.serenity
        env.step(make_action(ActionType.DEEP_WORK))
        workaholic_change = env.state.serenity - serenity_before

        env2 = RhythmEnvironment()
        env2.reset(seed=0, profile="introvert_morning")
        serenity_before_intro = env2.state.serenity
        env2.step(make_action(ActionType.DEEP_WORK))
        introvert_change = env2.state.serenity - serenity_before_intro

        assert workaholic_change > introvert_change

    def test_binge_shame_introvert(self, env):
        """Introvert suffers extra serenity loss from binge watching."""
        env.reset(seed=0, profile="introvert_morning")
        serenity_before = env.state.serenity
        env.step(make_action(ActionType.BINGE_WATCH))
        intro_change = env.state.serenity - serenity_before

        env2 = RhythmEnvironment()
        env2.reset(seed=0, profile="extrovert_night_owl")
        serenity_before_ext = env2.state.serenity
        env2.step(make_action(ActionType.BINGE_WATCH))
        ext_change = env2.state.serenity - serenity_before_ext

        assert intro_change < ext_change

    def test_different_rewards_same_action(self, env):
        """Same action produces different rewards for different profiles."""
        rewards = {}
        for profile_name in ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]:
            e = RhythmEnvironment()
            e.reset(seed=0, profile=profile_name)
            obs = e.step(make_action(ActionType.DEEP_WORK))
            rewards[profile_name] = obs.reward

        values = list(rewards.values())
        assert len(set(round(v, 3) for v in values)) > 1

    def test_extrovert_night_cognition_bonus(self, env):
        """Extrovert gets better cognition gains in evening vs morning."""
        env.reset(seed=0, profile="extrovert_night_owl")
        env.step(make_action(ActionType.SLEEP))  # morning
        env.step(make_action(ActionType.SLEEP))  # afternoon
        cognition_before = env.state.cognition
        env.step(make_action(ActionType.MEDITATE))  # evening
        evening_gain = env.state.cognition - cognition_before

        env.reset(seed=0, profile="extrovert_night_owl")
        cognition_before_m = env.state.cognition
        env.step(make_action(ActionType.MEDITATE))  # morning
        morning_gain = env.state.cognition - cognition_before_m

        assert evening_gain > morning_gain


# ---------------------------------------------------------------------------
# TestEvents
# ---------------------------------------------------------------------------


class TestEvents:
    def test_events_deterministic_with_seed(self, env):
        """Same seed produces same event sequence."""
        events1 = []
        env.reset(seed=99)
        for _ in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.SLEEP))
            events1.append(obs.active_event)

        events2 = []
        env.reset(seed=99)
        for _ in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.SLEEP))
            events2.append(obs.active_event)

        assert events1 == events2

    def test_event_visible_in_observation(self, env):
        """When an event fires, active_event is set in observation."""
        found_event = False
        for seed in range(100):
            env.reset(seed=seed)
            for _ in range(MAX_STEPS):
                obs = env.step(make_action(ActionType.SLEEP))
                if obs.active_event is not None:
                    found_event = True
                    assert obs.active_event in [
                        "prod_crash", "family_emergency", "illness", "good_news"
                    ]
                    break
            if found_event:
                break
        assert found_event, "No events triggered in 100 episodes"

    def test_no_event_when_none(self, env):
        """Most steps should have no event."""
        env.reset(seed=0)
        no_event_count = 0
        for _ in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.SLEEP))
            if obs.active_event is None:
                no_event_count += 1
        assert no_event_count > MAX_STEPS * 0.7


# ---------------------------------------------------------------------------
# TestGrader
# ---------------------------------------------------------------------------


class TestGrader:
    def test_final_score_in_range(self, env):
        env.reset(seed=0)
        for _ in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.SLEEP))
        assert "final_score" in obs.reward_breakdown
        score = obs.reward_breakdown["final_score"]
        assert 0.0 <= score <= 1.0

    def test_balanced_play_beats_all_sleep(self, env):
        """A balanced strategy should score higher than just sleeping."""
        env.reset(seed=0)
        for _ in range(MAX_STEPS):
            obs_sleep = env.step(make_action(ActionType.SLEEP))
        score_sleep = obs_sleep.reward_breakdown["final_score"]

        balanced_actions = [
            ActionType.DEEP_WORK, ActionType.LEARN,
            ActionType.EXERCISE, ActionType.FAMILY_TIME,
        ] * 7
        env.reset(seed=0)
        for action_type in balanced_actions:
            obs_balanced = env.step(make_action(action_type))
        score_balanced = obs_balanced.reward_breakdown["final_score"]

        assert score_balanced > score_sleep

    def test_deterministic_grading(self, env):
        """Same actions produce same final score."""
        scores = []
        for _ in range(2):
            env.reset(seed=42)
            for _ in range(MAX_STEPS):
                obs = env.step(make_action(ActionType.DEEP_WORK))
            scores.append(obs.reward_breakdown["final_score"])
        assert scores[0] == scores[1]

    def test_all_binge_scores_low(self, env):
        """Binge watching everything should produce a low score."""
        env.reset(seed=0)
        for _ in range(MAX_STEPS):
            obs = env.step(make_action(ActionType.BINGE_WATCH))
        score = obs.reward_breakdown["final_score"]
        assert score < 0.5


# ---------------------------------------------------------------------------
|
| 346 |
+
# TestEdgeCases
|
| 347 |
# ---------------------------------------------------------------------------
|
| 348 |
|
| 349 |
+
|
| 350 |
class TestEdgeCases:
|
| 351 |
+
def test_observation_hides_profile(self, env):
|
| 352 |
+
"""Observation should not expose profile_name."""
|
| 353 |
+
obs = env.reset(seed=0)
|
| 354 |
+
obs_dict = obs.model_dump()
|
| 355 |
+
assert "profile_name" not in obs_dict
|
| 356 |
+
|
| 357 |
+
def test_state_exposes_profile(self, env):
|
| 358 |
+
"""State should include profile_name for debugging."""
|
| 359 |
+
env.reset(seed=0)
|
| 360 |
+
assert env.state.profile_name != ""
|
| 361 |
+
assert env.state.profile_name in [p["name"] for p in PROFILES]
|
| 362 |
+
|
| 363 |
+
def test_all_action_types_valid(self, env):
|
| 364 |
+
"""Every ActionType should be processable without error."""
|
| 365 |
+
env.reset(seed=0)
|
| 366 |
+
for action_type in ActionType:
|
| 367 |
+
e = RhythmEnvironment()
|
| 368 |
+
e.reset(seed=0)
|
| 369 |
+
obs = e.step(make_action(action_type))
|
| 370 |
+
assert isinstance(obs, RhythmObservation)
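The determinism tests above hinge on the environment deriving its event schedule from the `reset(seed=...)` value alone. A self-contained sketch of that seeding pattern (the event probability and function names here are illustrative, not the actual `RhythmEnvironment` internals):

```python
import random

EVENTS = ["prod_crash", "family_emergency", "illness", "good_news"]

def event_schedule(seed: int, steps: int = 28, p: float = 0.1):
    # A dedicated Random instance keyed on the seed makes the schedule
    # reproducible regardless of global RNG state, which is the property
    # test_events_deterministic_with_seed relies on.
    rng = random.Random(seed)
    return [rng.choice(EVENTS) if rng.random() < p else None for _ in range(steps)]

# Same seed yields the same sequence; most steps carry no event.
assert event_schedule(99) == event_schedule(99)
assert any(e is None for e in event_schedule(0))
```

With a low per-step probability, "most steps have no event" (the `> MAX_STEPS * 0.7` check) holds for the vast majority of seeds without being guaranteed for every one.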
training/RhythmEnv_GRPO_Training.ipynb
ADDED
@@ -0,0 +1,557 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RhythmEnv GRPO Training - Life Simulator\n",
    "\n",
    "Train an LLM agent to balance 5 life meters across a 7-day week using **Group Relative Policy Optimization (GRPO)**.\n",
    "\n",
    "The agent must discover hidden personality profiles (introvert, extrovert, workaholic) through reward signals alone.\n",
    "\n",
    "**Requirements**: a free Colab GPU (T4) is sufficient. Training takes ~30-60 minutes for 500 steps.\n",
    "\n",
    "**Environment**: https://huggingface.co/spaces/InosLihka/rhythm_env"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install --upgrade uv\n",
    "!uv pip install --system \"torch>=2.8.0\" \"triton>=3.4.0\" torchvision bitsandbytes\n",
    "!uv pip install --system transformers==4.56.2 trl==0.22.2 datasets wandb\n",
    "!pip install unsloth\n",
    "!pip install fastapi uvicorn pydantic matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Clone RhythmEnv & Setup Paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Clone the repo (update URL to your actual repo)\n",
    "REPO_URL = \"https://huggingface.co/spaces/InosLihka/rhythm_env\"\n",
    "ENV_DIR = \"/content/rhythm_env\"\n",
    "\n",
    "if not os.path.exists(ENV_DIR):\n",
    "    !git clone {REPO_URL} {ENV_DIR}\n",
    "\n",
    "os.chdir(ENV_DIR)\n",
    "\n",
    "import sys\n",
    "sys.path.insert(0, ENV_DIR)\n",
    "sys.path.insert(0, os.path.join(ENV_DIR, \"training\"))\n",
    "\n",
    "print(f\"Working directory: {os.getcwd()}\")\n",
    "print(f\"Files: {os.listdir('.')}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Verify Environment Works"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from models import ActionType, RhythmAction\n",
    "from server.rhythm_environment import RhythmEnvironment, MAX_STEPS\n",
    "\n",
    "env = RhythmEnvironment()\n",
    "obs = env.reset(seed=0)\n",
    "print(f\"Initial state - Vitality: {obs.vitality:.2f}, Cognition: {obs.cognition:.2f}, \"\n",
    "      f\"Progress: {obs.progress:.2f}, Serenity: {obs.serenity:.2f}, Connection: {obs.connection:.2f}\")\n",
    "print(f\"Day: {obs.day}, Slot: {obs.slot}, Steps: {obs.remaining_steps}\")\n",
    "\n",
    "obs = env.step(RhythmAction(action_type=ActionType.DEEP_WORK))\n",
    "print(f\"\\nAfter DEEP_WORK - Reward: {obs.reward:.3f}\")\n",
    "print(f\"Vitality: {obs.vitality:.2f}, Progress: {obs.progress:.2f}\")\n",
    "print(\"\\nEnvironment is working!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Run Baseline Evaluation (Before Training)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from training.inference_eval import run_episode, heuristic_action\n",
    "import json\n",
    "\n",
    "PROFILES = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
    "NUM_EVAL_EPISODES = 5\n",
    "\n",
    "baseline_results = {}\n",
    "\n",
    "for strategy in [\"heuristic\", \"random\"]:\n",
    "    baseline_results[strategy] = {}\n",
    "    for profile in PROFILES:\n",
    "        scores = []\n",
    "        for ep in range(NUM_EVAL_EPISODES):\n",
    "            result = run_episode(profile=profile, seed=ep, strategy=strategy)\n",
    "            scores.append(result[\"final_score\"])\n",
    "        avg = sum(scores) / len(scores)\n",
    "        baseline_results[strategy][profile] = avg\n",
    "        print(f\"{strategy:10s} | {profile:25s} | avg_score={avg:.3f}\")\n",
    "\n",
    "print(\"\\nBaseline scores saved. These are our 'before training' numbers.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Generate Training Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "from dataset import generate_dataset\nfrom datasets import Dataset\n\nNUM_EPISODES = 200\nMAX_SAMPLES = 2000\n\nraw_samples = generate_dataset(\n    num_episodes=NUM_EPISODES,\n    strategy=\"mixed\",\n    max_samples=MAX_SAMPLES,\n)\n\n# Include replay metadata so env_reward can reconstruct the exact state\ndataset = Dataset.from_list([\n    {\n        \"prompt\": sample[\"prompt\"],\n        \"seed\": sample[\"seed\"],\n        \"step_index\": sample[\"step_index\"],\n        \"action_history\": sample[\"action_history\"],\n    }\n    for sample in raw_samples\n])\n\nprint(f\"\\nDataset size: {len(dataset)}\")\nprint(\"\\nSample prompt:\")\nprint(dataset[0][\"prompt\"][1][\"content\"])"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Load Model with Unsloth (4-bit + LoRA)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from unsloth import FastLanguageModel\n",
    "\n",
    "MODEL_NAME = \"unsloth/Qwen2.5-3B-Instruct\"  # Good balance of capability vs VRAM\n",
    "MAX_SEQ_LENGTH = 768\n",
    "LORA_RANK = 4\n",
    "\n",
    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
    "    model_name=MODEL_NAME,\n",
    "    load_in_4bit=True,\n",
    "    max_seq_length=MAX_SEQ_LENGTH,\n",
    ")\n",
    "\n",
    "model = FastLanguageModel.get_peft_model(\n",
    "    model,\n",
    "    r=LORA_RANK,\n",
    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
    "                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
    "    lora_alpha=LORA_RANK * 2,\n",
    "    use_gradient_checkpointing=\"unsloth\",\n",
    "    random_state=3407,\n",
    ")\n",
    "\n",
    "print(f\"Model loaded: {MODEL_NAME}\")\n",
    "print(f\"LoRA rank: {LORA_RANK}, seq_length: {MAX_SEQ_LENGTH}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Setup Reward Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "from reward_functions import format_valid, action_legal, env_reward\n\n# Three-layer reward stack:\n# 1. format_valid - did the model output a parseable action name? (+1.0 / -2.0)\n# 2. action_legal - is it one of the 10 valid actions? (+0.5 / -1.0)\n# 3. env_reward   - actual reward from stepping the environment (varies)\n#\n# GRPOTrainer passes seed/step_index/action_history as dataset columns,\n# so env_reward replays the exact episode state for each training prompt.\nreward_funcs = [format_valid, action_legal, env_reward]\n\n# Quick sanity check\ntest_completions = [\n    [{\"content\": \"DEEP_WORK\"}],\n    [{\"content\": \"garbage\"}],\n    [{\"content\": \"SLEEP\"}],\n]\nprint(\"format_valid:\", format_valid(test_completions))\nprint(\"action_legal:\", action_legal(test_completions))\nprint(\"env_reward (no history, seed=0):\", env_reward(test_completions))"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Configure & Run GRPO Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import wandb\n",
    "\n",
    "# Optional: login to W&B for tracking (or keep report_to=\"none\" below)\n",
    "# wandb.login(key=\"YOUR_KEY\")\n",
    "\n",
    "REPORT_TO = \"none\"  # Change to \"wandb\" if you have a key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from trl import GRPOConfig, GRPOTrainer\n",
    "\n",
    "# Named TRAIN_MAX_STEPS so it does not shadow the environment's MAX_STEPS (28)\n",
    "TRAIN_MAX_STEPS = 500  # Increase to 1000 if time allows\n",
    "NUM_GENERATIONS = 4\n",
    "LEARNING_RATE = 2e-4\n",
    "\n",
    "max_prompt_length = 400\n",
    "max_completion_length = MAX_SEQ_LENGTH - max_prompt_length\n",
    "\n",
    "training_args = GRPOConfig(\n",
    "    temperature=1.0,\n",
    "    learning_rate=LEARNING_RATE,\n",
    "    weight_decay=0.001,\n",
    "    warmup_ratio=0.1,\n",
    "    lr_scheduler_type=\"linear\",\n",
    "    optim=\"adamw_8bit\",\n",
    "    logging_steps=1,\n",
    "    per_device_train_batch_size=1,\n",
    "    gradient_accumulation_steps=4,\n",
    "    num_generations=NUM_GENERATIONS,\n",
    "    max_prompt_length=max_prompt_length,\n",
    "    max_completion_length=max_completion_length,\n",
    "    max_steps=TRAIN_MAX_STEPS,\n",
    "    save_steps=100,\n",
    "    report_to=REPORT_TO,\n",
    "    output_dir=\"outputs/rhythmenv_trained\",\n",
    ")\n",
    "\n",
    "trainer = GRPOTrainer(\n",
    "    model=model,\n",
    "    processing_class=tokenizer,\n",
    "    reward_funcs=reward_funcs,\n",
    "    args=training_args,\n",
    "    train_dataset=dataset,\n",
    ")\n",
    "\n",
    "print(f\"Training config: {TRAIN_MAX_STEPS} steps, {NUM_GENERATIONS} generations, lr={LEARNING_RATE}\")\n",
    "print(\"Starting training...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Save Trained Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "OUTPUT_DIR = \"outputs/rhythmenv_trained\"\n",
    "\n",
    "model.save_pretrained_merged(\n",
    "    OUTPUT_DIR,\n",
    "    tokenizer,\n",
    "    save_method=\"merged_16bit\",\n",
    ")\n",
    "print(f\"Model saved to: {OUTPUT_DIR}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Generate Training Plots\n",
    "\n",
    "These plots are **required** for the hackathon submission. They must be saved as .png files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import json\n",
    "import os\n",
    "\n",
    "# Extract training logs from trainer\n",
    "log_history = trainer.state.log_history\n",
    "\n",
    "steps = []\n",
    "losses = []\n",
    "rewards = []\n",
    "reward_stds = []\n",
    "\n",
    "for entry in log_history:\n",
    "    if \"loss\" in entry:\n",
    "        steps.append(entry.get(\"step\", 0))\n",
    "        losses.append(entry[\"loss\"])\n",
    "    if \"reward\" in entry:\n",
    "        rewards.append(entry[\"reward\"])\n",
    "    if \"reward_std\" in entry:\n",
    "        reward_stds.append(entry[\"reward_std\"])\n",
    "\n",
    "# Also try the rewards/mean key used by some TRL versions\n",
    "if not rewards:\n",
    "    for entry in log_history:\n",
    "        if \"rewards/mean\" in entry:\n",
    "            rewards.append(entry[\"rewards/mean\"])\n",
    "        if \"rewards/std\" in entry:\n",
    "            reward_stds.append(entry[\"rewards/std\"])\n",
    "\n",
    "os.makedirs(\"plots\", exist_ok=True)\n",
    "\n",
    "# --- Plot 1: Training Loss ---\n",
    "fig, ax = plt.subplots(figsize=(10, 5))\n",
    "ax.plot(steps[:len(losses)], losses, color=\"#2563eb\", linewidth=1.5, alpha=0.8)\n",
    "ax.set_xlabel(\"Training Step\", fontsize=12)\n",
    "ax.set_ylabel(\"Loss\", fontsize=12)\n",
    "ax.set_title(\"GRPO Training Loss - RhythmEnv Life Simulator\", fontsize=14)\n",
    "ax.grid(True, alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.savefig(\"plots/training_loss.png\", dpi=150)\n",
    "plt.show()\n",
    "print(\"Saved: plots/training_loss.png\")\n",
    "\n",
    "# --- Plot 2: Mean Reward ---\n",
    "if rewards:\n",
    "    fig, ax = plt.subplots(figsize=(10, 5))\n",
    "    reward_steps = steps[:len(rewards)]\n",
    "    ax.plot(reward_steps, rewards, color=\"#16a34a\", linewidth=1.5, alpha=0.8, label=\"Mean Reward\")\n",
    "    if reward_stds and len(reward_stds) == len(rewards):\n",
    "        import numpy as np\n",
    "        r = np.array(rewards)\n",
    "        s = np.array(reward_stds)\n",
    "        ax.fill_between(reward_steps, r - s, r + s, color=\"#16a34a\", alpha=0.15, label=\"±1 Std Dev\")\n",
    "    ax.set_xlabel(\"Training Step\", fontsize=12)\n",
    "    ax.set_ylabel(\"Mean Reward\", fontsize=12)\n",
    "    ax.set_title(\"GRPO Mean Reward - RhythmEnv Life Simulator\", fontsize=14)\n",
    "    ax.legend()\n",
    "    ax.grid(True, alpha=0.3)\n",
    "    plt.tight_layout()\n",
    "    plt.savefig(\"plots/reward_curve.png\", dpi=150)\n",
    "    plt.show()\n",
    "    print(\"Saved: plots/reward_curve.png\")\n",
    "else:\n",
    "    print(\"No reward data in logs. Check trainer.state.log_history keys:\")\n",
    "    if log_history:\n",
    "        print(list(log_history[0].keys()))\n",
    "\n",
    "# Save raw log data for reference\n",
    "with open(\"plots/training_log.json\", \"w\") as f:\n",
    "    json.dump(log_history, f, indent=2)\n",
    "print(\"Saved: plots/training_log.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. Evaluate Trained Model vs Baselines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from training.inference_eval import run_episode, model_action\n",
    "from unsloth import FastLanguageModel\n",
    "\n",
    "# Load trained model for inference\n",
    "trained_model, trained_tokenizer = FastLanguageModel.from_pretrained(\n",
    "    model_name=\"outputs/rhythmenv_trained\",\n",
    "    load_in_4bit=True,\n",
    "    max_seq_length=768,\n",
    ")\n",
    "FastLanguageModel.for_inference(trained_model)\n",
    "\n",
    "PROFILES = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
    "NUM_EVAL = 5\n",
    "\n",
    "trained_results = {}\n",
    "all_eval_results = []\n",
    "\n",
    "for profile in PROFILES:\n",
    "    scores = []\n",
    "    for ep in range(NUM_EVAL):\n",
    "        result = run_episode(\n",
    "            profile=profile, seed=ep, strategy=\"model\",\n",
    "            model=trained_model, tokenizer=trained_tokenizer,\n",
    "        )\n",
    "        scores.append(result[\"final_score\"])\n",
    "        all_eval_results.append(result)\n",
    "    avg = sum(scores) / len(scores)\n",
    "    trained_results[profile] = avg\n",
    "    print(f\"Trained | {profile:25s} | avg_score={avg:.3f} | scores={[round(s, 3) for s in scores]}\")\n",
    "\n",
    "# Save eval results\n",
    "with open(\"plots/eval_results.json\", \"w\") as f:\n",
    "    json.dump(all_eval_results, f, indent=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12. Comparison Plot (Before vs After Training)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "profiles_short = [\"Introvert\\nMorning\", \"Extrovert\\nNight Owl\", \"Workaholic\\nStoic\"]\n",
    "profile_keys = [\"introvert_morning\", \"extrovert_night_owl\", \"workaholic_stoic\"]\n",
    "\n",
    "random_scores = [baseline_results[\"random\"][p] for p in profile_keys]\n",
    "heuristic_scores = [baseline_results[\"heuristic\"][p] for p in profile_keys]\n",
    "trained_scores = [trained_results[p] for p in profile_keys]\n",
    "\n",
    "x = np.arange(len(profiles_short))\n",
    "width = 0.25\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "bars1 = ax.bar(x - width, random_scores, width, label=\"Random Baseline\", color=\"#94a3b8\")\n",
    "bars2 = ax.bar(x, heuristic_scores, width, label=\"Heuristic Baseline\", color=\"#60a5fa\")\n",
    "bars3 = ax.bar(x + width, trained_scores, width, label=\"GRPO Trained\", color=\"#22c55e\")\n",
    "\n",
    "ax.set_ylabel(\"Final Episode Score (0-1)\", fontsize=12)\n",
    "ax.set_title(\"RhythmEnv: Baseline vs Trained Agent Across Hidden Profiles\", fontsize=14)\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(profiles_short, fontsize=11)\n",
    "ax.legend(fontsize=11)\n",
    "ax.set_ylim(0, 1.0)\n",
    "ax.grid(axis=\"y\", alpha=0.3)\n",
    "\n",
    "# Add value labels on bars\n",
    "for bars in [bars1, bars2, bars3]:\n",
    "    for bar in bars:\n",
    "        height = bar.get_height()\n",
    "        ax.annotate(f\"{height:.3f}\",\n",
    "                    xy=(bar.get_x() + bar.get_width() / 2, height),\n",
    "                    xytext=(0, 3), textcoords=\"offset points\",\n",
    "                    ha=\"center\", va=\"bottom\", fontsize=9)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(\"plots/baseline_vs_trained.png\", dpi=150)\n",
    "plt.show()\n",
    "print(\"Saved: plots/baseline_vs_trained.png\")\n",
    "\n",
    "# Print summary table\n",
    "print(f\"\\n{'Profile':<25} {'Random':>10} {'Heuristic':>10} {'Trained':>10} {'Improvement':>12}\")\n",
    "print(\"-\" * 70)\n",
    "for i, p in enumerate(profile_keys):\n",
    "    improvement = trained_scores[i] - heuristic_scores[i]\n",
    "    print(f\"{p:<25} {random_scores[i]:>10.3f} {heuristic_scores[i]:>10.3f} {trained_scores[i]:>10.3f} {improvement:>+12.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13. Download Plots\n",
    "\n",
    "Download these files and commit them to your repo for the README:\n",
    "- `plots/training_loss.png`\n",
    "- `plots/reward_curve.png`\n",
    "- `plots/baseline_vs_trained.png`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List all generated artifacts\n",
    "print(\"Generated files:\")\n",
    "for f in os.listdir(\"plots\"):\n",
    "    size = os.path.getsize(f\"plots/{f}\")\n",
    "    print(f\"  plots/{f} ({size:,} bytes)\")\n",
    "\n",
    "print(\"\\nModel saved at: outputs/rhythmenv_trained/\")\n",
    "print(\"\\nNext steps:\")\n",
    "print(\"1. Download plots/*.png and commit to your repo\")\n",
    "print(\"2. Embed plots in README.md\")\n",
    "print(\"3. Create mini-blog or video\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: zip everything for easy download\n",
    "!zip -r /content/rhythmenv_training_artifacts.zip plots/ outputs/rhythmenv_trained/training_config.json\n",
    "print(\"Download: /content/rhythmenv_training_artifacts.zip\")"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
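Cell 7 of the notebook imports `format_valid` and `action_legal` from `reward_functions.py`, which falls outside this part of the diff. A minimal self-contained sketch of those two parsing layers, assuming TRL's conversational completion format (each completion arrives as a list of `{"content": ...}` messages) and the reward values the notebook's comments describe; the exact parsing logic in the real file may differ:

```python
VALID_ACTIONS = {
    "DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE",
    "MEDITATE", "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH",
}

def _extract(completion) -> str:
    # TRL conversational completions are lists of message dicts.
    return completion[0]["content"].strip().upper()

def format_valid(completions, **kwargs) -> list:
    # +1.0 for a single action-shaped token (letters/underscores, no spaces),
    # -2.0 for anything else, per the notebook's reward-stack comment.
    return [
        1.0 if " " not in _extract(c) and _extract(c).replace("_", "").isalpha()
        else -2.0
        for c in completions
    ]

def action_legal(completions, **kwargs) -> list:
    # +0.5 if the token is one of the 10 real actions, -1.0 otherwise.
    return [0.5 if _extract(c) in VALID_ACTIONS else -1.0 for c in completions]

assert format_valid([[{"content": "DEEP_WORK"}], [{"content": "I think SLEEP"}]]) == [1.0, -2.0]
assert action_legal([[{"content": "DEEP_WORK"}], [{"content": "NAP"}]]) == [0.5, -1.0]
```

Keeping the cheap format checks separate from the expensive `env_reward` replay means malformed outputs are penalized without ever stepping the environment.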
training/dataset.py
ADDED
@@ -0,0 +1,180 @@
"""
Dataset generator for RhythmEnv GRPO training.

Generates observation prompts by playing episodes with random/heuristic
strategies, capturing diverse states across all profiles and time slots.
"""

import sys
import os
import random

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from models import ActionType, RhythmAction
from server.rhythm_environment import RhythmEnvironment, MAX_STEPS, METERS

SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

SYSTEM_PROMPT = (
    "You are a life management agent. You balance 5 life meters over a 7-day week.\n"
    "Each day has 4 slots: Morning, Afternoon, Evening, Night. Total: 28 steps.\n\n"
    "Meters (0.0 to 1.0):\n"
    "- Vitality: physical energy, sleep quality\n"
    "- Cognition: mental clarity, focus\n"
    "- Progress: career/skill advancement\n"
    "- Serenity: inner peace, stress management\n"
    "- Connection: relationship health\n\n"
    "Available actions: DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, "
    "FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
    "Key rules:\n"
    "- Low Vitality reduces the effectiveness of all positive actions.\n"
    "- If ANY meter drops below 0.1, you get a heavy penalty.\n"
    "- Connection decays passively if you don't socialize.\n"
    "- Different episodes may have different hidden dynamics - adapt!\n"
    "- Time of day matters for cognitive tasks.\n\n"
    "Respond with ONLY the action name, nothing else."
)


def format_observation_prompt(obs) -> str:
    """Format an observation into a user prompt for the LLM."""
    day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
    slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
    event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""

    return (
        f"Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})\n"
        f"Remaining steps: {obs.remaining_steps}\n\n"
        f"Meters:\n"
        f"  Vitality: {obs.vitality:.2f}\n"
        f"  Cognition: {obs.cognition:.2f}\n"
        f"  Progress: {obs.progress:.2f}\n"
        f"  Serenity: {obs.serenity:.2f}\n"
        f"  Connection: {obs.connection:.2f}"
        f"{event_str}\n\n"
        f"Choose your action:"
    )


def generate_episode_samples(seed: int, strategy: str = "random") -> list:
    """
    Play one episode and return a list of sample dicts (prompt, seed,
    step_index, action_history).

    Each sample captures the state at a specific step, along with enough info
    to replay to that point for reward evaluation.
    """
    env = RhythmEnvironment()
    obs = env.reset(seed=seed)
    rng = random.Random(seed + 1000)
    actions_taken = []
    samples = []
    all_actions = list(ActionType)

    for step in range(MAX_STEPS):
        if obs.done:
            break

        # Generate prompt from current observation
        prompt = format_observation_prompt(obs)

        # Record sample with replay info
        samples.append({
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            "seed": seed,
            "step_index": step,
            "action_history": list(actions_taken),  # copy for replay
        })

        # Choose action based on strategy
        if strategy == "random":
            action_type = rng.choice(all_actions)
        elif strategy == "heuristic":
            action_type = _heuristic_action(obs)
        else:
            action_type = rng.choice(all_actions)

        action = RhythmAction(action_type=action_type)
        actions_taken.append(action_type.value)
        obs = env.step(action)

    return samples


def _heuristic_action(obs) -> ActionType:
    """Simple heuristic for diverse state generation."""
    slot = obs.slot
    v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection

    if v < 0.15:
        return ActionType.SLEEP
    if s < 0.15:
        return ActionType.MEDITATE
    if cn < 0.15:
        return ActionType.FAMILY_TIME
    if slot == 3:  # night
        return ActionType.SLEEP
    if slot == 0:  # morning
        return ActionType.DEEP_WORK if (v > 0.4 and c > 0.3) else ActionType.EXERCISE
    if slot == 1:  # afternoon
        if cn < 0.3:
            return ActionType.FAMILY_TIME
        if p < 0.3 and v > 0.3:
            return ActionType.LEARN
        return ActionType.ADMIN_WORK
    # evening
    if cn < 0.4:
        return ActionType.SOCIALIZE
    if s < 0.5:
        return ActionType.ME_TIME
    return ActionType.MEDITATE


def generate_dataset(
    num_episodes: int = 200,
    strategy: str = "mixed",
    max_samples: int = 2000,
) -> list:
    """
    Generate a training dataset by playing multiple episodes.

    Args:
        num_episodes: Number of episodes to play.
        strategy: "random", "heuristic", or "mixed" (alternating).
        max_samples: Maximum samples to return.

    Returns:
        List of prompt dicts suitable for HF Dataset.from_list().
    """
    all_samples = []
    episodes_played = 0

    for i in range(num_episodes):
        seed = i
        if strategy == "mixed":
            s = "heuristic" if i % 2 == 0 else "random"
        else:
            s = strategy

        episode_samples = generate_episode_samples(seed=seed, strategy=s)
        all_samples.extend(episode_samples)
        episodes_played = i + 1

        if len(all_samples) >= max_samples:
            break

    # Shuffle and truncate
    random.shuffle(all_samples)
    all_samples = all_samples[:max_samples]

    print(f"Generated {len(all_samples)} training samples from {episodes_played} episodes")
    return all_samples


if __name__ == "__main__":
    samples = generate_dataset(num_episodes=50, strategy="mixed", max_samples=500)
    print("\nFirst sample prompt:")
    print(samples[0]["prompt"][1]["content"])
    print(f"\nSeed: {samples[0]['seed']}, Step: {samples[0]['step_index']}")
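The prompt format above can be exercised without the environment by substituting a lightweight stand-in for the real `RhythmObservation`. A minimal sketch (the `SimpleNamespace` field values below are hypothetical, not from a real episode):

```python
from types import SimpleNamespace

SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MAX_STEPS = 28


def format_observation_prompt(obs) -> str:
    """Render one observation as the user-turn text the LLM sees."""
    day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day}"
    slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
    event_str = f"\nActive event: {obs.active_event}" if obs.active_event else ""
    return (
        f"Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})\n"
        f"Remaining steps: {obs.remaining_steps}\n\n"
        f"Meters:\n"
        f"  Vitality: {obs.vitality:.2f}\n"
        f"  Cognition: {obs.cognition:.2f}\n"
        f"  Progress: {obs.progress:.2f}\n"
        f"  Serenity: {obs.serenity:.2f}\n"
        f"  Connection: {obs.connection:.2f}"
        f"{event_str}\n\n"
        f"Choose your action:"
    )


# Hypothetical observation standing in for the real RhythmObservation
obs = SimpleNamespace(
    day=2, slot=1, timestep=9, remaining_steps=19, active_event=None,
    vitality=0.62, cognition=0.48, progress=0.31, serenity=0.55, connection=0.27,
)
prompt = format_observation_prompt(obs)
print(prompt)
```

This makes the rendered prompt easy to eyeball (and unit-test) before any GPU is involved.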
training/inference_eval.py ADDED
@@ -0,0 +1,226 @@
"""
RhythmEnv Inference Evaluation: baseline vs. trained comparison.

Runs both a heuristic baseline and a trained model on all 3 profiles,
producing a comparison table and score summary.

Usage:
    # Heuristic baseline only:
    python training/inference_eval.py

    # With trained model:
    python training/inference_eval.py --model_path outputs/rhythmenv_trained
"""

import argparse
import json
import os
import random
import sys
from typing import Optional

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from models import ActionType, RhythmAction
from server.rhythm_environment import RhythmEnvironment, MAX_STEPS

PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def heuristic_action(obs) -> ActionType:
    """Priority-based heuristic baseline."""
    slot = obs.slot
    v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection

    if v < 0.15:
        return ActionType.SLEEP
    if s < 0.15:
        return ActionType.MEDITATE
    if cn < 0.15:
        return ActionType.FAMILY_TIME
    if slot == 3:
        return ActionType.SLEEP
    if slot == 0:
        return ActionType.DEEP_WORK if (v > 0.4 and c > 0.3) else ActionType.EXERCISE
    if slot == 1:
        if cn < 0.3:
            return ActionType.FAMILY_TIME
        if p < 0.3 and v > 0.3:
            return ActionType.LEARN
        return ActionType.ADMIN_WORK
    if cn < 0.4:
        return ActionType.SOCIALIZE
    if s < 0.5:
        return ActionType.ME_TIME
    return ActionType.MEDITATE


def random_action(rng) -> ActionType:
    """Random baseline."""
    return rng.choice(list(ActionType))


def run_episode(profile: str, seed: int, strategy: str, model=None, tokenizer=None) -> dict:
    """Run a single episode and return results."""
    rng = random.Random(seed + 500)

    env = RhythmEnvironment()
    obs = env.reset(seed=seed, profile=profile)

    total_reward = 0.0
    actions_taken = []

    for step in range(MAX_STEPS):
        if obs.done:
            break

        if strategy == "heuristic":
            action_type = heuristic_action(obs)
        elif strategy == "random":
            action_type = random_action(rng)
        elif strategy == "model" and model is not None:
            action_type = model_action(obs, model, tokenizer)
        else:
            action_type = random_action(rng)

        action = RhythmAction(action_type=action_type)
        actions_taken.append(action_type.value)
        obs = env.step(action)
        total_reward += obs.reward

    final_score = obs.reward_breakdown.get("final_score", 0.0)

    return {
        "profile": profile,
        "strategy": strategy,
        "seed": seed,
        "final_score": round(final_score, 4),
        "total_reward": round(total_reward, 2),
        "vitality": round(obs.vitality, 2),
        "cognition": round(obs.cognition, 2),
        "progress": round(obs.progress, 2),
        "serenity": round(obs.serenity, 2),
        "connection": round(obs.connection, 2),
        "actions": actions_taken,
    }


def model_action(obs, model, tokenizer) -> ActionType:
    """Get action from trained model."""
    from training.dataset import format_observation_prompt, SYSTEM_PROMPT
    from training.reward_functions import extract_action

    prompt = format_observation_prompt(obs)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        temperature=0.7,
        do_sample=True,
    )
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    action_type = extract_action(response)
    return action_type if action_type is not None else ActionType.SLEEP


def main():
    parser = argparse.ArgumentParser(description="Evaluate RhythmEnv agent")
    parser.add_argument("--model_path", type=str, default=None,
                        help="Path to trained model (skip for baseline only)")
    parser.add_argument("--num_episodes", type=int, default=5,
                        help="Episodes per profile per strategy")
    parser.add_argument("--output_file", type=str, default="eval_results.json",
                        help="Output file for results")
    args = parser.parse_args()

    model, tokenizer = None, None
    strategies = ["heuristic", "random"]

    if args.model_path and os.path.exists(args.model_path):
        try:
            from unsloth import FastLanguageModel
            model, tokenizer = FastLanguageModel.from_pretrained(
                model_name=args.model_path,
                load_in_4bit=True,
                max_seq_length=768,
            )
            FastLanguageModel.for_inference(model)
            strategies.append("model")
            print(f"Loaded trained model from: {args.model_path}")
        except Exception as e:
            print(f"Warning: Could not load model: {e}")
            print("Running baseline-only evaluation.")

    # ---------------------------------------------------------------
    # Run evaluations
    # ---------------------------------------------------------------
    all_results = []

    for strategy in strategies:
        print(f"\n{'=' * 50}")
        print(f"Strategy: {strategy.upper()}")
        print(f"{'=' * 50}")

        for profile in PROFILES:
            scores = []
            for ep in range(args.num_episodes):
                seed = ep
                result = run_episode(
                    profile=profile,
                    seed=seed,
                    strategy=strategy,
                    model=model,
                    tokenizer=tokenizer,
                )
                all_results.append(result)
                scores.append(result["final_score"])

            avg_score = sum(scores) / len(scores)
            print(f"  {profile:25s} avg_score={avg_score:.3f} scores={[round(s, 3) for s in scores]}")

    # ---------------------------------------------------------------
    # Summary table
    # ---------------------------------------------------------------
    print(f"\n{'=' * 70}")
    print("SUMMARY")
    print(f"{'=' * 70}")
    print(f"{'Profile':<25} {'Heuristic':>10} {'Random':>10}", end="")
    if "model" in strategies:
        print(f" {'Trained':>10}", end="")
    print()
    print("-" * 70)

    for profile in PROFILES:
        row = f"{profile:<25}"
        for strategy in strategies:
            profile_results = [r for r in all_results if r["profile"] == profile and r["strategy"] == strategy]
            avg = sum(r["final_score"] for r in profile_results) / len(profile_results)
            row += f" {avg:>10.3f}"
        print(row)

    # Save results
    with open(args.output_file, "w") as f:
        json.dump(all_results, f, indent=2)
    print(f"\nResults saved to: {args.output_file}")


if __name__ == "__main__":
    main()
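The `eval_results.json` file written above can be post-processed independently of the script's printed table. A minimal sketch of the same per-profile, per-strategy averaging, using hypothetical result dicts in the shape `run_episode()` returns (only the fields needed for aggregation are shown):

```python
from collections import defaultdict

# Hypothetical entries mimicking eval_results.json
results = [
    {"profile": "introvert_morning", "strategy": "heuristic", "final_score": 0.71},
    {"profile": "introvert_morning", "strategy": "heuristic", "final_score": 0.65},
    {"profile": "introvert_morning", "strategy": "random", "final_score": 0.40},
    {"profile": "workaholic_stoic", "strategy": "random", "final_score": 0.35},
]


def average_scores(results):
    """Average final_score per (profile, strategy) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in results:
        key = (r["profile"], r["strategy"])
        sums[key] += r["final_score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


averages = average_scores(results)
print(averages)
```

The same helper works on the real file after `json.load(open("eval_results.json"))`.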
training/reward_functions.py ADDED
@@ -0,0 +1,214 @@
"""
Reward functions for RhythmEnv GRPO training.

Three-layer reward stack:
1. format_valid - is the LLM output a parseable action name?
2. action_legal - is it one of the 10 valid actions?
3. env_reward   - actual reward from stepping the environment

Each function returns a list of floats (one per completion).
"""

import sys
import os
import re

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from models import ActionType, RhythmAction
from server.rhythm_environment import RhythmEnvironment

# All valid action names (uppercase for matching)
VALID_ACTIONS = {at.value.upper(): at for at in ActionType}


def extract_action(text: str) -> ActionType | None:
    """Parse LLM output into an ActionType. Returns None if unparseable."""
    if not text:
        return None

    # Clean up: take first line, strip whitespace, uppercase
    text = text.strip().split("\n")[0].strip().upper().replace(" ", "_")

    # Direct match
    if text in VALID_ACTIONS:
        return VALID_ACTIONS[text]

    # Partial match (e.g., "I choose DEEP_WORK" -> DEEP_WORK)
    for name, at in VALID_ACTIONS.items():
        if name in text:
            return at

    return None


def format_valid(completions, **kwargs) -> list[float]:
    """
    Layer 1: Does the LLM output contain a parseable action?

    +1.0 if a valid action is found, -2.0 if not.
    This teaches the model to output clean action names.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"] if isinstance(completion, list) else completion
        action = extract_action(response)
        scores.append(1.0 if action is not None else -2.0)
    return scores


def action_legal(completions, **kwargs) -> list[float]:
    """
    Layer 2: Is the parsed action one of the 10 valid actions?

    In this env, all 10 actions are always legal (no state-dependent validity).
    This is simpler than the workday scheduler, where START_TASK required an idle state.

    +0.5 if legal, -1.0 if not parseable.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"] if isinstance(completion, list) else completion
        action = extract_action(response)
        scores.append(0.5 if action is not None else -1.0)
    return scores


def env_reward(
    completions,
    prompts=None,
    seed=None,
    step_index=None,
    action_history=None,
    **kwargs,
) -> list[float]:
    """
    Layer 3: Actual environment reward from stepping with the chosen action.

    Replays the episode to the correct state, then steps with the model's action.
    Returns the raw env reward (scaled).

    This is the core training signal: the model learns which actions
    produce higher rewards for different states and hidden profiles.

    seed/step_index/action_history are passed as dataset columns by GRPOTrainer.
    """
    scores = []

    for i, completion in enumerate(completions):
        response = completion[0]["content"] if isinstance(completion, list) else completion
        action_type = extract_action(response)

        if action_type is None:
            scores.append(-3.0)
            continue

        # Prefer direct dataset column kwargs (passed by GRPOTrainer)
        if seed is not None and i < len(seed):
            ep_seed = seed[i]
            ep_history = action_history[i] if action_history is not None else []
        else:
            # Legacy: try to read from prompt dict (fallback for old-style prompts)
            prompt_data = prompts[i] if prompts and i < len(prompts) else None
            if prompt_data and isinstance(prompt_data, dict) and "seed" in prompt_data:
                ep_seed = prompt_data["seed"]
                ep_history = prompt_data.get("action_history", [])
            else:
                ep_seed = i % 50  # vary seed so training sees diverse states
                ep_history = []

        try:
            # Replay to the correct state
            env = RhythmEnvironment()
            env.reset(seed=ep_seed)

            for past_action_name in ep_history:
                past_action = RhythmAction(action_type=ActionType(past_action_name))
                env.step(past_action)

            # Step with the model's chosen action
            action = RhythmAction(action_type=action_type)
            obs = env.step(action)

            # Return the env reward (already in a reasonable range from REWARD_SCALE)
            scores.append(obs.reward)
        except Exception:
            scores.append(-3.0)

    return scores


def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
    """
    State-aware heuristic reward (no env replay).

    Parses meter values and time-of-day from the prompt text,
    then scores whether the chosen action addresses the most pressing need.
    Faster than env_reward but less precise.
    """
    scores = []
    for i, completion in enumerate(completions):
        response = completion[0]["content"] if isinstance(completion, list) else completion
        action_type = extract_action(response)

        if action_type is None:
            scores.append(-2.0)
            continue

        # Parse prompt for state info
        prompt_text = ""
        if prompts and i < len(prompts):
            p = prompts[i]
            if isinstance(p, list):
                prompt_text = p[-1].get("content", "") if p else ""
            elif isinstance(p, dict):
                prompt_text = p.get("content", "")
            else:
                prompt_text = str(p)

        # Extract meter values with regex
        meters = {}
        for meter_name in ["Vitality", "Cognition", "Progress", "Serenity", "Connection"]:
            match = re.search(rf"{meter_name}:\s*([\d.]+)", prompt_text)
            if match:
                meters[meter_name.lower()] = float(match.group(1))

        # Extract time slot
        is_morning = "Morning" in prompt_text
        is_night = "Night" in prompt_text

        score = 0.0

        v = meters.get("vitality", 0.5)
        s = meters.get("serenity", 0.5)
        cn = meters.get("connection", 0.5)

        # Reward actions that address critically low meters
        if v < 0.2 and action_type in (ActionType.SLEEP, ActionType.EXERCISE):
            score += 0.5
        elif v < 0.2:
            score -= 0.3

        if s < 0.2 and action_type in (ActionType.MEDITATE, ActionType.ME_TIME):
            score += 0.5
        elif s < 0.2 and action_type != ActionType.SLEEP:
            score -= 0.2

        if cn < 0.2 and action_type in (ActionType.FAMILY_TIME, ActionType.SOCIALIZE):
            score += 0.5
        elif cn < 0.2:
            score -= 0.1

        # Time-appropriate bonuses
        if is_morning and action_type in (ActionType.DEEP_WORK, ActionType.LEARN):
            score += 0.2
        if is_night and action_type == ActionType.SLEEP:
            score += 0.3

        # Penalize binge-watching
        if action_type == ActionType.BINGE_WATCH:
            score -= 0.3

        scores.append(score)

    return scores
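The parsing layer is the piece most worth testing in isolation, since every reward function funnels through it. A self-contained sketch of the same `extract_action` behavior, with a plain set of names standing in for the project's `ActionType` enum:

```python
# Stand-in for ActionType: the 10 action names as plain strings
VALID_ACTIONS = {
    "DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE",
    "MEDITATE", "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH",
}


def extract_action(text):
    """Parse raw LLM output into an action name; None if unparseable."""
    if not text:
        return None
    # First line only, normalized to UPPER_SNAKE naming
    text = text.strip().split("\n")[0].strip().upper().replace(" ", "_")
    if text in VALID_ACTIONS:
        return text
    # Substring fallback, e.g. "I choose SLEEP." still parses
    for name in VALID_ACTIONS:
        if name in text:
            return name
    return None


print(extract_action("deep work"))        # -> DEEP_WORK (normalized direct match)
print(extract_action("I choose SLEEP."))  # -> SLEEP (substring fallback)
print(extract_action("dance"))            # -> None (unparseable)
```

The same three cases map onto the reward tiers above: a clean name earns the format bonus, a verbose-but-parseable reply still gets an env reward, and an unparseable reply falls through to the -2.0/-3.0 penalties.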
training/train.py ADDED
@@ -0,0 +1,202 @@
"""
RhythmEnv GRPO Training Script

Trains an LLM agent to balance life meters using Group Relative Policy
Optimization (GRPO) with Unsloth for efficient fine-tuning.

Usage (Colab with GPU):
    !pip install unsloth transformers trl datasets
    !python training/train.py

Usage (local):
    python training/train.py --max_steps 100 --model_name "unsloth/Qwen2.5-3B-Instruct"
"""

import argparse
import json
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))


def main():
    parser = argparse.ArgumentParser(description="Train RhythmEnv agent with GRPO")
    parser.add_argument("--model_name", type=str, default="unsloth/Qwen2.5-3B-Instruct",
                        help="Model to fine-tune (HF model ID)")
    parser.add_argument("--max_steps", type=int, default=500,
                        help="Number of GRPO training steps")
    parser.add_argument("--num_episodes", type=int, default=200,
                        help="Number of episodes for dataset generation")
    parser.add_argument("--max_samples", type=int, default=2000,
                        help="Maximum training samples")
    parser.add_argument("--num_generations", type=int, default=4,
                        help="Number of completions per prompt for GRPO")
    parser.add_argument("--learning_rate", type=float, default=2e-4,
                        help="Learning rate")
    parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_trained",
                        help="Output directory for model and logs")
    parser.add_argument("--use_simple_reward", action="store_true",
                        help="Use simplified reward (no env replay, faster)")
    parser.add_argument("--report_to", type=str, default="none",
                        help="Logging: 'wandb', 'trackio', or 'none'")
    args = parser.parse_args()

    # ---------------------------------------------------------------
    # 1. Generate dataset
    # ---------------------------------------------------------------
    print("=" * 60)
    print("Step 1: Generating training dataset...")
    print("=" * 60)

    from dataset import generate_dataset
    from datasets import Dataset

    raw_samples = generate_dataset(
        num_episodes=args.num_episodes,
        strategy="mixed",
        max_samples=args.max_samples,
    )

    # Include replay metadata so env_reward can reconstruct the exact state
    dataset = Dataset.from_list([
        {
            "prompt": sample["prompt"],
            "seed": sample["seed"],
            "step_index": sample["step_index"],
            "action_history": sample["action_history"],
        }
        for sample in raw_samples
    ])
    print(f"Dataset size: {len(dataset)}")

    # ---------------------------------------------------------------
    # 2. Load model with Unsloth
    # ---------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Step 2: Loading model...")
    print("=" * 60)

    from unsloth import FastLanguageModel

    max_seq_length = 768
    lora_rank = 4

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=args.model_name,
        load_in_4bit=True,
        max_seq_length=max_seq_length,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=lora_rank,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_alpha=lora_rank * 2,
        use_gradient_checkpointing="unsloth",
        random_state=3407,
    )

    print(f"Model loaded: {args.model_name}")

    # ---------------------------------------------------------------
    # 3. Set up reward functions
    # ---------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Step 3: Setting up reward functions...")
    print("=" * 60)

    from reward_functions import format_valid, action_legal, env_reward, env_reward_simple

    if args.use_simple_reward:
        reward_funcs = [format_valid, action_legal, env_reward_simple]
        print("Using: format_valid + action_legal + env_reward_simple (fast mode)")
    else:
        reward_funcs = [format_valid, action_legal, env_reward]
        print("Using: format_valid + action_legal + env_reward (full replay)")

    # ---------------------------------------------------------------
    # 4. Configure GRPO trainer
    # ---------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Step 4: Configuring GRPO trainer...")
    print("=" * 60)

    from trl import GRPOConfig, GRPOTrainer

    max_prompt_length = 400
    max_completion_length = max_seq_length - max_prompt_length

    training_args = GRPOConfig(
        temperature=1.0,
        learning_rate=args.learning_rate,
        weight_decay=0.001,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        optim="adamw_8bit",
        logging_steps=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=args.num_generations,
        max_prompt_length=max_prompt_length,
        max_completion_length=max_completion_length,
        max_steps=args.max_steps,
        save_steps=100,
        report_to=args.report_to,
        output_dir=args.output_dir,
    )

    print(f"Max steps: {args.max_steps}")
    print(f"Num generations: {args.num_generations}")
    print(f"Learning rate: {args.learning_rate}")

    # ---------------------------------------------------------------
    # 5. Train
    # ---------------------------------------------------------------
    print("\n" + "=" * 60)
    print("Step 5: Starting GRPO training...")
    print("=" * 60)

    trainer = GRPOTrainer(
        model=model,
        processing_class=tokenizer,
        reward_funcs=reward_funcs,
        args=training_args,
        train_dataset=dataset,
+
)
|
| 170 |
+
|
| 171 |
+
trainer.train()
|
| 172 |
+
|
| 173 |
+
# ---------------------------------------------------------------
|
| 174 |
+
# 6. Save model
|
| 175 |
+
# ---------------------------------------------------------------
|
| 176 |
+
print("\n" + "=" * 60)
|
| 177 |
+
print("Step 6: Saving model...")
|
| 178 |
+
print("=" * 60)
|
| 179 |
+
|
| 180 |
+
# Save merged model (correct way β don't upcast 4-bit naively)
|
| 181 |
+
model.save_pretrained_merged(
|
| 182 |
+
args.output_dir,
|
| 183 |
+
tokenizer,
|
| 184 |
+
save_method="merged_16bit",
|
| 185 |
+
)
|
| 186 |
+
print(f"Model saved to: {args.output_dir}")
|
| 187 |
+
|
| 188 |
+
# Also save training config for reference
|
| 189 |
+
config_path = os.path.join(args.output_dir, "training_config.json")
|
| 190 |
+
with open(config_path, "w") as f:
|
| 191 |
+
json.dump(vars(args), f, indent=2)
|
| 192 |
+
print(f"Training config saved to: {config_path}")
|
| 193 |
+
|
| 194 |
+
print("\n" + "=" * 60)
|
| 195 |
+
print("Training complete!")
|
| 196 |
+
print(f"Model: {args.output_dir}")
|
| 197 |
+
print("Next: run inference_eval.py to compare baseline vs trained")
|
| 198 |
+
print("=" * 60)
|
| 199 |
+
|
| 200 |
+
|
| 201 |
+
if __name__ == "__main__":
|
| 202 |
+
main()
|
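The reward functions imported in Step 3 live in `reward_functions.py`, which is not shown in this chunk. Under TRL's GRPO interface, each reward function receives the batch of completions (plus any dataset columns as keyword arguments) and returns one float per completion. Below is a minimal sketch of a `format_valid`-style checker; the `<action>...</action>` tag format and the regex are assumptions for illustration, not the repo's actual implementation:

```python
import re

def format_valid_sketch(completions, **kwargs):
    """Score 1.0 for completions containing a well-formed <action> tag,
    else 0.0. Sketch of the TRL reward-function signature: returns one
    float per completion; dataset columns arrive via **kwargs."""
    rewards = []
    for completion in completions:
        # TRL passes either raw strings or chat-style message lists
        text = completion if isinstance(completion, str) else completion[0]["content"]
        rewards.append(1.0 if re.search(r"<action>\s*\w+\s*</action>", text) else 0.0)
    return rewards
```

By default `GRPOTrainer` sums the outputs of every function in `reward_funcs` to produce each completion's total reward, which is why format, legality, and environment-replay scores can be kept as separate functions.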
ui/app.py
ADDED
@@ -0,0 +1,384 @@
"""
RhythmEnv Visual Explorer — Life Simulator
Run: python ui/app.py
"""

import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import gradio as gr
from server.rhythm_environment import (
    RhythmEnvironment, MAX_STEPS, METERS, ACTION_EFFECTS, PROFILES
)
from models import RhythmAction, ActionType

SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
PROFILE_NAMES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic", "random"]
ACTION_NAMES = [at.value.upper() for at in ActionType]

# ---------------------------------------------------------------------------
# Display helpers
# ---------------------------------------------------------------------------

def _meter_bar(value: float, label: str, width: int = 20) -> str:
    filled = int(round(value * width))
    bar = "█" * filled + "░" * (width - filled)
    indicator = "🔴" if value < 0.2 else ("🟡" if value < 0.4 else "🟢")
    return f"{indicator} {label:<12} [{bar}] {value:.2f}"


def format_meters(obs) -> str:
    day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day + 1}"
    slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
    event_line = f"\n  ⚡ Event: {obs.active_event}" if obs.active_event else ""
    header = (
        f"📅 {day_name} {slot_name} | "
        f"Step {obs.timestep}/{MAX_STEPS} | "
        f"{obs.remaining_steps} steps left{event_line}"
    )
    bars = [
        _meter_bar(obs.vitality, "Vitality"),
        _meter_bar(obs.cognition, "Cognition"),
        _meter_bar(obs.progress, "Progress"),
        _meter_bar(obs.serenity, "Serenity"),
        _meter_bar(obs.connection, "Connection"),
    ]
    return header + "\n\n" + "\n".join(bars)


def format_reward_breakdown(breakdown: dict) -> str:
    if not breakdown:
        return "—"
    lines = []
    for k, v in breakdown.items():
        sign = "+" if v >= 0 else ""
        lines.append(f"  {k:<25} {sign}{v:.4f}")
    return "\n".join(lines)


def show_action_effects() -> str:
    header = f"{'Action':<15}" + "".join(f" {m[:3]:>6}" for m in METERS)
    lines = [header, "─" * 52]
    for action, effects in ACTION_EFFECTS.items():
        row = f"{action:<15}"
        for m in METERS:
            row += f" {effects[m]:>+6.2f}"
        lines.append(row)
    return "\n".join(lines)


def show_profiles() -> str:
    lines = []
    for p in PROFILES:
        name = p["name"]
        weights = p["reward_weights"]
        lines += [f"\n{'─'*52}", f"  {name.upper()}", f"{'─'*52}"]
        lines.append("  Reward weights (hidden from agent):")
        for m, w in weights.items():
            bar = "█" * int(w * 20)
            lines.append(f"    {m:<12} {bar:<20} {w:.0%}")
        lines.append("\n  Key hidden modifiers:")
        if p.get("morning_cognition_bonus"):
            lines.append(f"    • Morning: cognition/progress ×{p['morning_cognition_bonus']} (HV1 peak)")
        if p.get("evening_night_cognition_bonus"):
            lines.append(f"    • Evening/Night: cognition/progress ×{p['evening_night_cognition_bonus']} (HV1 peak)")
        if p.get("morning_penalty"):
            lines.append(f"    • Morning: cognition/progress ×{p['morning_penalty']} (HV1 penalty)")
        sv = p.get("social_vitality_multiplier", 1.0)
        if sv != 1.0:
            lines.append(f"    • Social vitality drain ×{sv} (HV2 social cost)")
        if p.get("binge_shame"):
            lines.append(f"    • Binge watch: shame spiral -0.15 serenity -0.06 cognition (HV3)")
        if p.get("progress_serenity_bonus"):
            lines.append(f"    • Work gives serenity +{p['progress_serenity_bonus']} (HV3 meaning)")
        if p.get("idle_serenity_decay"):
            lines.append(f"    • Idle drains serenity -{p['idle_serenity_decay']} (HV3 guilt)")
        if p.get("work_vitality_recovery"):
            lines.append(f"    • Work recovers vitality +{p['work_vitality_recovery']} (HV2 energized)")
        if p.get("solo_serenity_bonus"):
            lines.append(f"    • Solo time (me_time/meditate) gives serenity +{p['solo_serenity_bonus']} (HV3 recharge)")
        scm = p.get("social_connection_multiplier", 1.0)
        if scm != 1.0:
            lines.append(f"    • Social connection ×{scm} (HV2 amplifier)")
        lines.append(f"    • Connection passive decay: -{p['connection_decay_rate']}/step")
    return "\n".join(lines)

# ---------------------------------------------------------------------------
# Global env state (single-session)
# ---------------------------------------------------------------------------

_env = None
_last_obs = None
_step_log = []


def get_env():
    global _env
    if _env is None:
        _env = RhythmEnvironment()
    return _env


# ---------------------------------------------------------------------------
# Tab 1 — Configure & Reset
# ---------------------------------------------------------------------------

def reset_episode(profile_name: str, seed_str: str):
    global _last_obs, _step_log

    try:
        seed = int(seed_str.strip()) if seed_str.strip() else 42
    except ValueError:
        seed = 42

    env = get_env()

    if profile_name == "random":
        _last_obs = env.reset(seed=seed)
    else:
        _last_obs = env.reset(seed=seed, profile=profile_name)

    _step_log = []
    meters_text = format_meters(_last_obs)
    log = (
        f"Episode started.\n"
        f"  Profile : {env._profile['name']}\n"
        f"  Seed    : {seed}\n\n"
        "Choose an action and press Take Step, or use an auto-run button."
    )
    return meters_text, log, "—", "—", False

# ---------------------------------------------------------------------------
# Tab 2 — Run Episode
# ---------------------------------------------------------------------------

def take_action(action_str: str):
    global _last_obs, _step_log

    if _last_obs is None:
        return "⚠️ Go to Configure & Reset and start an episode first.", "—", "—", "—", False
    if _last_obs.done:
        return "Episode is done. Reset to start a new one.", "—", "—", "—", True

    env = get_env()
    action_type = ActionType(action_str.lower())
    obs = env.step(RhythmAction(action_type=action_type))
    _last_obs = obs

    sign = "+" if obs.reward >= 0 else ""
    step_line = (
        f"Step {obs.timestep:>2} | {action_str:<15} | "
        f"reward {sign}{obs.reward:.4f} | "
        f"V:{obs.vitality:.2f} Co:{obs.cognition:.2f} "
        f"P:{obs.progress:.2f} S:{obs.serenity:.2f} Cn:{obs.connection:.2f}"
    )
    if obs.active_event:
        step_line += f" ⚡{obs.active_event}"
    if obs.done:
        final = obs.reward_breakdown.get("final_score", 0.0)
        step_line += f"\n{'─'*72}\n✅ DONE — Final score: {final:.4f}"
    _step_log.append(step_line)

    score_text = (
        f"Final: {obs.reward_breakdown.get('final_score', 0.0):.4f}"
        if obs.done else f"Step reward: {obs.reward:+.4f}"
    )
    return (
        format_meters(obs),
        "\n".join(_step_log),
        format_reward_breakdown(obs.reward_breakdown),
        score_text,
        obs.done,
    )


def _run_auto_episode(profile_name: str, seed_str: str, strategy: str):
    """Shared runner for heuristic and random auto-episodes."""
    global _last_obs, _step_log
    import random as _random

    try:
        seed = int(seed_str.strip()) if seed_str.strip() else 42
    except ValueError:
        seed = 42

    from training.inference_eval import heuristic_action

    rng = _random.Random(seed + 999)
    all_actions = list(ActionType)

    env = get_env()
    obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
    _last_obs = obs
    _step_log = []
    logs = []
    total_reward = 0.0

    while not obs.done:
        if strategy == "heuristic":
            action_type = heuristic_action(obs)
        else:
            action_type = rng.choice(all_actions)
        obs = env.step(RhythmAction(action_type=action_type))
        _last_obs = obs
        total_reward += obs.reward
        sign = "+" if obs.reward >= 0 else ""
        line = (
            f"Step {obs.timestep:>2} | {action_type.value.upper():<15} | "
            f"reward {sign}{obs.reward:.4f} | "
            f"V:{obs.vitality:.2f} Co:{obs.cognition:.2f} "
            f"P:{obs.progress:.2f} S:{obs.serenity:.2f} Cn:{obs.connection:.2f}"
        )
        if obs.active_event:
            line += f" ⚡{obs.active_event}"
        logs.append(line)

    final = obs.reward_breakdown.get("final_score", 0.0)
    logs += [f"{'─'*72}", f"✅ DONE — Final score: {final:.4f} | Total reward: {total_reward:.2f}"]
    _step_log = logs

    return (
        format_meters(obs),
        "\n".join(logs),
        format_reward_breakdown(obs.reward_breakdown),
        f"Final: {final:.4f} | Total: {total_reward:.2f}",
    )


def run_heuristic_episode(profile_name, seed_str):
    return _run_auto_episode(profile_name, seed_str, "heuristic")


def run_random_episode(profile_name, seed_str):
    return _run_auto_episode(profile_name, seed_str, "random")

# ---------------------------------------------------------------------------
# Build UI
# ---------------------------------------------------------------------------

with gr.Blocks(title="RhythmEnv — Life Simulator", theme=gr.themes.Monochrome()) as demo:

    gr.Markdown(
        "# RhythmEnv — Life Simulator\n"
        "**Can an LLM discover your personality through experience?**\n\n"
        "Balance 5 life meters across a 7-day week. "
        "Hidden personality profiles secretly control how actions affect you — "
        "the agent must discover those patterns through reward signals alone."
    )

    with gr.Tabs():

        # ── Tab 1: Configure & Reset ──────────────────────────────────────
        with gr.TabItem("Configure & Reset"):
            gr.Markdown("### Start a new episode")

            with gr.Row():
                profile_dd = gr.Dropdown(
                    choices=PROFILE_NAMES,
                    value="introvert_morning",
                    label="Hidden Profile (normally the agent cannot see this — demo only)",
                )
                seed_in = gr.Textbox(label="Seed", value="42", scale=1)

            reset_btn = gr.Button("▶ Reset Episode", variant="primary")

            gr.Markdown("---")
            gr.Markdown(
                "| Profile | Core Trait | What agent must discover |\n"
                "|---|---|---|\n"
                "| `introvert_morning` | Recharges alone, peaks at dawn | "
                "Social drain is ×3; morning deep work gives ×2 progress |\n"
                "| `extrovert_night_owl` | Energised by people, peaks at night | "
                "Morning is a penalty zone; social gives ×2 connection |\n"
                "| `workaholic_stoic` | Finds meaning in output, resilient | "
                "Idle time drains serenity; work actually recovers vitality |"
            )

            meters_out = gr.Textbox(label="Current State", lines=9, interactive=False)
            log_out = gr.Textbox(label="Log", lines=5, interactive=False)
            breakdown_t1 = gr.Textbox(label="Last Reward Breakdown", lines=6, interactive=False)
            score_t1 = gr.Textbox(label="Score", interactive=False)
            done_flag = gr.Checkbox(visible=False)

            reset_btn.click(
                reset_episode,
                inputs=[profile_dd, seed_in],
                outputs=[meters_out, log_out, breakdown_t1, score_t1, done_flag],
            )

        # ── Tab 2: Run Episode ────────────────────────────────────────────
        with gr.TabItem("Run Episode"):
            gr.Markdown("### Manual control — or use the auto-run buttons for a full episode")

            with gr.Row():
                action_dd = gr.Dropdown(
                    choices=ACTION_NAMES,
                    value="DEEP_WORK",
                    label="Action",
                )
                step_btn = gr.Button("▶ Take Step", variant="primary", scale=1)

            with gr.Row():
                profile_dd2 = gr.Dropdown(
                    choices=PROFILE_NAMES,
                    value="introvert_morning",
                    label="Profile (for auto-run)",
                )
                seed_in2 = gr.Textbox(label="Seed (for auto-run)", value="42", scale=1)

            with gr.Row():
                heuristic_btn = gr.Button("▶▶ Full Episode — Heuristic Baseline")
                random_btn = gr.Button("▶▶ Full Episode — Random Baseline")

            meters_display = gr.Textbox(label="Meters", lines=9, interactive=False)
            score_display = gr.Textbox(label="Step Reward / Final Score", interactive=False)
            log_display = gr.Textbox(label="Step Log", lines=20, interactive=False)
            breakdown_display = gr.Textbox(label="Last Reward Breakdown", lines=8, interactive=False)

            step_btn.click(
                take_action,
                inputs=[action_dd],
                outputs=[meters_display, log_display, breakdown_display, score_display, done_flag],
            )
            heuristic_btn.click(
                run_heuristic_episode,
                inputs=[profile_dd2, seed_in2],
                outputs=[meters_display, log_display, breakdown_display, score_display],
            )
            random_btn.click(
                run_random_episode,
                inputs=[profile_dd2, seed_in2],
                outputs=[meters_display, log_display, breakdown_display, score_display],
            )

        # ── Tab 3: Profiles & Hidden Variables ───────────────────────────
        with gr.TabItem("Profiles & Hidden Variables"):
            gr.Markdown("### Action Effect Matrix")
            gr.Markdown(
                "Base delta per action on each meter. "
                "Profile modifiers are applied on top — invisibly."
            )
            gr.Textbox(
                value=show_action_effects(),
                lines=14, interactive=False, label="",
            )

            gr.Markdown("### Hidden Personality Profiles")
            gr.Markdown(
                "The agent **cannot see these** during play. "
                "It must infer the active profile through reward patterns — "
                "the core learning challenge of RhythmEnv."
            )
            gr.Textbox(
                value=show_profiles(),
                lines=55, interactive=False, label="",
            )


if __name__ == "__main__":
    demo.launch(server_port=7862, share=False)
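The profile inspector above only displays the hidden modifiers; the actual mechanics live in `server/rhythm_environment.py`, which is not part of this chunk. As a self-contained illustration of the composition pattern the UI describes (base action delta, then a hidden profile multiplier, then clamping meters to [0, 1]), with hypothetical numbers rather than the real `PROFILES` values:

```python
# Hypothetical base effects and profile, illustrating only the
# "base delta, then hidden multiplier, then clamp" pattern.
BASE_EFFECTS = {"deep_work": {"progress": 0.10, "vitality": -0.05}}
PROFILE = {"morning_cognition_bonus": 2.0}  # morning-peaking persona

def apply_action(meters, action, slot, profile):
    """Return updated meters after one action (sketch)."""
    new = dict(meters)
    for meter, delta in BASE_EFFECTS[action].items():
        if slot == 0 and delta > 0:  # slot 0 = Morning; boost gains only
            delta *= profile.get("morning_cognition_bonus", 1.0)
        new[meter] = min(1.0, max(0.0, new[meter] + delta))
    return new

print(apply_action({"progress": 0.5, "vitality": 0.5}, "deep_work", slot=0, profile=PROFILE))
```

The agent never observes the multiplier directly; it only sees that morning `deep_work` moves the progress meter twice as far for this persona, which is exactly the signal GRPO training has to exploit.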