---
title: RhythmEnv
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
tags:
  - openenv
---

# RhythmEnv: Meta-RL Life Simulator

An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space; the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.

This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life, it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.

## Submission links (for judges)

- **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
- **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
- **Blog post**: [BLOG.md](BLOG.md), *Teaching an AI to Know You (Without Asking)*
- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
- **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
- **Detailed results**: [docs/results.md](docs/results.md)
- **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- **Training plots**: [plots/](plots/) (also embedded below)

## Headline result

A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:

| Condition | Random | Heuristic | **Distilled Qwen 3B** | **+ GRPO refine** | belief_MAE |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.213 |
| **continuous OOD (generalization)** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |

The student's **belief_MAE of 0.213 in-distribution nearly matches the gpt-5.4 teacher's 0.196**: the inference skill transferred almost intact via SFT-prime. On OOD profiles the agent never saw, it still beats the heuristic by +0.081, evidence of generalization rather than memorization.

A subsequent GRPO refine on top of the SFT'd student lifted **OOD generalization by another +0.023 (4% relative)** and discrete-3 by +0.013, with no in-dist regression. The GRPO-refined model is at [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1).

For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode reeval. Full numbers in [docs/results.md](docs/results.md). Eval JSONs: [SFT v3](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) · [SFT v3 + GRPO](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json).

![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)

![SFT v3 vs SFT+GRPO comparison](plots/sft_grpo_comparison.png)

## Training evidence

**SFT v3 loss curve**: distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 to 0.083 over 525 steps and stays flat afterwards, with no sign of overfitting.

![SFT v3 loss](plots/sft_v3_training_loss.png)

**Reward curve**: mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.

![Reward curve](plots/grpo_iter2_reward_curve.png)

**Reward components**: all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.

![Reward components](plots/grpo_iter2_reward_components.png)

**Belief-accuracy curve**: the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.

![Belief accuracy](plots/grpo_iter2_belief_accuracy.png)

Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).

## Why a Life Simulator?

Personal AI assistants give generic advice. They don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference: it personalizes in ~5 interactions instead of 50+.

Every sampled person has a hidden "DNA", a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers, drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.

This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)**: the agent plans across a full week while inferring a hidden personality from observation alone.

## Quick Start

```bash
pip install openenv-core
```

```python
import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=0)
        print(f"Vitality: {result.observation.vitality}")
        print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

        result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
        print(f"Reward: {result.reward}")

asyncio.run(main())
```

## The 5 Life Meters

All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.

| Meter | What It Represents | Increases With | Decreases With |
|-------|-------------------|----------------|----------------|
| **Vitality** | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
| **Cognition** | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
| **Progress** | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
| **Serenity** | Inner peace vs stress | Meditate, Me Time, Exercise | Deep Work, Admin |
| **Connection** | Relationship health | Family Time, Socialize | Passive decay every step |

**Key interactions**:
- Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
- Connection decays passively: you must actively maintain relationships
- Meters interact non-linearly: a crash in one often cascades to others
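
To make the coupling concrete, here is a minimal sketch of how a step update *could* compose these rules. The function, constants, and multiplier shape are illustrative assumptions, not the env's actual implementation (that lives in [`server/rhythm_environment.py`](server/rhythm_environment.py)):

```python
def apply_step(meters: dict, base_deltas: dict, decay_rate: float = 0.01) -> dict:
    # Assumption: low Vitality scales down every POSITIVE delta (the global
    # multiplier mentioned above); the 0.5 floor is made up for illustration.
    vit_mult = 0.5 + 0.5 * meters["vitality"]
    out = dict(meters)
    for name, delta in base_deltas.items():
        out[name] += delta * vit_mult if delta > 0 else delta
    out["connection"] -= decay_rate  # passive Connection decay every step
    return {k: min(1.0, max(0.0, v)) for k, v in out.items()}  # clamp to [0, 1]
```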

## Action Space (10 Actions)

| Category | Action | Primary Effect |
|----------|--------|---------------|
| **Productivity** | `DEEP_WORK` | High Progress, drains Vitality + Cognition |
| | `ADMIN_WORK` | Moderate Progress, low drain |
| | `LEARN` | Progress + slight Cognition drain |
| **Recovery** | `SLEEP` | Strong Vitality + Cognition recovery |
| | `EXERCISE` | Vitality + Serenity boost |
| | `MEDITATE` | Strong Serenity + Cognition boost |
| **Social** | `FAMILY_TIME` | Strong Connection, costs Vitality |
| | `SOCIALIZE` | Connection + mild Serenity |
| **Leisure** | `ME_TIME` | Serenity + mild Cognition recovery |
| | `BINGE_WATCH` | Mild Serenity, drains Cognition (trap action) |

## Episode Structure

- **1 episode = 1 week** = 7 days × 4 slots/day = **28 steps**
- **Time slots**: Morning (0), Afternoon (1), Evening (2), Night (3)
- **Time-of-day effects**: Morning boosts cognitive gains (+20%), Night penalizes them (-40%)
- **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
- **Deterministic** given seed: same seed → same episode trajectory
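
As a sketch of the bookkeeping this implies (the slot constants and multiplier values come from the list above; the helper names are illustrative, not the env's API):

```python
MORNING, AFTERNOON, EVENING, NIGHT = 0, 1, 2, 3

def day_slot(t: int) -> tuple[int, int]:
    # 7 days x 4 slots/day = 28 steps; timestep t maps to (day, slot).
    return divmod(t, 4)

def cognition_multiplier(slot: int) -> float:
    # Morning boosts cognitive gains by +20%, Night penalizes them by -40%.
    return {MORNING: 1.2, NIGHT: 0.6}.get(slot, 1.0)
```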

## The Meta-Learning Setup (Core Innovation)

### What the Agent Sees Each Step
- All 5 meter values + per-meter deltas from the last action
- Current day, slot, timestep
- Active random event (if any)
- **Rolling 7-step history of (action, reward, deltas, *anomalies*)**; see below
- Total scalar reward

### The anomaly signal (the cleanest inference channel)

For every past step in the rolling history, the agent sees both the actual
per-meter delta *and* a per-meter **anomaly** = `actual_delta - expected_delta_under_neutral_profile`.

A neutral profile is the average person; the anomaly therefore tells the
agent **how much this specific user's response deviated from the average
user's response to the same action**. That deviation is what encodes the
hidden personality.

Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`.
Under a neutral profile the expected change is `-0.06`, so
`vitality_anomaly = -0.12`: the user lost three times as much energy as average.
Strong evidence the hidden `social_vitality_multiplier` is high
(introvert: socializing is costly). The agent should down-weight social
actions and reach for solo recovery instead.
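
A minimal sketch of the computation, assuming a lookup table of neutral-profile expectations (the table values here are illustrative; the real expected-delta model and field layout live in the env):

```python
# Hypothetical neutral-profile expectations, keyed by (action, meter).
NEUTRAL_EXPECTED = {("SOCIALIZE", "vitality"): -0.06}

def anomaly(action: str, meter: str, actual_delta: float) -> float:
    # anomaly = actual_delta - expected_delta_under_neutral_profile
    return actual_delta - NEUTRAL_EXPECTED.get((action, meter), 0.0)

print(anomaly("SOCIALIZE", "vitality", -0.18))  # -0.12: unusually draining
```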

This is why a tiny model can learn meta-RL inference here: the env hands it
a clean, well-typed deviation-from-baseline signal at every step. See
[`models.py` (StepRecord)](models.py) for the exact field layout.

### What the Agent Does NOT See
- **The hidden personality vector**: sampled per episode, controls everything below
- **Reward weight decomposition**: same meter changes produce different rewards for different people
- **Action modifiers**: social drain, cognitive bonuses, shame spirals vary continuously

### Continuous Personality Space

Each `reset()` samples a fresh personality from parameter distributions:

| Parameter | Distribution | What it controls | Concrete intuition |
|---|---|---|---|
| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
| `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
| `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
| `stress_tolerance` | U(0.15, 0.30) | meter level where stress-spiral penalty fires | Low = falls apart easily. |
| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |

This produces an effectively infinite personality space: memorization is impossible, so the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).
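
As a rough sketch of what sampling from this table could look like (parameter names follow the table; the coin flip on the "or none" bonuses and the Dirichlet construction are assumptions, and `sample_profile` in the repo is authoritative):

```python
import random

def sample_profile(rng: random.Random) -> dict:
    profile = {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        # "or none" in the table, modeled here as a 50/50 coin flip (assumed).
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else 0.0,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else 0.0,
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "binge_shame": rng.random() < 0.5,  # Bernoulli(0.5)
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "stress_tolerance": rng.uniform(0.15, 0.30),
    }
    # Dirichlet-style reward weights over the 5 meters via normalized gammas.
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(5)]
    profile["reward_weights"] = [x / sum(raw) for x in raw]
    return profile
```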

### Three reference profiles

The env exposes 3 named personalities as anchor points in the continuous space.
Useful for tests and reproducible eval. Reach them via `profile=<name>` on `reset()`:

- **Introvert Morning Person** → belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
- **Extrovert Night Owl** → belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
- **Workaholic Stoic** → belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`
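
For example (the profile-name string below is a guess; check `GET /metadata` or the server code for the canonical names):

```python
import asyncio
from rhythm_env import RhythmEnv

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        # "introvert_morning_person" is a hypothetical name, not verified.
        result = await env.reset(seed=0, profile="introvert_morning_person")
        print(result.observation.day, result.observation.slot)

asyncio.run(main())
```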

### The Action+Belief Output Format

Each step the agent outputs a brief reasoning block followed by an answer line:

```
<reasoning>
Last step's socialize gave V-0.12 (anomaly -0.06, much worse than neutral):
high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus
cognition (+0.04) → high M. With low S + high M, MEDITATE is the recovery
play that fits.
</reasoning>
2 8 5 MEDITATE
```

`S M W ACTION_NAME` is the contract. The three belief digits (0-9) encode
the agent's current belief about the user:
- **S** = social preference (0=hates social, 9=loves social)
- **M** = morning preference (0=night owl, 9=morning person)
- **W** = work preference (0=avoids work, 9=workaholic)

Belief-first ordering matters: in causal-LM generation, tokens generated
earlier condition tokens generated later, so the action is causally
conditioned on the belief, making the belief functionally useful rather
than a post-hoc afterthought. The reasoning block isn't required for
parseability (parser searches for the last `S M W ACTION` match), but the
SFT-distilled student learns to emit it because the teacher did.
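
A sketch of the kind of parser this implies, taking the last `S M W ACTION` match so the reasoning block can't break parsing (the repo's actual parser may differ in details):

```python
import re

ANSWER_RE = re.compile(r"\b([0-9])\s+([0-9])\s+([0-9])\s+([A-Z_]+)\b")

def parse_response(text: str):
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None  # the format_valid reward layer penalizes this case
    s, m, w, action = matches[-1]  # last match wins
    return {"belief": (int(s), int(m), int(w)), "action": action}

parse_response("<reasoning>low S, high M</reasoning>\n2 8 5 MEDITATE")
# -> {"belief": (2, 8, 5), "action": "MEDITATE"}
```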

### The Discovery Challenge

The agent must:
1. **Probe**: try different actions in early steps to see how the person responds
2. **Infer**: update its belief vector each step based on observed rewards
3. **Adapt**: late in the episode, exploit the belief by choosing actions matching the inferred personality

## Reward Architecture (4-layer training stack)

| Layer | Function | Range | Purpose |
|---|---|---|---|
| 1 | `format_valid` | -2 to +1 | parseable as ACTION + 3 belief digits |
| 2 | `action_legal` | -1 to +0.5 | action is one of 10 valid types |
| 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
| 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs true profile vector |

**Per-step env reward** = `sum(meter_delta × hidden_weight) × 15`, where the weights are sampled per profile.

**Critical threshold**: any meter < 0.1 → -0.30 penalty.
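
Putting the two rules together, per-step reward has roughly this shape (the 15× scale and -0.30 penalty are from the text above; everything else is an illustrative sketch):

```python
def step_reward(deltas: dict[str, float],
                hidden_weights: dict[str, float],
                meters: dict[str, float]) -> float:
    # Profile-weighted sum of meter deltas, scaled by 15.
    reward = 15.0 * sum(deltas[m] * hidden_weights[m] for m in deltas)
    # Critical threshold: any meter below 0.1 draws a -0.30 penalty.
    if any(v < 0.1 for v in meters.values()):
        reward -= 0.30
    return reward
```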

**Final grader (v2: measures inference, not just reflex)** computes `final_score ∈ [0, 1]`:
```
score = 0.15 × crash_free + 0.20 × progress + 0.10 × connection
      + 0.25 × adaptation_score + 0.10 × efficiency + 0.20 × belief_accuracy
```

`belief_accuracy` is `1 - MAE` between the agent's last-emitted belief and
the true profile vector. Heuristic / random baselines emit no belief and
score 0 here by design; that's the point: the meta-RL skill is *inference*,
and only agents that actually try get credit.

`adaptation_score` is the implicit signal: late-half mean reward minus
early-half mean, gated by absolute late-half quality. Per-step rewards are
profile-weighted, so a high late-half mean means the agent figured out the
hidden weights and started exploiting them.
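
A sketch of how these two terms might be computed (the clamping and the late-half quality gate threshold are assumptions; the real grader lives in the env server):

```python
def belief_accuracy(last_belief: list[float], true_vec: list[float]) -> float:
    # 1 - MAE between the last emitted belief and the true profile vector.
    mae = sum(abs(a - b) for a, b in zip(last_belief, true_vec)) / len(true_vec)
    return max(0.0, 1.0 - mae)

def adaptation_score(rewards: list[float]) -> float:
    # Late-half mean minus early-half mean, gated by late-half quality.
    # The gate (late_mean > 0) is an assumed stand-in for the real threshold.
    half = len(rewards) // 2
    early, late = rewards[:half], rewards[half:]
    late_mean = sum(late) / len(late)
    if late_mean <= 0.0:
        return 0.0
    return max(0.0, late_mean - sum(early) / len(early))
```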

> **Why we changed the grader.** Five GRPO iterations under the v1 grader
> kept tying with heuristic. Reading the model's reasoning showed it was
> doing real inference, but inference didn't lift the score because the
> v1 grader didn't measure inference. Adding `belief_accuracy` (+0.20
> weight) fixed the structural mismatch. See [`docs/iterations.md`](docs/iterations.md)
> for the full journey.

## Training: Algorithm Distillation

We train via [Algorithm Distillation](https://arxiv.org/abs/2210.14215): a
frontier teacher plays episodes, writes down its reasoning, and the student
imitates the trajectories. Two core stages, plus an optional GRPO refine:

**Stage 1: Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+ episodes
of RhythmEnv. At each step it outputs a `<reasoning>` block + `S M W ACTION`
answer line. Each episode produces 28 (state, response) tuples. ~$3 per 30
episodes via Azure pay-as-you-go.

```bash
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
```

**Stage 2: SFT prime.** Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is
fine-tuned on the teacher's full trajectories. The student learns BOTH the
output format and the reasoning pattern. ~25 min on a HF Jobs `a10g-large`
(~$2-3).

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py
```

**Stage 3 (optional): GRPO refine on top of SFT.** Run GRPO with the existing 4-layer
reward stack starting from the SFT'd checkpoint (lr 1e-5, beta 0.1 KL anchor).
This lifts OOD generalization by another **+0.023** and discrete-3 by +0.013
without regressing in-dist. The GRPO-refined model is uploaded to
[`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1).
The bulk of the improvement still comes from SFT (Stage 2); GRPO refine is
the polish.

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e MODEL_NAME=InosLihka/rhythm-env-meta-trained-sft-v3 \
    -e MAX_STEPS=200 -e LEARNING_RATE=1e-5 -e BETA=0.1 \
    -e MODEL_REPO_SUFFIX=sft-grpo-v1 \
    -d scripts/train_on_hf.py
```

### Why algorithm distillation, not GRPO from scratch

We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching
recipes. They all matched heuristic but never beat it.

The literature was unambiguous on why: small models (≤3B) need a teacher to
bootstrap reasoning skills. Pure GRPO from scratch produces shallow,
non-generalizing behavior at this scale; every successful 3B reasoning
recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory
distillation.

Once we knew that, the answer was: **use a strong teacher (gpt-5.4) we
already have access to, distill its reasoning into Qwen, ship.**

The `training/train.py` GRPO script is preserved for completeness and as
the optional Stage 3, but it isn't on the critical path of the headline
result. See [`docs/iterations.md`](docs/iterations.md) for the full journey
and what each GRPO iteration taught us.

## Reproducing the headline numbers

There are two reproduction paths depending on how much time and budget you
have. Both produce the numbers in the *Headline result* table above.

### Fast path (~10-20 min, $0): re-run eval against the published checkpoint

This is the path most reviewers should take. The trained model is already
on the Hub. Download it once, then run `inference_eval.py` (which expects a
local path) against all three eval conditions.

```bash
# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
pip install -e .
export HF_TOKEN=...   # any read-scoped HF token; the model is public.

# 1. Snapshot the trained checkpoint locally.
hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
    --local-dir ./models/sft-v3

# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
python training/inference_eval.py \
    --model_path ./models/sft-v3 \
    --num_episodes 5 \
    --output_file eval_results_v2.json
```

Expected output: `eval_results_v2.json` whose per-condition averages match
[the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
(distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist /
continuous-OOD / discrete-3) within ±0.02 stochastic noise.

> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
> same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100,
> runs `inference_eval.py`, and uploads the resulting JSON back to the
> model repo. See the docstring in that script for the submit command.

### Full path (~1.5 hr, ~$5-6 in API + GPU credits): retrain from scratch

```bash
# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3

# 2. Push teacher data to a HF dataset repo of your choice.
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo <your_user>/rhythm-env-teacher-trajectories

# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2-3).
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v3-repro \
    -d scripts/sft_on_hf.py

# 4. Eval the new checkpoint via the Fast Path above:
hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
    --local-dir ./models/sft-v3-repro
python training/inference_eval.py \
    --model_path ./models/sft-v3-repro \
    --num_episodes 5 \
    --output_file eval_results_v2.json
```

Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py`
command shown earlier in the *Training: Algorithm Distillation* section.

## Setup Instructions

### Local Development

```bash
cd rhythm_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t rhythm-env:latest .
docker run -p 8000:8000 rhythm-env:latest
```

### Running the Baseline

```bash
# Heuristic only (no API key needed):
python inference.py

# With LLM:
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
python inference.py
```

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (one of the 3 reference profiles). Default samples a fresh continuous profile. |
| `POST` | `/step` | Execute an action (`action_type`) |
| `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
| `GET` | `/health` | Health check |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action/observation JSON schemas |
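
For example, hitting the HTTP API directly (the payload field names below are inferred from the table; `GET /schema` has the authoritative shapes):

```python
import requests

BASE = "https://InosLihka-rhythm-env.hf.space"

# Start an episode, take one action, and check server health.
obs = requests.post(f"{BASE}/reset", json={"seed": 0}).json()
step = requests.post(f"{BASE}/step", json={"action_type": "DEEP_WORK"}).json()
print(requests.get(f"{BASE}/health").json())
```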

## Why It Matters

This environment is a training ground for **truly personalized AI**. The product vision: wearables (HRV, sleep score) feed meter proxies, your calendar gets parsed into action types, and every Accept/Ignore tap on a recommendation is a reward signal. A small model trained in RhythmEnv already knows the *structure* of personality inference, so it personalizes to a real user in 5-10 interactions instead of 50+.

```
User installs app → wearables feed meter proxies
Calendar events → mapped to action types
Accept/Ignore taps → reward signal
Agent learns who you are → recommendations adapt
```

No setup. No personality quiz. The agent figures you out. See [sim-to-real architecture](docs/references/sim_to_real_deployment.md) for the full deployment pipeline.

## Validation

```bash
# Step 1: HF Space live  ✓  (returns HTTP 200 on /reset)
# Step 2: Docker build   →  docker build server/
# Step 3: openenv check  ✓  openenv validate passes

./scripts/validate-submission.sh https://InosLihka-rhythm-env.hf.space .
```

## License

BSD 3-Clause License