---
title: RhythmEnv
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
tags:
  - openenv
---

# RhythmEnv — Meta-RL Life Simulator

An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space — the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.

This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life, it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.

## Submission links (for judges)

- **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
- **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
- **Blog post**: [BLOG.md](BLOG.md) — *Teaching an AI to Know You (Without Asking)*
- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
- **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
- **Detailed results**: [docs/results.md](docs/results.md)
- **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- **Training plots**: [plots/](plots/) (also embedded below)

## Headline result

A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:

| Condition | Random | Heuristic | **Distilled Qwen 3B** | **+ GRPO refine** | belief_MAE |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.213 |
| **continuous OOD (generalization)** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |

The student's **belief_MAE of 0.213 in-distribution nearly matches the gpt-5.4 teacher's 0.196** — the inference skill transferred almost intact via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, proving generalization (not memorization).

A subsequent GRPO refine on top of the SFT'd student lifted **OOD generalization by another +0.023 (4% relative)** and discrete-3 by +0.013, with no in-dist regression. The GRPO-refined model is at [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1). For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode re-eval.

Full numbers in [docs/results.md](docs/results.md). Eval JSONs: [SFT v3](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) · [SFT v3 + GRPO](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json).

![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)

![SFT v3 vs SFT+GRPO comparison](plots/sft_grpo_comparison.png)

## Training evidence

**SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.

![SFT v3 loss](plots/sft_v3_training_loss.png)

**Reward curve** — mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.
![Reward curve](plots/grpo_iter2_reward_curve.png)

**Reward components** — all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.

![Reward components](plots/grpo_iter2_reward_components.png)

**Belief-accuracy curve** — the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.

![Belief accuracy](plots/grpo_iter2_belief_accuracy.png)

Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).

## Why a Life Simulator?

Personal AI assistants give generic advice. They don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference — it personalizes in ~5 interactions instead of 50+.

Every sampled person has a hidden "DNA" — a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers — drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.

This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent plans across a full week while inferring a hidden personality from observation alone.

## Quick Start

```bash
pip install openenv-core
```

```python
import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=0)
        print(f"Vitality: {result.observation.vitality}")
        print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

        result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
        print(f"Reward: {result.reward}")

asyncio.run(main())
```

## The 5 Life Meters

All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.

| Meter | What It Represents | Increases With | Decreases With |
|-------|-------------------|----------------|----------------|
| **Vitality** | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
| **Cognition** | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
| **Progress** | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
| **Serenity** | Inner peace vs stress | Meditate, Me Time, Exercise | Deep Work, Admin |
| **Connection** | Relationship health | Family Time, Socialize | Passive decay every step |

**Key interactions**:

- Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
- Connection decays passively — you must actively maintain relationships
- Meters interact non-linearly: a crash in one often cascades to others
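A toy sketch of these interactions, for intuition only: the threshold, multiplier, and decay values below are made-up placeholders rather than the environment's actual constants, and the real update also applies the per-profile modifiers described later.

```python
# Toy illustration of the meter interactions above; values are placeholders, not env constants.
from dataclasses import dataclass

@dataclass
class Meters:
    vitality: float = 0.7
    cognition: float = 0.7
    progress: float = 0.5
    serenity: float = 0.6
    connection: float = 0.6

def apply_step(m: Meters, deltas: dict[str, float], connection_decay: float = 0.01) -> float:
    """Apply one action's meter deltas; returns any threshold penalty incurred."""
    # Low Vitality scales down every positive delta (the global effectiveness multiplier).
    effectiveness = 1.0 if m.vitality > 0.3 else 0.5      # placeholder threshold and factor
    for name, d in deltas.items():
        gain = d * effectiveness if d > 0 else d
        setattr(m, name, min(1.0, max(0.0, getattr(m, name) + gain)))
    # Connection decays passively every step, whatever action was taken.
    m.connection = max(0.0, m.connection - connection_decay)
    # Any meter below 0.1 triggers the heavy penalty (-0.30 in the reward section below).
    return -0.30 if min(vars(m).values()) < 0.1 else 0.0

m = Meters()
penalty = apply_step(m, {"progress": 0.08, "vitality": -0.05, "cognition": -0.04})  # e.g. DEEP_WORK
```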
## Action Space (10 Actions)

| Category | Action | Primary Effect |
|----------|--------|---------------|
| **Productivity** | `DEEP_WORK` | High Progress, drains Vitality + Cognition |
| | `ADMIN_WORK` | Moderate Progress, low drain |
| | `LEARN` | Progress + slight Cognition drain |
| **Recovery** | `SLEEP` | Strong Vitality + Cognition recovery |
| | `EXERCISE` | Vitality + Serenity boost |
| | `MEDITATE` | Strong Serenity + Cognition boost |
| **Social** | `FAMILY_TIME` | Strong Connection, costs Vitality |
| | `SOCIALIZE` | Connection + mild Serenity |
| **Leisure** | `ME_TIME` | Serenity + mild Cognition recovery |
| | `BINGE_WATCH` | Mild Serenity, drains Cognition (trap action) |

## Episode Structure

- **1 episode = 1 week** = 7 days × 4 slots/day = **28 steps**
- **Time slots**: Morning (0), Afternoon (1), Evening (2), Night (3)
- **Time-of-day effects**: Morning boosts cognitive gains (+20%), Night penalizes them (-40%)
- **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
- **Deterministic** given seed: same seed → same episode trajectory

## The Meta-Learning Setup (Core Innovation)

### What the Agent Sees Each Step

- All 5 meter values + per-meter deltas from the last action
- Current day, slot, timestep
- Active random event (if any)
- **Rolling 7-step history of (action, reward, deltas, *anomalies*)** — see below
- Total scalar reward

### The anomaly signal (the cleanest inference channel)

For every past step in the rolling history, the agent sees both the actual per-meter delta *and* a per-meter **anomaly** = `actual_delta − expected_delta_under_neutral_profile`. A neutral profile is the average person; the anomaly therefore tells the agent **how much this specific user's response deviated from the average user's response to the same action**. That deviation is what encodes the hidden personality.

Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`. Under a neutral profile the expected change is `-0.06`, so `vitality_anomaly = -0.12` — the user lost three times as much energy as the average person would. Strong evidence the hidden `social_vitality_multiplier` is high (introvert: socializing is costly). The agent should down-weight social actions and reach for solo recovery instead.

This is why a tiny model can learn meta-RL inference here: the env hands it a clean, well-typed deviation-from-baseline signal at every step. See [`models.py` (StepRecord)](models.py) for the exact field layout.
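A minimal sketch of how an agent can turn that signal into a belief update. The neutral-delta table, the scaling factor, and the function names are assumptions, not the env's API (the authoritative field layout is the `StepRecord` in [`models.py`](models.py)); the arithmetic mirrors the `SOCIALIZE` example above.

```python
# Illustrative sketch; the neutral-delta table, scaling, and function names are assumptions.
NEUTRAL_DELTAS = {("SOCIALIZE", "vitality"): -0.06}   # expected change for the "average person"

def anomaly(action: str, meter: str, actual_delta: float) -> float:
    """anomaly = actual_delta - expected_delta_under_neutral_profile."""
    return actual_delta - NEUTRAL_DELTAS.get((action, meter), 0.0)

def update_social_belief(belief_s: float, action: str, vitality_delta: float, lr: float = 0.2) -> float:
    """Nudge the S (social preference, 0-9) belief using the vitality anomaly."""
    if action not in ("SOCIALIZE", "FAMILY_TIME"):
        return belief_s
    a = anomaly(action, "vitality", vitality_delta)
    # Negative anomaly: socializing drains this person more than average, so belief in S goes down.
    return min(9.0, max(0.0, belief_s + lr * a * 50))  # the 50x scale is a placeholder

print(anomaly("SOCIALIZE", "vitality", -0.18))        # ~ -0.12, as in the example above
print(update_social_belief(5.0, "SOCIALIZE", -0.18))  # ~ 3.8: S belief moves toward "introvert"
```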
### What the Agent Does NOT See

- **The hidden personality vector** — sampled per episode, controls everything below
- **Reward weight decomposition** — same meter changes produce different rewards for different people
- **Action modifiers** — social drain, cognitive bonuses, shame spirals vary continuously

### Continuous Personality Space

Each `reset()` samples a fresh personality from parameter distributions:

| Parameter | Distribution | What it controls | Concrete intuition |
|---|---|---|---|
| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
| `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
| `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
| `stress_tolerance` | U(0.15, 0.30) | meter level where stress-spiral penalty fires | Low = falls apart easily. |
| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |

This produces an effectively infinite personality space — memorization is impossible, the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).
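For concreteness, a simplified sketch of what that sampling step plausibly looks like, written directly from the table. The presence probability for the two "or none" bonuses and the Dirichlet concentration are guesses; the authoritative version is `sample_profile` in [`server/rhythm_environment.py`](server/rhythm_environment.py).

```python
# Sketch reconstructed from the parameter table above; not the env's actual sample_profile.
import numpy as np

def sample_profile(rng: np.random.Generator) -> dict:
    return {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": bool(rng.random() < 0.5),
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
        # "or none" bonuses: present with some probability, otherwise absent (0.5 is a guess).
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
        # Which meter changes this person values; biased away from vit/cog, assuming the
        # meter order (vitality, cognition, progress, serenity, connection).
        "reward_weights": rng.dirichlet([1.0, 1.0, 2.0, 2.0, 2.0]),
    }

profile = sample_profile(np.random.default_rng(0))
```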
### Three reference profiles

The env exposes 3 named personalities as anchor points in the continuous space. Useful for tests and reproducible eval. Reach them via `profile=` on `reset()`:

- **Introvert Morning Person** → belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
- **Extrovert Night Owl** → belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
- **Workaholic Stoic** → belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`

### The Action+Belief Output Format

Each step the agent outputs a brief reasoning block followed by an answer line:

```
Last step's socialize gave V−0.12 (anomaly −0.06, much worse than neutral) — high social drain, suggests low S.
Morning DEEP_WORK earlier gave bonus cognition (+0.04) → high M.
With low S + high M, MEDITATE is the recovery play that fits.

2 8 5 MEDITATE
```

`S M W ACTION_NAME` is the contract. Three belief digits (0-9) representing the agent's current belief about the user:

- **S** = social preference (0=hates social, 9=loves social)
- **M** = morning preference (0=night owl, 9=morning person)
- **W** = work preference (0=avoids work, 9=workaholic)

Belief-first ordering matters: in causal-LM generation, tokens generated earlier condition tokens generated later, so the action is causally conditioned on the belief — making the belief functionally useful rather than a post-hoc afterthought.

The reasoning block isn't required for parseability (parser searches for the last `S M W ACTION` match), but the SFT-distilled student learns to emit it because the teacher did.
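A sketch of that parsing rule. The regex and helper name are illustrative rather than lifted from the repo, but they implement the "last `S M W ACTION` match wins" behaviour described above:

```python
# Illustrative parser for the "S M W ACTION_NAME" contract; not the repo's actual code.
import re

ACTIONS = {"DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE", "MEDITATE",
           "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}

ANSWER_RE = re.compile(r"\b([0-9])\s+([0-9])\s+([0-9])\s+([A-Z_]+)\b")

def parse_response(text: str):
    """Return (belief_S, belief_M, belief_W, action) from the LAST matching answer line."""
    matches = [m for m in ANSWER_RE.finditer(text) if m.group(4) in ACTIONS]
    if not matches:
        return None  # the format_valid reward layer would penalize this
    s, m, w, action = matches[-1].groups()
    return int(s), int(m), int(w), action

print(parse_response("reasoning about low S, high M...\n2 8 5 MEDITATE"))
# (2, 8, 5, 'MEDITATE')
```

Responses that fail to parse fall through to the `format_valid` penalty in the reward stack below.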
### The Discovery Challenge

The agent must:

1. **Probe** — try different actions in early steps to see how the person responds
2. **Infer** — update its belief vector each step based on observed rewards
3. **Adapt** — late in the episode, exploit the belief by choosing actions matching the inferred personality

## Reward Architecture (4-layer training stack)

| Layer | Function | Range | Purpose |
|---|---|---|---|
| 1 | `format_valid` | -2 to +1 | parseable as ACTION + 3 belief digits |
| 2 | `action_legal` | -1 to +0.5 | action is one of 10 valid types |
| 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
| 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs true profile vector |

**Per-step env reward** = `sum(meter_delta × hidden_weight) × 15` — weights are sampled per profile.

**Critical threshold**: any meter < 0.1 → -0.30 penalty.

**Final grader (v2 — measures inference, not just reflex)** — `final_score ∈ [0, 1]`:

```
score = 0.15 × crash_free
      + 0.20 × progress
      + 0.10 × connection
      + 0.25 × adaptation_score
      + 0.10 × efficiency
      + 0.20 × belief_accuracy
```

`belief_accuracy` is `1 − MAE` between the agent's last-emitted belief and the true profile vector. Heuristic / random baselines emit no belief and score 0 here by design — that's the point: the meta-RL skill is *inference*, and only agents that actually try get credit.

`adaptation_score` is the implicit signal: late-half mean reward minus early-half mean, gated by absolute late-half quality. Per-step rewards are profile-weighted, so a high late-half mean means the agent figured out the hidden weights and started exploiting them.
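Spelled out as a quick sketch: the weights come straight from the formula above, while the normalization of each component and the mapping of 0-9 belief digits onto the profile vector (scaled to [0, 1] here) are assumptions.

```python
# Final grader sketch; weights are from the formula above, normalization details are assumed.
def belief_accuracy(last_belief_digits: tuple[int, int, int], true_profile: tuple[float, float, float]) -> float:
    """1 - MAE between the agent's last emitted belief and the true profile vector."""
    belief = [d / 9.0 for d in last_belief_digits]        # assume digits 0-9 map onto [0, 1]
    mae = sum(abs(b - t) for b, t in zip(belief, true_profile)) / 3
    return 1.0 - mae

def final_score(crash_free, progress, connection, adaptation_score, efficiency, belief_acc) -> float:
    """v2 grader: each component is assumed to already be in [0, 1]."""
    return (0.15 * crash_free + 0.20 * progress + 0.10 * connection
            + 0.25 * adaptation_score + 0.10 * efficiency + 0.20 * belief_acc)

# An agent that emits no belief gets 0 on the last term (as the heuristic/random baselines do).
print(final_score(1.0, 0.6, 0.5, 0.4, 0.7, belief_accuracy((2, 8, 5), (0.0, 1.0, 0.07))))
```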
> **Why we changed the grader.** Five GRPO iterations under the v1 grader
> kept tying with heuristic. Reading the model's reasoning showed it was
> doing real inference — but inference didn't lift the score because the
> v1 grader didn't measure inference. Adding `belief_accuracy` (Δ +0.20
> weight) fixed the structural mismatch. See [`docs/iterations.md`](docs/iterations.md)
> for the full journey.

## Training: Algorithm Distillation

We train via [Algorithm Distillation](https://arxiv.org/abs/2210.14215) — a frontier teacher plays episodes, writes down its reasoning, and the student imitates the trajectories. Two stages:

**Stage 1 — Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+ episodes of RhythmEnv. At each step it outputs a reasoning block + `S M W ACTION` answer line. Each episode produces 28 (state, response) tuples. ~$3 per 30 episodes via Azure pay-as-you-go.

```bash
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
```

**Stage 2 — SFT prime.** Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is fine-tuned on the teacher's full trajectories. The student learns BOTH the output format and the reasoning pattern. ~25 min on an HF Jobs `a10g-large` (~$2-3).

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py
```

**Stage 3 — GRPO refine on top of SFT.** Run GRPO with the existing 4-layer reward stack starting from the SFT'd checkpoint (lr 1e-5, beta 0.1 KL anchor). This lifts OOD generalization by another **+0.023** and discrete-3 by +0.013 without regressing in-dist. The GRPO-refined model is uploaded to [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1). The bulk of the improvement still comes from SFT (Stage 2); GRPO refine is the polish.

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e MODEL_NAME=InosLihka/rhythm-env-meta-trained-sft-v3 \
    -e MAX_STEPS=200 -e LEARNING_RATE=1e-5 -e BETA=0.1 \
    -e MODEL_REPO_SUFFIX=sft-grpo-v1 \
    -d scripts/train_on_hf.py
```

### Why algorithm distillation, not GRPO from scratch

We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching recipes. They all matched heuristic but never beat it. The literature was unambiguous on why: small models (≤3B) need a teacher to bootstrap reasoning skills. Pure GRPO from scratch produces shallow, non-generalizing behavior at this scale — every successful 3B reasoning recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory distillation.

Once we knew that, the answer was: **use a strong teacher (gpt-5.4) we already have access to, distill its reasoning into Qwen, ship.** The `training/train.py` GRPO script is preserved for completeness and as the optional Stage 3, but it isn't on the critical path of the headline result.

See [`docs/iterations.md`](docs/iterations.md) for the full journey and what each GRPO iteration taught us.

## Reproducing the headline numbers

There are two reproduction paths depending on how much time and budget you have. Both produce the numbers in the *Headline result* table above.

### Fast path (~10–20 min, $0): re-run eval against the published checkpoint

This is the path most reviewers should take. The trained model is already on the Hub. Download it once, then run `inference_eval.py` (which expects a local path) against all three eval conditions.

```bash
# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
pip install -e .
export HF_TOKEN=...   # any read-scoped HF token; the model is public.

# 1. Snapshot the trained checkpoint locally.
hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
    --local-dir ./models/sft-v3

# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
python training/inference_eval.py \
    --model_path ./models/sft-v3 \
    --num_episodes 5 \
    --output_file eval_results_v2.json
```

Expected output: `eval_results_v2.json` whose per-condition averages match [the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) (distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist / continuous-OOD / discrete-3) within ±0.02 stochastic noise.

> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
> same thing as an HF Jobs run: it snapshots the model on a remote a10g/a100,
> runs `inference_eval.py`, and uploads the resulting JSON back to the
> model repo. See the docstring in that script for the submit command.

### Full path (~1.5 hr, ~$5–6 in API + GPU credits): retrain from scratch

```bash
# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3

# 2. Push teacher data to a HF dataset repo of your choice.
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo <your-username>/rhythm-env-teacher-trajectories

# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2–3).
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=<your-username>/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v3-repro \
    -d scripts/sft_on_hf.py

# 4. Eval the new checkpoint via the Fast Path above:
hf download <your-username>/rhythm-env-meta-trained-sft-v3-repro \
    --local-dir ./models/sft-v3-repro
python training/inference_eval.py \
    --model_path ./models/sft-v3-repro \
    --num_episodes 5 \
    --output_file eval_results_v2.json
```

Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py` command shown earlier in the *Training: Algorithm Distillation* section.

## Setup Instructions

### Local Development

```bash
cd rhythm_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t rhythm-env:latest .
docker run -p 8000:8000 rhythm-env:latest
```

### Running the Baseline

```bash
# Heuristic only (no API key needed):
python inference.py

# With LLM:
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
python inference.py
```

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=` (one of the 3 reference profiles). Default samples a fresh continuous profile. |
| `POST` | `/step` | Execute an action (`action_type`) |
| `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
| `GET` | `/health` | Health check |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action/observation JSON schemas |

## Why It Matters

This environment is a training ground for **truly personalized AI**. The product vision: wearables (HRV, sleep score) feed meter proxies, your calendar gets parsed into action types, and every Accept/Ignore tap on a recommendation is a reward signal. A small model trained in RhythmEnv already knows the *structure* of personality inference — so it personalizes to a real user in 5–10 interactions instead of 50+.

```
User installs app         →  wearables feed meter proxies
Calendar events           →  mapped to action types
Accept/Ignore taps        →  reward signal
Agent learns who you are  →  recommendations adapt
```

No setup. No personality quiz. The agent figures you out.

See [sim-to-real architecture](docs/references/sim_to_real_deployment.md) for the full deployment pipeline.

## Validation

```bash
# Step 1: HF Space live ✓ (returns HTTP 200 on /reset)
# Step 2: Docker build → docker build server/
# Step 3: openenv check ✓ openenv validate passes
./scripts/validate-submission.sh https://InosLihka-rhythm-env.hf.space .
```

## License

BSD 3-Clause License