---
title: RhythmEnv
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
tags:
- openenv
---
# RhythmEnv: Meta-RL Life Simulator
An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space; the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.
This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life, it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.
## Submission links (for judges)
- **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
- **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
- **Blog post**: [BLOG.md](BLOG.md) – *Teaching an AI to Know You (Without Asking)*
- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
- **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
- **Detailed results**: [docs/results.md](docs/results.md)
- **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- **Training plots**: [plots/](plots/) (also embedded below)
## Headline result
A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:
| Condition | Random | Heuristic | **Distilled Qwen 3B** | **+ GRPO refine** | belief_MAE (student) |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.213 |
| **continuous OOD (generalization)** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |
The student's **belief_MAE of 0.213 in-distribution is close to the gpt-5.4 teacher's 0.196**: the inference skill transferred nearly intact via SFT-prime. On OOD profiles the agent never saw, it still beats the heuristic by +0.081, showing generalization rather than memorization.
A subsequent GRPO refine on top of the SFT'd student lifted **OOD generalization by another +0.023 (4% relative)** and discrete-3 by +0.013, with no in-dist regression. The GRPO-refined model is at [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1).
For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode re-eval. Full numbers in [docs/results.md](docs/results.md). Eval JSONs: [SFT v3](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) · [SFT v3 + GRPO](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json).
![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)
![SFT v3 vs SFT+GRPO comparison](plots/sft_grpo_comparison.png)
## Training evidence
**SFT v3 loss curve**: distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 to 0.083 over 525 steps and stays converged, with no sign of overfitting.
![SFT v3 loss](plots/sft_v3_training_loss.png)
**Reward curve**: mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.
![Reward curve](plots/grpo_iter2_reward_curve.png)
**Reward components**: all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.
![Reward components](plots/grpo_iter2_reward_components.png)
**Belief-accuracy curve**: the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.
![Belief accuracy](plots/grpo_iter2_belief_accuracy.png)
Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
## Why a Life Simulator?
Personal AI assistants give generic advice. They don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference; it personalizes in ~5 interactions instead of 50+.
Every sampled person has a hidden "DNA": a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers, drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.
This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)**: the agent plans across a full week while inferring a hidden personality from observation alone.
## Quick Start
```bash
pip install openenv-core
```
```python
import asyncio

from rhythm_env import RhythmEnv, RhythmAction, ActionType

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=0)
        print(f"Vitality: {result.observation.vitality}")
        print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

        result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
        print(f"Reward: {result.reward}")

asyncio.run(main())
```
## The 5 Life Meters
All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.
| Meter | What It Represents | Increases With | Decreases With |
|-------|-------------------|----------------|----------------|
| **Vitality** | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
| **Cognition** | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
| **Progress** | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
| **Serenity** | Inner peace vs stress | Meditate, Me Time, Exercise | Deep Work, Admin |
| **Connection** | Relationship health | Family Time, Socialize | Passive decay every step |
**Key interactions** (sketched in code below):
- Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
- Connection decays passively; you must actively maintain relationships
- Meters interact non-linearly: a crash in one often cascades to others
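The low-Vitality multiplier and passive Connection decay compose roughly as in the sketch below. This is a minimal illustration under assumed constants and an assumed multiplier shape (`apply_step`, the 0.5 knee, and the default decay are all hypothetical); the real dynamics live in [`server/rhythm_environment.py`](server/rhythm_environment.py).

```python
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

def apply_step(meters: dict, deltas: dict, connection_decay: float = 0.01) -> dict:
    # Low Vitality scales down every POSITIVE delta (the global multiplier);
    # the 0.5 knee here is an illustrative guess, not the env's constant.
    vitality_factor = min(1.0, meters["vitality"] / 0.5)
    out = {}
    for name in METERS:
        d = deltas.get(name, 0.0)
        if d > 0:
            d *= vitality_factor
        out[name] = min(1.0, max(0.0, meters[name] + d))
    # Connection decays passively every step and must be actively maintained.
    out["connection"] = max(0.0, out["connection"] - connection_decay)
    return out
```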
## Action Space (10 Actions)
| Category | Action | Primary Effect |
|----------|--------|---------------|
| **Productivity** | `DEEP_WORK` | High Progress, drains Vitality + Cognition |
| | `ADMIN_WORK` | Moderate Progress, low drain |
| | `LEARN` | Progress + slight Cognition drain |
| **Recovery** | `SLEEP` | Strong Vitality + Cognition recovery |
| | `EXERCISE` | Vitality + Serenity boost |
| | `MEDITATE` | Strong Serenity + Cognition boost |
| **Social** | `FAMILY_TIME` | Strong Connection, costs Vitality |
| | `SOCIALIZE` | Connection + mild Serenity |
| **Leisure** | `ME_TIME` | Serenity + mild Cognition recovery |
| | `BINGE_WATCH` | Mild Serenity, drains Cognition (trap action) |
## Episode Structure
- **1 episode = 1 week** = 7 days × 4 slots/day = **28 steps**
- **Time slots**: Morning (0), Afternoon (1), Evening (2), Night (3)
- **Time-of-day effects**: Morning boosts cognitive gains (+20%), Night penalizes them (-40%); see the sketch after this list
- **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
- **Deterministic** given seed: same seed → same episode trajectory
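For orientation, the timestep-to-(day, slot) mapping and the time-of-day effect can be written out directly. A small sketch, assuming the ±20/40% effects above apply as multiplicative factors on cognitive gains (function names are illustrative):

```python
MORNING, AFTERNOON, EVENING, NIGHT = 0, 1, 2, 3

def day_and_slot(t: int) -> tuple[int, int]:
    # 28 steps per episode: t in [0, 27] maps to 7 days x 4 slots.
    return t // 4, t % 4

def cognitive_multiplier(slot: int) -> float:
    # Morning boosts cognitive gains (+20%), Night penalizes them (-40%).
    return {MORNING: 1.2, NIGHT: 0.6}.get(slot, 1.0)
```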
## The Meta-Learning Setup (Core Innovation)
### What the Agent Sees Each Step
- All 5 meter values + per-meter deltas from the last action
- Current day, slot, timestep
- Active random event (if any)
- **Rolling 7-step history of (action, reward, deltas, *anomalies*)**; see below
- Total scalar reward
### The anomaly signal (the cleanest inference channel)
For every past step in the rolling history, the agent sees both the actual
per-meter delta *and* a per-meter **anomaly** = `actual_delta - expected_delta_under_neutral_profile`.
A neutral profile is the average person; the anomaly therefore tells the
agent **how much this specific user's response deviated from the average
user's response to the same action**. That deviation is what encodes the
hidden personality.
Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`.
Under a neutral profile the expected change is `-0.06`, so
`vitality_anomaly = -0.12`: the user lost three times as much energy as the
average person would. That is strong evidence the hidden
`social_vitality_multiplier` is high (introvert: socializing is costly), so
the agent should down-weight social actions and reach for solo recovery instead.
This is why a tiny model can learn meta-RL inference here: the env hands it
a clean, well-typed deviation-from-baseline signal at every step. See
[`models.py` (StepRecord)](models.py) for the exact field layout.
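In code, the anomaly is just a per-meter subtraction against the neutral-profile expectation. A minimal sketch (function and argument names are illustrative; the real field layout is `StepRecord` in [`models.py`](models.py)):

```python
def anomalies(actual: dict, neutral: dict) -> dict:
    # anomaly = actual_delta - expected_delta_under_neutral_profile, per meter
    return {m: actual.get(m, 0.0) - neutral.get(m, 0.0) for m in neutral}

# The worked example above: SOCIALIZE drains this user far more than average.
print(anomalies({"vitality": -0.18}, {"vitality": -0.06}))  # {'vitality': -0.12}
```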
### What the Agent Does NOT See
- **The hidden personality vector**: sampled per episode; it controls everything below
- **Reward weight decomposition**: the same meter changes produce different rewards for different people
- **Action modifiers**: social drain, cognitive bonuses, and shame spirals vary continuously
### Continuous Personality Space
Each `reset()` samples a fresh personality from parameter distributions:
| Parameter | Distribution | What it controls | Concrete intuition |
|---|---|---|---|
| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
| `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
| `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
| `stress_tolerance` | U(0.15, 0.30) | meter level where stress-spiral penalty fires | Low = falls apart easily. |
| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |
This produces an effectively infinite personality space: memorization is impossible, so the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).
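A condensed sketch of what that sampling could look like, using the distributions from the table. The Dirichlet concentration, the coin flips behind the two "or none" bonuses, and the field names are assumptions; the authoritative version is `sample_profile` in [`server/rhythm_environment.py`](server/rhythm_environment.py).

```python
import numpy as np

def sample_profile(rng: np.random.Generator) -> dict:
    # Every reset() draws a fresh hidden personality from these distributions.
    return {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else 0.0,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else 0.0,
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": bool(rng.random() < 0.5),  # Bernoulli(0.5)
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
        # Dirichlet over the 5 meters, biased away from vitality/cognition;
        # these concentration values are a guess at "biased to non-vit/cog".
        "reward_weights": rng.dirichlet([0.5, 0.5, 1.5, 1.5, 1.5]),
    }

profile = sample_profile(np.random.default_rng(seed=0))
```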
### Three reference profiles
The env exposes 3 named personalities as anchor points in the continuous space.
They are useful for tests and reproducible eval. Reach them via `profile=<name>`
on `reset()` (usage sketch after the list):
- **Introvert Morning Person**: belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
- **Extrovert Night Owl**: belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
- **Workaholic Stoic**: belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`
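Inside the async client from Quick Start, selecting an anchor profile might look like this. The exact identifier strings are defined server-side, so the name below is an assumption; `/metadata` or the server source has the accepted values.

```python
# Hypothetical profile name -- check the server for the exact accepted strings.
result = await env.reset(seed=0, profile="introvert_morning_person")
```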
### The Action+Belief Output Format
Each step the agent outputs a brief reasoning block followed by an answer line:
```
<reasoning>
Last step's socialize gave V -0.12 (anomaly -0.06, much worse than neutral):
high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus
cognition (+0.04), so high M. With low S and high M, MEDITATE is the recovery
play that fits.
</reasoning>
2 8 5 MEDITATE
```
`S M W ACTION_NAME` is the contract. The three belief digits (0-9) encode
the agent's current belief about the user:
- **S** = social preference (0=hates social, 9=loves social)
- **M** = morning preference (0=night owl, 9=morning person)
- **W** = work preference (0=avoids work, 9=workaholic)
Belief-first ordering matters: in causal-LM generation, tokens generated
earlier condition tokens generated later, so the action is causally
conditioned on the belief, making the belief functionally useful rather
than a post-hoc afterthought. The reasoning block isn't required for
parseability (the parser searches for the last `S M W ACTION` match), but the
SFT-distilled student learns to emit it because the teacher did.
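A minimal sketch of such a last-match parser (the regex and the digit-to-`[0, 1]` normalization are illustrative, not the repo's exact implementation):

```python
import re

ANSWER_RE = re.compile(r"\b([0-9])\s+([0-9])\s+([0-9])\s+([A-Z_]+)\b")

def parse_response(text: str):
    # Take the LAST "S M W ACTION" match so the <reasoning> block can't confuse parsing.
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None  # upstream, this is where format_valid goes negative
    s, m, w, action = matches[-1]
    # Digits 0-9 are assumed to map linearly onto a belief vector in [0, 1].
    belief = (int(s) / 9, int(m) / 9, int(w) / 9)
    return belief, action

print(parse_response("<reasoning>low S, high M</reasoning>\n2 8 5 MEDITATE"))
```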
### The Discovery Challenge
The agent must:
1. **Probe**: try different actions in early steps to see how the person responds
2. **Infer**: update its belief vector each step based on observed rewards
3. **Adapt**: late in the episode, exploit the belief by choosing actions matching the inferred personality (a loop sketch follows below)
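As a sketch, the probe-then-exploit structure over a 28-step episode could look like this. The `policy` object and its three methods are assumed interfaces for illustration, not part of the repo:

```python
async def run_episode(env, policy, probe_steps: int = 6):
    result = await env.reset()
    belief = [0.5, 0.5, 0.5]  # neutral prior over (S, M, W)
    for t in range(28):
        if t < probe_steps:
            action = policy.probe(t)  # cycle through informative actions
        else:
            action = policy.exploit(belief, result.observation)
        result = await env.step(action)
        belief = policy.update_belief(belief, result)  # fold in deltas + anomalies
    return belief
```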
## Reward Architecture (4-layer training stack)
| Layer | Function | Range | Purpose |
|---|---|---|---|
| 1 | `format_valid` | -2 to +1 | parseable as ACTION + 3 belief digits |
| 2 | `action_legal` | -1 to +0.5 | action is one of 10 valid types |
| 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
| 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs true profile vector |
**Per-step env reward** = `sum(meter_delta × hidden_weight) × 15`; the weights are sampled per profile.
**Critical threshold**: any meter < 0.1 incurs a -0.30 penalty.
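Put together, the per-step shaping is a profile-weighted sum plus the crash penalty. A sketch under the numbers above (names and the exact penalty application are illustrative):

```python
def step_reward(meter_deltas: dict, hidden_weights: dict, meters: dict) -> float:
    # Profile-weighted sum of meter changes, scaled by 15.
    r = 15.0 * sum(meter_deltas[m] * hidden_weights[m] for m in meter_deltas)
    # Critical threshold: any meter below 0.1 incurs the -0.30 penalty.
    if any(v < 0.1 for v in meters.values()):
        r -= 0.30
    return r
```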
**Final grader (v2: measures inference, not just reflex)**, with `final_score ∈ [0, 1]`:
```
score = 0.15 × crash_free + 0.20 × progress + 0.10 × connection
      + 0.25 × adaptation_score + 0.10 × efficiency + 0.20 × belief_accuracy
```
`belief_accuracy` is `1 - MAE` between the agent's last-emitted belief and
the true profile vector. Heuristic and random baselines emit no belief and
score 0 here by design; that's the point: the meta-RL skill is *inference*,
and only agents that actually try get credit.
`adaptation_score` is the implicit signal: late-half mean reward minus
early-half mean, gated by absolute late-half quality. Per-step rewards are
profile-weighted, so a high late-half mean means the agent figured out the
hidden weights and started exploiting them.
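A sketch of the two inference-sensitive components. `belief_accuracy` follows the `1 - MAE` definition above; the exact form of the gate on `adaptation_score` is an assumption:

```python
def belief_accuracy(belief: list, true_profile: list) -> float:
    # 1 - MAE between the last-emitted belief and the true profile vector.
    mae = sum(abs(b - t) for b, t in zip(belief, true_profile)) / len(belief)
    return 1.0 - mae

def adaptation_score(rewards: list) -> float:
    # Late-half mean minus early-half mean, gated by absolute late-half quality.
    half = len(rewards) // 2
    early_mean = sum(rewards[:half]) / half
    late_mean = sum(rewards[half:]) / (len(rewards) - half)
    gate = 1.0 if late_mean > 0 else 0.5  # the gate's exact shape is a guess
    return max(0.0, late_mean - early_mean) * gate
```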
> **Why we changed the grader.** Five GRPO iterations under the v1 grader
> kept tying with the heuristic. Reading the model's reasoning showed it was
> doing real inference, but inference didn't lift the score because the
> v1 grader didn't measure inference. Adding `belief_accuracy` (Δ +0.20
> weight) fixed the structural mismatch. See [`docs/iterations.md`](docs/iterations.md)
> for the full journey.
## Training: Algorithm Distillation
We train via [Algorithm Distillation](https://arxiv.org/abs/2210.14215): a
frontier teacher plays episodes, writes down its reasoning, and the student
imitates the trajectories. Two core stages, plus an optional GRPO refine:
**Stage 1: Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+ episodes
of RhythmEnv. At each step it outputs a `<reasoning>` block + `S M W ACTION`
answer line. Each episode produces 28 (state, response) tuples. ~$3 per 30
episodes via Azure pay-as-you-go.
```bash
python scripts/generate_teacher_trajectories.py \
--seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
```
**Stage 2: SFT prime.** Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is
fine-tuned on the teacher's full trajectories. The student learns both the
output format and the reasoning pattern. ~25 min on a HF Jobs `a10g-large`
(~$2-3).
```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
-e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
-e MODEL_REPO_SUFFIX=sft-v1 \
-d scripts/sft_on_hf.py
```
**Stage 3: GRPO refine on top of SFT.** Run GRPO with the existing 4-layer
reward stack, starting from the SFT checkpoint (lr 1e-5, beta 0.1 KL anchor).
This lifts OOD generalization by another **+0.023** and discrete-3 by +0.013
without regressing in-dist. The GRPO-refined model is uploaded to
[`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1).
The bulk of the improvement still comes from SFT (Stage 2); GRPO refine is
the polish.
```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
-e MODEL_NAME=InosLihka/rhythm-env-meta-trained-sft-v3 \
-e MAX_STEPS=200 -e LEARNING_RATE=1e-5 -e BETA=0.1 \
-e MODEL_REPO_SUFFIX=sft-grpo-v1 \
-d scripts/train_on_hf.py
```
### Why algorithm distillation, not GRPO from scratch
We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching
recipes. They all matched heuristic but never beat it.
The literature was unambiguous on why: small models (≤3B) need a teacher to
bootstrap reasoning skills. Pure GRPO from scratch produces shallow,
non-generalizing behavior at this scale; every successful 3B reasoning
recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory
distillation.
Once we knew that, the answer was: **use a strong teacher (gpt-5.4) we
already have access to, distill its reasoning into Qwen, ship.**
The `training/train.py` GRPO script is preserved for completeness and as
the optional Stage 3, but it isn't on the critical path of the headline
result. See [`docs/iterations.md`](docs/iterations.md) for the full journey
and what each GRPO iteration taught us.
## Reproducing the headline numbers
There are two reproduction paths depending on how much time and budget you
have. Both produce the numbers in the *Headline result* table above.
### Fast path (~10-20 min, $0): re-run eval against the published checkpoint
This is the path most reviewers should take. The trained model is already
on the Hub. Download it once, then run `inference_eval.py` (which expects a
local path) against all three eval conditions.
```bash
# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
pip install -e .
export HF_TOKEN=... # any read-scoped HF token; the model is public.
# 1. Snapshot the trained checkpoint locally.
hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
--local-dir ./models/sft-v3
# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
python training/inference_eval.py \
--model_path ./models/sft-v3 \
--num_episodes 5 \
--output_file eval_results_v2.json
```
Expected output: `eval_results_v2.json` whose per-condition averages match
[the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
(distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist /
continuous-OOD / discrete-3) within ±0.02 stochastic noise.
> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
> same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100,
> runs `inference_eval.py`, and uploads the resulting JSON back to the
> model repo. See the docstring in that script for the submit command.
### Full path (~1.5 hr, ~$5-6 in API + GPU credits): retrain from scratch
```bash
# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
cp .env.example .env # fill in AZURE_OPENAI_* + HF_TOKEN
python scripts/generate_teacher_trajectories.py \
--seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
# 2. Push teacher data to a HF dataset repo of your choice.
python scripts/upload_teacher_data.py \
--files data/teacher_30ep.jsonl \
--repo <your_user>/rhythm-env-teacher-trajectories
# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2-3).
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
-e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
-e MODEL_REPO_SUFFIX=sft-v3-repro \
-d scripts/sft_on_hf.py
# 4. Eval the new checkpoint via the Fast Path above:
hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
--local-dir ./models/sft-v3-repro
python training/inference_eval.py \
--model_path ./models/sft-v3-repro \
--num_episodes 5 \
--output_file eval_results_v2.json
```
Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py`
command shown earlier in the *Training: Algorithm Distillation* section.
## Setup Instructions
### Local Development
```bash
cd rhythm_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Docker
```bash
docker build -t rhythm-env:latest .
docker run -p 8000:8000 rhythm-env:latest
```
### Running the Baseline
```bash
# Heuristic only (no API key needed):
python inference.py
# With LLM:
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
python inference.py
```
## API Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (one of the 3 reference profiles). Default samples a fresh continuous profile. |
| `POST` | `/step` | Execute an action (`action_type`) |
| `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
| `GET` | `/health` | Health check |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action/observation JSON schemas |
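For quick manual poking, the same flow works over raw HTTP. A sketch with `requests`; the JSON payload keys below are assumptions, so check `GET /schema` for the authoritative shapes:

```python
import requests

BASE = "https://InosLihka-rhythm-env.hf.space"

# Start an episode (payload keys assumed to mirror the reset() kwargs).
obs = requests.post(f"{BASE}/reset", json={"seed": 0}).json()

# Take one step (field name assumed from the table above).
step = requests.post(f"{BASE}/step", json={"action_type": "DEEP_WORK"}).json()

print(requests.get(f"{BASE}/health").json())
print(requests.get(f"{BASE}/schema").json())
```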
## Why It Matters
This environment is a training ground for **truly personalized AI**. The product vision: wearables (HRV, sleep score) feed meter proxies, your calendar gets parsed into action types, and every Accept/Ignore tap on a recommendation is a reward signal. A small model trained in RhythmEnv already knows the *structure* of personality inference, so it personalizes to a real user in 5-10 interactions instead of 50+.
```
User installs app → wearables feed meter proxies
Calendar events → mapped to action types
Accept/Ignore taps → reward signal
Agent learns who you are → recommendations adapt
```
No setup. No personality quiz. The agent figures you out. See [sim-to-real architecture](docs/references/sim_to_real_deployment.md) for the full deployment pipeline.
## Validation
```bash
# Step 1: HF Space live ✓ (returns HTTP 200 on /reset)
# Step 2: Docker build → docker build server/
# Step 3: openenv check ✓ (openenv validate passes)
./scripts/validate-submission.sh https://InosLihka-rhythm-env.hf.space .
```
## License
BSD 3-Clause License