---
title: RhythmEnv
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
tags:
  - openenv
---

# RhythmEnv — Meta-RL Life Simulator

An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space — the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.

This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life, it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.

## Submission links (for judges)

- **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
- **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
- **Blog post**: [BLOG.md](BLOG.md) — *Teaching an AI to Know You (Without Asking)*
- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
- **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
- **Detailed results**: [docs/results.md](docs/results.md)
- **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- **Training plots**: [plots/](plots/) (also embedded below)

## Headline result

A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:

| Condition | Random | Heuristic | **Distilled Qwen 3B** | **+ GRPO refine** | belief_MAE |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.213 |
| **continuous OOD (generalization)** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |

The student's **belief_MAE of 0.213 in-distribution nearly matches the gpt-5.4 teacher's 0.196** — the inference skill transferred almost intact via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, proving generalization (not memorization).

A subsequent GRPO refine on top of the SFT'd student lifted **OOD generalization by another +0.023 (4% relative)** and discrete-3 by +0.013, with no in-dist regression. The GRPO-refined model is at [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1). For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode re-eval.

Full numbers in [docs/results.md](docs/results.md). Eval JSONs: [SFT v3](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) · [SFT v3 + GRPO](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json).

![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)

![SFT v3 vs SFT+GRPO comparison](plots/sft_grpo_comparison.png)

## Training evidence

**SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.

![SFT v3 loss](plots/sft_v3_training_loss.png)

**Reward curve** — mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.
![Reward curve](plots/grpo_iter2_reward_curve.png)

**Reward components** — all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.

![Reward components](plots/grpo_iter2_reward_components.png)

**Belief-accuracy curve** — the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.

![Belief accuracy](plots/grpo_iter2_belief_accuracy.png)

Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).

## Why a Life Simulator?

Personal AI assistants give generic advice. They don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference — it personalizes in ~5 interactions instead of 50+.

Every sampled person has a hidden "DNA" — a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers — drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.

This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)** — the agent plans across a full week while inferring a hidden personality from observation alone.

## Quick Start

```bash
pip install openenv-core
```

```python
import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=0)
        print(f"Vitality: {result.observation.vitality}")
        print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

        result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
        print(f"Reward: {result.reward}")

asyncio.run(main())
```

## The 5 Life Meters

All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.

| Meter | What It Represents | Increases With | Decreases With |
|-------|-------------------|----------------|----------------|
| **Vitality** | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
| **Cognition** | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
| **Progress** | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
| **Serenity** | Inner peace vs stress | Meditate, Me Time, Exercise | Deep Work, Admin |
| **Connection** | Relationship health | Family Time, Socialize | Passive decay every step |

**Key interactions**:

- Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
- Connection decays passively — you must actively maintain relationships
- Meters interact non-linearly: a crash in one often cascades to others
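A toy sketch of these interactions, for intuition only: the threshold, multiplier, and decay values below are made-up placeholders rather than the environment's actual constants, and the real update also applies the per-profile modifiers described later.

```python
# Toy illustration of the meter interactions above; values are placeholders, not env constants.
from dataclasses import dataclass

@dataclass
class Meters:
    vitality: float = 0.7
    cognition: float = 0.7
    progress: float = 0.5
    serenity: float = 0.6
    connection: float = 0.6

def apply_step(m: Meters, deltas: dict[str, float], connection_decay: float = 0.01) -> float:
    """Apply one action's meter deltas; returns any threshold penalty incurred."""
    # Low Vitality scales down every positive delta (the global effectiveness multiplier).
    effectiveness = 1.0 if m.vitality > 0.3 else 0.5      # placeholder threshold and factor
    for name, d in deltas.items():
        gain = d * effectiveness if d > 0 else d
        setattr(m, name, min(1.0, max(0.0, getattr(m, name) + gain)))
    # Connection decays passively every step, whatever action was taken.
    m.connection = max(0.0, m.connection - connection_decay)
    # Any meter below 0.1 triggers the heavy penalty (-0.30 in the reward section below).
    return -0.30 if min(vars(m).values()) < 0.1 else 0.0

m = Meters()
penalty = apply_step(m, {"progress": 0.08, "vitality": -0.05, "cognition": -0.04})  # e.g. DEEP_WORK
```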
## Action Space (10 Actions)

| Category | Action | Primary Effect |
|----------|--------|---------------|
| **Productivity** | `DEEP_WORK` | High Progress, drains Vitality + Cognition |
| | `ADMIN_WORK` | Moderate Progress, low drain |
| | `LEARN` | Progress + slight Cognition drain |
| **Recovery** | `SLEEP` | Strong Vitality + Cognition recovery |
| | `EXERCISE` | Vitality + Serenity boost |
| | `MEDITATE` | Strong Serenity + Cognition boost |
| **Social** | `FAMILY_TIME` | Strong Connection, costs Vitality |
| | `SOCIALIZE` | Connection + mild Serenity |
| **Leisure** | `ME_TIME` | Serenity + mild Cognition recovery |
| | `BINGE_WATCH` | Mild Serenity, drains Cognition (trap action) |

## Episode Structure

- **1 episode = 1 week** = 7 days × 4 slots/day = **28 steps**
- **Time slots**: Morning (0), Afternoon (1), Evening (2), Night (3)
- **Time-of-day effects**: Morning boosts cognitive gains (+20%), Night penalizes them (-40%)
- **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
- **Deterministic** given seed: same seed → same episode trajectory

## The Meta-Learning Setup (Core Innovation)

### What the Agent Sees Each Step

- All 5 meter values + per-meter deltas from the last action
- Current day, slot, timestep
- Active random event (if any)
- **Rolling 7-step history of (action, reward, deltas, *anomalies*)** — see below
- Total scalar reward

### The anomaly signal (the cleanest inference channel)

For every past step in the rolling history, the agent sees both the actual per-meter delta *and* a per-meter **anomaly** = `actual_delta − expected_delta_under_neutral_profile`. A neutral profile is the average person; the anomaly therefore tells the agent **how much this specific user's response deviated from the average user's response to the same action**. That deviation is what encodes the hidden personality.

Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`. Under a neutral profile the expected change is `-0.06`, so `vitality_anomaly = -0.12` — the user lost three times as much energy as the average person would. Strong evidence the hidden `social_vitality_multiplier` is high (introvert: socializing is costly). The agent should down-weight social actions and reach for solo recovery instead.

This is why a tiny model can learn meta-RL inference here: the env hands it a clean, well-typed deviation-from-baseline signal at every step. See [`models.py` (StepRecord)](models.py) for the exact field layout.
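A minimal sketch of how an agent can turn that signal into a belief update. The neutral-delta table, the scaling factor, and the function names are assumptions, not the env's API (the authoritative field layout is the `StepRecord` in [`models.py`](models.py)); the arithmetic mirrors the `SOCIALIZE` example above.

```python
# Illustrative sketch; the neutral-delta table, scaling, and function names are assumptions.
NEUTRAL_DELTAS = {("SOCIALIZE", "vitality"): -0.06}   # expected change for the "average person"

def anomaly(action: str, meter: str, actual_delta: float) -> float:
    """anomaly = actual_delta - expected_delta_under_neutral_profile."""
    return actual_delta - NEUTRAL_DELTAS.get((action, meter), 0.0)

def update_social_belief(belief_s: float, action: str, vitality_delta: float, lr: float = 0.2) -> float:
    """Nudge the S (social preference, 0-9) belief using the vitality anomaly."""
    if action not in ("SOCIALIZE", "FAMILY_TIME"):
        return belief_s
    a = anomaly(action, "vitality", vitality_delta)
    # Negative anomaly: socializing drains this person more than average, so belief in S goes down.
    return min(9.0, max(0.0, belief_s + lr * a * 50))  # the 50x scale is a placeholder

print(anomaly("SOCIALIZE", "vitality", -0.18))        # ~ -0.12, as in the example above
print(update_social_belief(5.0, "SOCIALIZE", -0.18))  # ~ 3.8: S belief moves toward "introvert"
```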
### What the Agent Does NOT See

- **The hidden personality vector** — sampled per episode, controls everything below
- **Reward weight decomposition** — same meter changes produce different rewards for different people
- **Action modifiers** — social drain, cognitive bonuses, shame spirals vary continuously

### Continuous Personality Space

Each `reset()` samples a fresh personality from parameter distributions:

| Parameter | Distribution | What it controls | Concrete intuition |
|---|---|---|---|
| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl). |
| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining | High = work is fuel, not cost (workaholic). |
| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
| `solo_serenity_bonus` | U(0, 0.10) | `ME_TIME` extra calm gain | High = recharges by being alone (introvert). |
| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
| `binge_shame` | Bernoulli(0.5) | does `BINGE_WATCH` cost extra serenity afterwards | True = guilt spiral, False = guilt-free. |
| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
| `stress_tolerance` | U(0.15, 0.30) | meter level where stress-spiral penalty fires | Low = falls apart easily. |
| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another scores high on `connection`. |

This produces an effectively infinite personality space — memorization is impossible, the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`).
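For concreteness, a simplified sketch of what that sampling step plausibly looks like, written directly from the table. The presence probability for the two "or none" bonuses and the Dirichlet concentration are guesses; the authoritative version is `sample_profile` in [`server/rhythm_environment.py`](server/rhythm_environment.py).

```python
# Sketch reconstructed from the parameter table above; not the env's actual sample_profile.
import numpy as np

def sample_profile(rng: np.random.Generator) -> dict:
    return {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": bool(rng.random() < 0.5),
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
        # "or none" bonuses: present with some probability, otherwise absent (0.5 is a guess).
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
        # Which meter changes this person values; biased away from vit/cog, assuming the
        # meter order (vitality, cognition, progress, serenity, connection).
        "reward_weights": rng.dirichlet([1.0, 1.0, 2.0, 2.0, 2.0]),
    }

profile = sample_profile(np.random.default_rng(0))
```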
### Three reference profiles

The env exposes 3 named personalities as anchor points in the continuous space. Useful for tests and reproducible eval. Reach them via `profile=` on `reset()`:

- **Introvert Morning Person** → belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
- **Extrovert Night Owl** → belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
- **Workaholic Stoic** → belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`

### The Action+Belief Output Format

Each step the agent outputs a brief reasoning block followed by an answer line:

```
Last step's socialize gave V−0.12 (anomaly −0.06, much worse than neutral) — high social drain, suggests low S.
Morning DEEP_WORK earlier gave bonus cognition (+0.04) → high M.
With low S + high M, MEDITATE is the recovery play that fits.

2 8 5 MEDITATE
```

`S M W ACTION_NAME` is the contract. Three belief digits (0-9) representing the agent's current belief about the user:

- **S** = social preference (0=hates social, 9=loves social)
- **M** = morning preference (0=night owl, 9=morning person)
- **W** = work preference (0=avoids work, 9=workaholic)

Belief-first ordering matters: in causal-LM generation, tokens generated earlier condition tokens generated later, so the action is causally conditioned on the belief — making the belief functionally useful rather than a post-hoc afterthought.

The reasoning block isn't required for parseability (parser searches for the last `S M W ACTION` match), but the SFT-distilled student learns to emit it because the teacher did.
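A sketch of that parsing rule. The regex and helper name are illustrative rather than lifted from the repo, but they implement the "last `S M W ACTION` match wins" behaviour described above:

```python
# Illustrative parser for the "S M W ACTION_NAME" contract; not the repo's actual code.
import re

ACTIONS = {"DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE", "MEDITATE",
           "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}

ANSWER_RE = re.compile(r"\b([0-9])\s+([0-9])\s+([0-9])\s+([A-Z_]+)\b")

def parse_response(text: str):
    """Return (belief_S, belief_M, belief_W, action) from the LAST matching answer line."""
    matches = [m for m in ANSWER_RE.finditer(text) if m.group(4) in ACTIONS]
    if not matches:
        return None  # the format_valid reward layer would penalize this
    s, m, w, action = matches[-1].groups()
    return int(s), int(m), int(w), action

print(parse_response("reasoning about low S, high M...\n2 8 5 MEDITATE"))
# (2, 8, 5, 'MEDITATE')
```

Responses that fail to parse fall through to the `format_valid` penalty in the reward stack below.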
### The Discovery Challenge

The agent must:

1. **Probe** — try different actions in early steps to see how the person responds
2. **Infer** — update its belief vector each step based on observed rewards
3. **Adapt** — late in the episode, exploit the belief by choosing actions matching the inferred personality

## Reward Architecture (4-layer training stack)

| Layer | Function | Range | Purpose |
|---|---|---|---|
| 1 | `format_valid` | -2 to +1 | parseable as ACTION + 3 belief digits |
| 2 | `action_legal` | -1 to +0.5 | action is one of 10 valid types |
| 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
| 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs true profile vector |

**Per-step env reward** = `sum(meter_delta × hidden_weight) × 15` — weights are sampled per profile.

**Critical threshold**: any meter < 0.1 → -0.30 penalty.

**Final grader (v2 — measures inference, not just reflex)** — `final_score ∈ [0, 1]`:

```
score = 0.15 × crash_free
      + 0.20 × progress
      + 0.10 × connection
      + 0.25 × adaptation_score
      + 0.10 × efficiency
      + 0.20 × belief_accuracy
```

`belief_accuracy` is `1 − MAE` between the agent's last-emitted belief and the true profile vector. Heuristic / random baselines emit no belief and score 0 here by design — that's the point: the meta-RL skill is *inference*, and only agents that actually try get credit.

`adaptation_score` is the implicit signal: late-half mean reward minus early-half mean, gated by absolute late-half quality. Per-step rewards are profile-weighted, so a high late-half mean means the agent figured out the hidden weights and started exploiting them.
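Spelled out as a quick sketch: the weights come straight from the formula above, while the normalization of each component and the mapping of 0-9 belief digits onto the profile vector (scaled to [0, 1] here) are assumptions.

```python
# Final grader sketch; weights are from the formula above, normalization details are assumed.
def belief_accuracy(last_belief_digits: tuple[int, int, int], true_profile: tuple[float, float, float]) -> float:
    """1 - MAE between the agent's last emitted belief and the true profile vector."""
    belief = [d / 9.0 for d in last_belief_digits]        # assume digits 0-9 map onto [0, 1]
    mae = sum(abs(b - t) for b, t in zip(belief, true_profile)) / 3
    return 1.0 - mae

def final_score(crash_free, progress, connection, adaptation_score, efficiency, belief_acc) -> float:
    """v2 grader: each component is assumed to already be in [0, 1]."""
    return (0.15 * crash_free + 0.20 * progress + 0.10 * connection
            + 0.25 * adaptation_score + 0.10 * efficiency + 0.20 * belief_acc)

# An agent that emits no belief gets 0 on the last term (as the heuristic/random baselines do).
print(final_score(1.0, 0.6, 0.5, 0.4, 0.7, belief_accuracy((2, 8, 5), (0.0, 1.0, 0.07))))
```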
> **Why we changed the grader.** Five GRPO iterations under the v1 grader
> kept tying with heuristic. Reading the model's reasoning showed it was
> doing real inference — but inference didn't lift the score because the
> v1 grader didn't measure inference. Adding `belief_accuracy` (Δ +0.20
> weight) fixed the structural mismatch. See [`docs/iterations.md`](docs/iterations.md)
> for the full journey.

## Training: Algorithm Distillation

We train via [Algorithm Distillation](https://arxiv.org/abs/2210.14215) — a frontier teacher plays episodes, writes down its reasoning, and the student imitates the trajectories. Two stages:

**Stage 1 — Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+ episodes of RhythmEnv. At each step it outputs a reasoning block + `S M W ACTION` answer line. Each episode produces 28 (state, response) tuples. ~$3 per 30 episodes via Azure pay-as-you-go.

```bash
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
```

**Stage 2 — SFT prime.** Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is fine-tuned on the teacher's full trajectories. The student learns BOTH the output format and the reasoning pattern. ~25 min on an HF Jobs `a10g-large` (~$2-3).

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py
```

**Stage 3 — GRPO refine on top of SFT.** Run GRPO with the existing 4-layer reward stack starting from the SFT'd checkpoint (lr 1e-5, beta 0.1 KL anchor). This lifts OOD generalization by another **+0.023** and discrete-3 by +0.013 without regressing in-dist. The GRPO-refined model is uploaded to [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1). The bulk of the improvement still comes from SFT (Stage 2); GRPO refine is the polish.

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e MODEL_NAME=InosLihka/rhythm-env-meta-trained-sft-v3 \
    -e MAX_STEPS=200 -e LEARNING_RATE=1e-5 -e BETA=0.1 \
    -e MODEL_REPO_SUFFIX=sft-grpo-v1 \
    -d scripts/train_on_hf.py
```

### Why algorithm distillation, not GRPO from scratch

We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching recipes. They all matched heuristic but never beat it. The literature was unambiguous on why: small models (≤3B) need a teacher to bootstrap reasoning skills. Pure GRPO from scratch produces shallow, non-generalizing behavior at this scale — every successful 3B reasoning recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory distillation.

Once we knew that, the answer was: **use a strong teacher (gpt-5.4) we already have access to, distill its reasoning into Qwen, ship.** The `training/train.py` GRPO script is preserved for completeness and as the optional Stage 3, but it isn't on the critical path of the headline result.

See [`docs/iterations.md`](docs/iterations.md) for the full journey and what each GRPO iteration taught us.

## Reproducing the headline numbers

There are two reproduction paths depending on how much time and budget you have. Both produce the numbers in the *Headline result* table above.

### Fast path (~10–20 min, $0): re-run eval against the published checkpoint

This is the path most reviewers should take. The trained model is already on the Hub. Download it once, then run `inference_eval.py` (which expects a local path) against all three eval conditions.

```bash
# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
pip install -e .
export HF_TOKEN=...   # any read-scoped HF token; the model is public.

# 1. Snapshot the trained checkpoint locally.
hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
    --local-dir ./models/sft-v3

# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
python training/inference_eval.py \
    --model_path ./models/sft-v3 \
    --num_episodes 5 \
    --output_file eval_results_v2.json
```

Expected output: `eval_results_v2.json` whose per-condition averages match [the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) (distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist / continuous-OOD / discrete-3) within ±0.02 stochastic noise.

> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
> same thing as an HF Jobs run: it snapshots the model on a remote a10g/a100,
> runs `inference_eval.py`, and uploads the resulting JSON back to the
> model repo. See the docstring in that script for the submit command.

### Full path (~1.5 hr, ~$5–6 in API + GPU credits): retrain from scratch

```bash
# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3

# 2. Push teacher data to a HF dataset repo of your choice.
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo <your-username>/rhythm-env-teacher-trajectories

# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2–3).
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=<your-username>/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v3-repro \
    -d scripts/sft_on_hf.py

# 4. Eval the new checkpoint via the Fast Path above:
hf download <your-username>/rhythm-env-meta-trained-sft-v3-repro \
    --local-dir ./models/sft-v3-repro
python training/inference_eval.py \
    --model_path ./models/sft-v3-repro \
    --num_episodes 5 \
    --output_file eval_results_v2.json
```

Stage 3 (optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py` command shown earlier in the *Training: Algorithm Distillation* section.

## Setup Instructions

### Local Development

```bash
cd rhythm_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t rhythm-env:latest .
docker run -p 8000:8000 rhythm-env:latest
```

### Running the Baseline

```bash
# Heuristic only (no API key needed):
python inference.py

# With LLM:
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
python inference.py
```

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=` (one of the 3 reference profiles). Default samples a fresh continuous profile. |
| `POST` | `/step` | Execute an action (`action_type`) |
| `GET` | `/state` | Get current state (includes hidden profile name for debugging) |
| `GET` | `/health` | Health check |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action/observation JSON schemas |

## Why It Matters

This environment is a training ground for **truly personalized AI**. The product vision: wearables (HRV, sleep score) feed meter proxies, your calendar gets parsed into action types, and every Accept/Ignore tap on a recommendation is a reward signal. A small model trained in RhythmEnv already knows the *structure* of personality inference — so it personalizes to a real user in 5–10 interactions instead of 50+.

```
User installs app         →  wearables feed meter proxies
Calendar events           →  mapped to action types
Accept/Ignore taps        →  reward signal
Agent learns who you are  →  recommendations adapt
```

No setup. No personality quiz. The agent figures you out.

See [sim-to-real architecture](docs/references/sim_to_real_deployment.md) for the full deployment pipeline.

## Validation

```bash
# Step 1: HF Space live ✓ (returns HTTP 200 on /reset)
# Step 2: Docker build → docker build server/
# Step 3: openenv check ✓ openenv validate passes
./scripts/validate-submission.sh https://InosLihka-rhythm-env.hf.space .
```

## License

BSD 3-Clause License