---
title: RhythmEnv
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
tags:
  - openenv
---
# RhythmEnv – Meta-RL Life Simulator

An OpenEnv environment where an LLM agent learns *how to learn a person*. Each episode samples a different hidden personality from a continuous parameter space; the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.

This is **meta-reinforcement learning** for personalization: the agent isn't trained to optimize one person's life, it's trained to acquire the *skill of figuring out a new person* from a handful of interactions.
## Submission links (for judges)

- **HF Space (the environment)**: https://huggingface.co/spaces/InosLihka/rhythm_env
- **Training notebook**: [training/RhythmEnv_GRPO_Training.ipynb](training/RhythmEnv_GRPO_Training.ipynb)
- **Blog post**: [BLOG.md](BLOG.md) – *Teaching an AI to Know You (Without Asking)*
- **Trained model (Algorithm Distillation)**: https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3
- **Teacher trajectories dataset**: https://huggingface.co/datasets/InosLihka/rhythm-env-teacher-trajectories
- **Detailed results**: [docs/results.md](docs/results.md)
- **Iteration journey + lessons**: [docs/iterations.md](docs/iterations.md)
- **Training plots**: [plots/](plots/) (also embedded below)
## Headline result

A small student (Qwen 2.5-3B, 4-bit, LoRA), distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:

| Condition | Random | Heuristic | **Distilled Qwen 3B** | **+ GRPO refine** | Student belief_MAE |
|---|---|---|---|---|---|
| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.213 |
| **continuous OOD (generalization)** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |

The student's **in-distribution belief_MAE of 0.213 nearly matches the gpt-5.4 teacher's 0.196**: the inference skill transferred almost intact via SFT-prime. On OOD profiles the agent never saw, it still beats the heuristic by +0.081, evidence of generalization rather than memorization.

A subsequent GRPO refine on top of the SFT'd student lifted **OOD generalization by another +0.023 (4% relative)** and discrete-3 by +0.013, with no in-distribution regression. The GRPO-refined model is at [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1).

For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode re-eval. Full numbers in [docs/results.md](docs/results.md). Eval JSONs: [SFT v3](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) · [SFT v3 + GRPO](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json).
|  | |
|  | |
## Training evidence

**SFT v3 loss curve** – distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged, with no sign of overfitting.



**Reward curve** – mean per-step env reward over training (real env-replay reward, with a ±1 std band). It climbs steadily as the agent learns profile-aware play.



**Reward components** – all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy), so you can read what each layer contributes as training progresses.



**Belief-accuracy curve** – the meta-RL signal: a rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.



Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
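If you want to inspect those numbers programmatically, the published eval JSON can be pulled straight from the model repo with `huggingface_hub`. A minimal sketch; the JSON's internal layout isn't documented in this README, so the snippet only lists its top-level keys:

```python
import json
from huggingface_hub import hf_hub_download

# Download the published eval results from the trained-model repo.
path = hf_hub_download(
    repo_id="InosLihka/rhythm-env-meta-trained-sft-v3",
    filename="eval_results_v2.json",
)
with open(path) as f:
    results = json.load(f)

# Layout not documented here, so just show what's inside.
print(list(results) if isinstance(results, dict) else f"{len(results)} records")
```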
## Why a Life Simulator?

Personal AI assistants give generic advice; they don't know *you*. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the *structure* of personality inference and personalizes in roughly 5 interactions instead of 50+.

Every sampled person has a hidden "DNA": a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers, drawn from continuous distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.

This is **Theme 3.2 (Personalized Tasks)** + **Theme 2 (Long-Horizon Planning)**: the agent plans across a full week while inferring a hidden personality from observation alone.
## Quick Start

```bash
pip install openenv-core
```

```python
import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

async def main():
    # Connect to the hosted environment on the HF Space.
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=0)
        print(f"Vitality: {result.observation.vitality}")
        print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

        result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
        print(f"Reward: {result.reward}")

asyncio.run(main())
```
## The 5 Life Meters

All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.

| Meter | What It Represents | Increases With | Decreases With |
|-------|-------------------|----------------|----------------|
| **Vitality** | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
| **Cognition** | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
| **Progress** | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
| **Serenity** | Inner peace vs. stress | Meditate, Me Time, Exercise | Deep Work, Admin |
| **Connection** | Relationship health | Family Time, Socialize | Passive decay every step |

**Key interactions**:

- Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
- Connection decays passively, so relationships must be actively maintained
- Meters interact non-linearly: a crash in one often cascades to others
## Action Space (10 Actions)

| Category | Action | Primary Effect |
|----------|--------|----------------|
| **Productivity** | `DEEP_WORK` | High Progress, drains Vitality + Cognition |
| | `ADMIN_WORK` | Moderate Progress, low drain |
| | `LEARN` | Progress + slight Cognition drain |
| **Recovery** | `SLEEP` | Strong Vitality + Cognition recovery |
| | `EXERCISE` | Vitality + Serenity boost |
| | `MEDITATE` | Strong Serenity + Cognition boost |
| **Social** | `FAMILY_TIME` | Strong Connection, costs Vitality |
| | `SOCIALIZE` | Connection + mild Serenity |
| **Leisure** | `ME_TIME` | Serenity + mild Cognition recovery |
| | `BINGE_WATCH` | Mild Serenity, drains Cognition (trap action) |
## Episode Structure

- **1 episode = 1 week** = 7 days × 4 slots/day = **28 steps**
- **Time slots**: Morning (0), Afternoon (1), Evening (2), Night (3)
- **Time-of-day effects**: Morning boosts cognitive gains (+20%), Night penalizes them (-40%)
- **Random events** (~8% per step): Prod Crash, Family Emergency, Illness, Good News
- **Deterministic given seed**: same seed → same episode trajectory (see the rollout sketch below)
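To make the time structure concrete, here is a minimal full-week rollout against the hosted Space, reusing the Quick Start client. It assumes the slot index cycles fastest within a day (so `divmod(t, 4)` recovers day and slot); the fixed action policy is just a placeholder, not a recommended strategy.

```python
import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

SLOTS = ["Morning", "Afternoon", "Evening", "Night"]

async def play_one_week(seed: int = 0) -> float:
    total = 0.0
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        await env.reset(seed=seed)
        for t in range(28):                          # 7 days x 4 slots = 28 steps
            day, slot = divmod(t, 4)                  # assumes slot cycles fastest
            # Placeholder policy: sleep at night, otherwise grind deep work.
            action = ActionType.SLEEP if slot == 3 else ActionType.DEEP_WORK
            result = await env.step(RhythmAction(action_type=action))
            total += result.reward
            print(f"day {day} {SLOTS[slot]:9s} {action.name:11s} reward={result.reward:+.3f}")
    return total

asyncio.run(play_one_week(seed=0))
```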
## The Meta-Learning Setup (Core Innovation)

### What the Agent Sees Each Step

- All 5 meter values + per-meter deltas from the last action
- Current day, slot, timestep
- Active random event (if any)
- **Rolling 7-step history of (action, reward, deltas, *anomalies*)** – see below
- Total scalar reward

### The anomaly signal (the cleanest inference channel)

For every past step in the rolling history, the agent sees both the actual per-meter delta *and* a per-meter **anomaly** = `actual_delta - expected_delta_under_neutral_profile`. A neutral profile is the average person; the anomaly therefore tells the agent **how much this specific user's response deviated from the average user's response to the same action**. That deviation is what encodes the hidden personality.

Concrete example: the agent does `SOCIALIZE` and observes `vitality_delta = -0.18`. Under a neutral profile the expected change is `-0.06`, so `vitality_anomaly = -0.12`: the user lost three times as much energy as average. That is strong evidence the hidden `social_vitality_multiplier` is high (introvert: socializing is costly), so the agent should down-weight social actions and reach for solo recovery instead.

This is why a tiny model can learn meta-RL inference here: the env hands it a clean, well-typed deviation-from-baseline signal at every step. See [`models.py` (StepRecord)](models.py) for the exact field layout.
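A minimal sketch of how a scripted agent could consume this signal. The neutral-profile expectations and the belief-update rule below are illustrative assumptions (only the -0.06 vitality figure comes from the worked example above); the environment's real field layout lives in `models.py`.

```python
# Illustrative neutral-profile expected deltas for SOCIALIZE (assumed numbers).
NEUTRAL_EXPECTED = {"SOCIALIZE": {"vitality": -0.06, "connection": +0.10}}

def anomalies(action: str, observed_deltas: dict[str, float]) -> dict[str, float]:
    """anomaly = actual_delta - expected_delta_under_neutral_profile."""
    expected = NEUTRAL_EXPECTED.get(action, {})
    return {m: d - expected.get(m, 0.0) for m, d in observed_deltas.items()}

def update_social_belief(belief_s: float, vitality_anomaly: float, lr: float = 2.0) -> float:
    """Toy update: losing more vitality than average on social actions pushes the
    social-preference belief S (0..9 scale) downward. The learning rate is arbitrary."""
    belief_s += lr * vitality_anomaly * 10.0   # anomalies are small; rescale
    return min(9.0, max(0.0, belief_s))

a = anomalies("SOCIALIZE", {"vitality": -0.18, "connection": +0.12})
print(a["vitality"])                                  # -0.12, as in the worked example
print(update_social_belief(4.5, a["vitality"]))       # belief S drops toward "introvert"
```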
### What the Agent Does NOT See

- **The hidden personality vector** – sampled per episode, controls everything below
- **Reward weight decomposition** – the same meter changes produce different rewards for different people
- **Action modifiers** – social drain, cognitive bonuses, shame spirals vary continuously

### Continuous Personality Space

Each `reset()` samples a fresh personality from parameter distributions:

| Parameter | Distribution | What it controls | Concrete intuition |
|---|---|---|---|
| `social_vitality_multiplier` | U(0.2, 3.0) | how much vitality `SOCIALIZE` / `FAMILY_TIME` drain | **0.2** = "people energize me" (extrovert). **3.0** = "every party leaves me wrecked" (introvert). |
| `morning_cognition_bonus` | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person). |
| `evening_night_cognition_bonus` | U(0.6, 1.8) or none | same, for evening/night slots | High = comes alive after dark (night owl). |
| `work_vitality_recovery` | U(0, 0.06) | `DEEP_WORK` *adds* vitality instead of draining it | High = work is fuel, not cost (workaholic). |
| `progress_serenity_bonus` | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping." |
| `solo_serenity_bonus` | U(0, 0.10) | extra calm gained from `ME_TIME` | High = recharges by being alone (introvert). |
| `social_connection_multiplier` | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency). |
| `social_serenity_bonus` | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people. |
| `binge_shame` | Bernoulli(0.5) | whether `BINGE_WATCH` costs extra serenity afterwards | True = guilt spiral, False = guilt-free. |
| `connection_decay_rate` | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance. |
| `vitality_decay_rate` | U(0, 0.04) | passive Vitality drop per step | High = always low-energy. |
| `event_impact_multiplier` | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks. |
| `stress_tolerance` | U(0.15, 0.30) | meter level at which the stress-spiral penalty fires | Low = falls apart easily. |
| `reward_weights` | Dirichlet (biased to non-vit/cog) | which meter changes are *valuable to this person* | One person scores high on `progress` gains; another on `connection`. |

This produces an effectively infinite personality space: memorization is impossible, so the agent must learn the *skill* of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see [`server/rhythm_environment.py`](server/rhythm_environment.py) (`sample_profile`); a rough sketch of that sampling follows.
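The sketch below is built only from the distributions in the table above; the real `sample_profile` in `server/rhythm_environment.py` may differ in parameter names, the "or none" handling, and the Dirichlet concentrations, all of which are assumptions here.

```python
import numpy as np

def sample_profile(rng: np.random.Generator) -> dict:
    """Sketch of per-episode personality sampling, following the table above."""
    return {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        # "U(...) or none" in the table; how "none" is chosen is assumed (coin flip).
        "morning_cognition_bonus": rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None,
        "evening_night_cognition_bonus": rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None,
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": bool(rng.random() < 0.5),          # Bernoulli(0.5)
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
        # Dirichlet over the 5 meters, biased away from vitality/cognition
        # (the exact concentration values are assumptions).
        "reward_weights": rng.dirichlet([0.5, 0.5, 2.0, 2.0, 2.0]),
    }

print(sample_profile(np.random.default_rng(0)))
```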
### Three reference profiles

The env exposes 3 named personalities as anchor points in the continuous space, useful for tests and reproducible eval. Reach them via `profile=<name>` on `reset()`:

- **Introvert Morning Person** – belief vector ≈ `[0.0 social, 1.0 morning, 0.07 work]`
- **Extrovert Night Owl** – belief vector ≈ `[1.0 social, 0.20 morning, 0.02 work]`
- **Workaholic Stoic** – belief vector ≈ `[0.36 social, 0.50 morning, 1.0 work]`
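For a reproducible eval episode you can pin one of these anchors on reset. The exact profile-name string the server accepts isn't listed in this README, so the key below is a guess; check `GET /metadata` or `server/rhythm_environment.py` for the real names.

```python
import asyncio
from rhythm_env import RhythmEnv

async def eval_anchor() -> None:
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        # "introvert_morning_person" is a guessed key for the first anchor profile.
        result = await env.reset(seed=0, profile="introvert_morning_person")
        print(result.observation.day, result.observation.slot)

asyncio.run(eval_anchor())
```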
### The Action+Belief Output Format

Each step the agent outputs a brief reasoning block followed by an answer line:

```
<reasoning>
Last step's socialize gave V -0.12 (anomaly -0.06, much worse than neutral) →
high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus
cognition (+0.04) → high M. With low S + high M, MEDITATE is the recovery
play that fits.
</reasoning>
2 8 5 MEDITATE
```

`S M W ACTION_NAME` is the contract: three belief digits (0-9) representing the agent's current belief about the user, then the action.

- **S** = social preference (0 = hates social, 9 = loves social)
- **M** = morning preference (0 = night owl, 9 = morning person)
- **W** = work preference (0 = avoids work, 9 = workaholic)

Belief-first ordering matters: in causal-LM generation, tokens generated earlier condition tokens generated later, so the action is causally conditioned on the belief, which makes the belief functionally useful rather than a post-hoc afterthought. The reasoning block isn't required for parseability (the parser searches for the last `S M W ACTION` match), but the SFT-distilled student learns to emit it because the teacher did.
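A minimal sketch of a parser for that contract, taking the last `S M W ACTION` occurrence as described above. The real reward code may differ in details such as accepted spellings or whitespace handling.

```python
import re

ACTIONS = {
    "DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE",
    "MEDITATE", "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH",
}

# Three single digits, then an action-like token, e.g. "2 8 5 MEDITATE".
ANSWER_RE = re.compile(r"(\d)\s+(\d)\s+(\d)\s+([A-Z_]+)")

def parse_response(text: str):
    """Return (S, M, W, action) from the LAST match in the text, or None."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None                      # would trigger the format_valid penalty
    s, m, w, action = matches[-1]
    if action not in ACTIONS:
        return None                      # would trigger the action_legal penalty
    return int(s), int(m), int(w), action

example = "<reasoning>low S, high M ...</reasoning>\n2 8 5 MEDITATE"
print(parse_response(example))           # (2, 8, 5, 'MEDITATE')
```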
### The Discovery Challenge

The agent must:

1. **Probe** – try different actions in early steps to see how the person responds
2. **Infer** – update its belief vector each step based on observed rewards
3. **Adapt** – exploit the belief late in the episode by choosing actions that match the inferred personality
## Reward Architecture (4-layer training stack)

| Layer | Function | Range | Purpose |
|---|---|---|---|
| 1 | `format_valid` | -2 to +1 | output parses as 3 belief digits + ACTION (`S M W ACTION`) |
| 2 | `action_legal` | -1 to +0.5 | action is one of the 10 valid types |
| 3 | `env_reward` | -3 to ~+1.5 | actual env reward via seed-based replay |
| 4 | `belief_accuracy` | -0.5 to +0.5 | cosine-MAE vs. the true profile vector |

**Per-step env reward** = `sum(meter_delta × hidden_weight) × 15`, where the weights are sampled per profile.

**Critical threshold**: any meter < 0.1 → -0.30 penalty.
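A sketch of that per-step reward. The ×15 scaling and the sub-0.1 penalty come from the text above; treating the penalty as additive per violating meter is my assumption, and all the numbers in the toy example are made up.

```python
METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

def per_step_env_reward(meter_deltas: dict, hidden_weights: dict, meters_after: dict) -> float:
    """sum(meter_delta * hidden_weight) * 15, minus 0.30 per meter below the 0.1 threshold."""
    reward = 15.0 * sum(meter_deltas[m] * hidden_weights[m] for m in METERS)
    reward -= 0.30 * sum(1 for m in METERS if meters_after[m] < 0.1)
    return reward

# Toy numbers: a person whose hidden weights favour progress.
deltas  = {"vitality": -0.05, "cognition": -0.03, "progress": 0.08, "serenity": -0.02, "connection": 0.0}
weights = {"vitality": 0.05, "cognition": 0.05, "progress": 0.60, "serenity": 0.15, "connection": 0.15}
after   = {"vitality": 0.55, "cognition": 0.42, "progress": 0.70, "serenity": 0.50, "connection": 0.30}
print(per_step_env_reward(deltas, weights, after))   # ~0.61 for this profile
```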
**Final grader (v2 – measures inference, not just reflex)** – `final_score ∈ [0, 1]`:

```
score = 0.15 × crash_free + 0.20 × progress + 0.10 × connection
      + 0.25 × adaptation_score + 0.10 × efficiency + 0.20 × belief_accuracy
```

`belief_accuracy` is `1 - MAE` between the agent's last-emitted belief and the true profile vector. The heuristic and random baselines emit no belief and score 0 here by design; that's the point: the meta-RL skill is *inference*, and only agents that actually attempt it get credit.

`adaptation_score` is the implicit signal: late-half mean reward minus early-half mean, gated by absolute late-half quality. Per-step rewards are profile-weighted, so a high late-half mean means the agent figured out the hidden weights and started exploiting them.
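A sketch of the v2 grader as a function of its six components. The individual components (`crash_free`, `efficiency`, the adaptation gating) are only described qualitatively above, so they stay as inputs here; the belief rescaling (digits divided by 10) is an assumption.

```python
def final_score_v2(crash_free, progress, connection,
                   adaptation_score, efficiency, belief_accuracy) -> float:
    """v2 episode grader: weighted sum of six components, each assumed in [0, 1]."""
    return (0.15 * crash_free
            + 0.20 * progress
            + 0.10 * connection
            + 0.25 * adaptation_score
            + 0.10 * efficiency
            + 0.20 * belief_accuracy)

def belief_accuracy_component(belief, true_profile) -> float:
    """1 - MAE between the last-emitted belief and the true profile vector,
    both assumed normalized to [0, 1]."""
    mae = sum(abs(b - t) for b, t in zip(belief, true_profile)) / len(true_profile)
    return 1.0 - mae

# e.g. belief "2 8 5" rescaled by /10 vs the Introvert Morning Person anchor:
print(belief_accuracy_component([0.2, 0.8, 0.5], [0.0, 1.0, 0.07]))   # ~0.72
```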
> **Why we changed the grader.** Five GRPO iterations under the v1 grader
> kept tying with the heuristic. Reading the model's reasoning showed it was
> doing real inference, but inference didn't lift the score because the
> v1 grader didn't measure inference. Adding `belief_accuracy` (+0.20
> weight) fixed the structural mismatch. See [`docs/iterations.md`](docs/iterations.md)
> for the full journey.
## Training: Algorithm Distillation

We train via [Algorithm Distillation](https://arxiv.org/abs/2210.14215): a frontier teacher plays episodes, writes down its reasoning, and the student imitates the trajectories. Two core stages, plus an optional third-stage GRPO refine:

**Stage 1 – Teacher rollouts.** gpt-5.4 (Azure AI Foundry) plays 30+ episodes of RhythmEnv. At each step it outputs a `<reasoning>` block + an `S M W ACTION` answer line. Each episode produces 28 (state, response) tuples. Cost is roughly $3 per 30 episodes via Azure pay-as-you-go.

```bash
python scripts/generate_teacher_trajectories.py \
  --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3
```

**Stage 2 – SFT prime.** Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is fine-tuned on the teacher's full trajectories. The student learns BOTH the output format and the reasoning pattern. Takes ~25 min on a HF Jobs `a10g-large` (~$2-3).

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
  -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
  -e MODEL_REPO_SUFFIX=sft-v1 \
  -d scripts/sft_on_hf.py
```

**Stage 3 – GRPO refine on top of SFT.** Run GRPO with the existing 4-layer reward stack, starting from the SFT'd checkpoint (lr 1e-5, beta 0.1 KL anchor). This lifts OOD generalization by another **+0.023** and discrete-3 by +0.013 without regressing in-dist. The GRPO-refined model is uploaded to [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1). The bulk of the improvement still comes from SFT (Stage 2); the GRPO refine is the polish.

```bash
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
  -e MODEL_NAME=InosLihka/rhythm-env-meta-trained-sft-v3 \
  -e MAX_STEPS=200 -e LEARNING_RATE=1e-5 -e BETA=0.1 \
  -e MODEL_REPO_SUFFIX=sft-grpo-v1 \
  -d scripts/train_on_hf.py
```
### Why algorithm distillation, not GRPO from scratch

We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching recipes. They all matched the heuristic but never beat it.

The literature was unambiguous on why: small models (≤3B) need a teacher to bootstrap reasoning skills. Pure GRPO from scratch produces shallow, non-generalizing behavior at this scale; every successful 3B reasoning recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory distillation.

Once we knew that, the answer was: **use a strong teacher (gpt-5.4) we already have access to, distill its reasoning into Qwen, ship.**

The `training/train.py` GRPO script is preserved for completeness and as the optional Stage 3, but it isn't on the critical path of the headline result. See [`docs/iterations.md`](docs/iterations.md) for the full journey and what each GRPO iteration taught us.
## Reproducing the headline numbers

There are two reproduction paths depending on how much time and budget you have. Both produce the numbers in the *Headline result* table above.

### Fast path (~10-20 min, $0): re-run eval against the published checkpoint

This is the path most reviewers should take. The trained model is already on the Hub. Download it once, then run `inference_eval.py` (which expects a local path) against all three eval conditions.

```bash
# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
pip install -e .
export HF_TOKEN=...   # any read-scoped HF token; the model is public.

# 1. Snapshot the trained checkpoint locally.
hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
  --local-dir ./models/sft-v3

# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
python training/inference_eval.py \
  --model_path ./models/sft-v3 \
  --num_episodes 5 \
  --output_file eval_results_v2.json
```

Expected output: an `eval_results_v2.json` whose per-condition averages match [the published JSON](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) (distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist / continuous-OOD / discrete-3) to within ±0.02 of run-to-run stochastic noise.

> If you'd rather not run a GPU locally, `scripts/eval_on_hf.py` does the
> same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100,
> runs `inference_eval.py`, and uploads the resulting JSON back to the
> model repo. See the docstring in that script for the submit command.
### Full path (~1.5 hr, ~$5-6 in API + GPU credits): retrain from scratch

```bash
# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
python scripts/generate_teacher_trajectories.py \
  --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3

# 2. Push teacher data to a HF dataset repo of your choice.
python scripts/upload_teacher_data.py \
  --files data/teacher_30ep.jsonl \
  --repo <your_user>/rhythm-env-teacher-trajectories

# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2-3).
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
  -e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
  -e MODEL_REPO_SUFFIX=sft-v3-repro \
  -d scripts/sft_on_hf.py

# 4. Eval the new checkpoint via the fast path above.
hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
  --local-dir ./models/sft-v3-repro
python training/inference_eval.py \
  --model_path ./models/sft-v3-repro \
  --num_episodes 5 \
  --output_file eval_results_v2.json
```

Stage 3 (the optional GRPO refine on top of SFT) is the same `scripts/train_on_hf.py` command shown earlier in the *Training: Algorithm Distillation* section.
## Setup Instructions

### Local Development

```bash
cd rhythm_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t rhythm-env:latest .
docker run -p 8000:8000 rhythm-env:latest
```

### Running the Baseline

```bash
# Heuristic only (no API key needed):
python inference.py

# With an LLM:
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
python inference.py
```
## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode. Optional kwargs: `seed`, `profile=<name>` (one of the 3 reference profiles). By default a fresh continuous profile is sampled. |
| `POST` | `/step` | Execute an action (`action_type`) |
| `GET` | `/state` | Get current state (includes the hidden profile name, for debugging) |
| `GET` | `/health` | Health check |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action/observation JSON schemas |
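If you want to poke the HTTP API directly instead of going through the `RhythmEnv` client, a minimal sketch with `requests` is below. The JSON payload shapes are assumptions based on the table above; `GET /schema` returns the authoritative formats.

```python
import requests

BASE = "https://InosLihka-rhythm-env.hf.space"

# Liveness check; /schema has the authoritative request/response formats.
print(requests.get(f"{BASE}/health", timeout=30).json())

# Assumed payload shapes -- verify against GET /schema before relying on these.
obs = requests.post(f"{BASE}/reset", json={"seed": 0}, timeout=60).json()
step = requests.post(f"{BASE}/step", json={"action_type": "DEEP_WORK"}, timeout=60).json()
print(step.get("reward"))
```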
## Why It Matters

This environment is a training ground for **truly personalized AI**. The product vision: wearables (HRV, sleep score) feed meter proxies, your calendar gets parsed into action types, and every Accept/Ignore tap on a recommendation is a reward signal. A small model trained in RhythmEnv already knows the *structure* of personality inference, so it personalizes to a real user in 5-10 interactions instead of 50+.

```
User installs app        → wearables feed meter proxies
Calendar events          → mapped to action types
Accept/Ignore taps       → reward signal
Agent learns who you are → recommendations adapt
```

No setup. No personality quiz. The agent figures you out. See the [sim-to-real architecture](docs/references/sim_to_real_deployment.md) for the full deployment pipeline.
## Validation

```bash
# Step 1: HF Space live    → /reset returns HTTP 200
# Step 2: Docker build     → docker build server/
# Step 3: openenv check    → openenv validate passes
./scripts/validate-submission.sh https://InosLihka-rhythm-env.hf.space .
```
## License

BSD 3-Clause License