---
title: RhythmEnv
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 8000
tags:
  - openenv
---

RhythmEnv - Meta-RL Life Simulator

An OpenEnv environment where an LLM agent learns how to learn a person. Each episode samples a different hidden personality from a continuous parameter space; the agent must infer who it's helping from rewards alone, then adapt its strategy mid-episode.

This is meta-reinforcement learning for personalization: the agent isn't trained to optimize one person's life; it's trained to acquire the skill of figuring out a new person from a handful of interactions.

Submission links (for judges)

Headline result

A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, beats the heuristic baseline on all three eval conditions:

| Condition | Random | Heuristic | Distilled Qwen 3B | + GRPO refine | belief_MAE |
|---|---|---|---|---|---|
| continuous in-distribution | 0.393 | 0.463 | 0.574 (+0.111) | 0.573 | 0.213 |
| continuous OOD (generalization) | 0.393 | 0.455 | 0.536 (+0.081) | 0.559 (+0.104) | 0.263 |
| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 (+0.052) | 0.520 (+0.065) | 0.430 |

The student's belief_MAE of 0.213 in-distribution is close to the gpt-5.4 teacher's 0.196: the inference skill transferred nearly perfectly via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, evidence of generalization rather than memorization.

A subsequent GRPO refine on top of the SFT'd student lifted OOD generalization by another +0.023 (4% relative) and discrete-3 by +0.013, with no in-dist regression. The GRPO-refined model is at InosLihka/rhythm-env-meta-trained-sft-grpo-v1.

For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode reeval. Full numbers in docs/results.md. Eval JSONs: SFT v3 · SFT v3 + GRPO.

v3 baseline vs trained across conditions

SFT v3 vs SFT+GRPO comparison

Training evidence

SFT v3 loss curve: distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.

SFT v3 loss

Reward curve: mean per-step env reward over training (real env-replay reward, with ±1 std band). Climbs steadily as the agent learns profile-aware play.

Reward curve

Reward components: all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.

Reward components

Belief-accuracy curve: the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.

Belief accuracy

Numbers source: eval_results_v2.json in the trained model repo.

Why a Life Simulator?

Personal AI assistants give generic advice. They don't know you. RhythmEnv trains an agent on a continuous distribution of simulated personalities so that, when it meets a real user, it already knows the structure of personality inference: it personalizes in ~5 interactions instead of 50+.

Every sampled person has a hidden "DNA": a vector of preferences (social tolerance, morning energy, work motivation) plus action modifiers, drawn from distributions. The agent sees only the resulting meter changes and rewards. To do well, it must infer the hidden vector and adapt.

This is Theme 3.2 (Personalized Tasks) + Theme 2 (Long-Horizon Planning): the agent plans across a full week while inferring a hidden personality from observation alone.

Quick Start

pip install openenv-core

import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=0)
        print(f"Vitality: {result.observation.vitality}")
        print(f"Day: {result.observation.day}, Slot: {result.observation.slot}")

        result = await env.step(RhythmAction(action_type=ActionType.DEEP_WORK))
        print(f"Reward: {result.reward}")

asyncio.run(main())

The 5 Life Meters

All meters range from 0.0 to 1.0. If any meter drops below 0.1, the agent receives a heavy penalty.

| Meter | What It Represents | Increases With | Decreases With |
|---|---|---|---|
| Vitality | Physical energy, sleep | Sleep, Exercise | Work, Socializing |
| Cognition | Focus, mental clarity | Sleep, Meditate | Deep Work, Binge Watch |
| Progress | Career/skill growth | Deep Work, Learn, Admin | Binge Watch (slightly) |
| Serenity | Inner peace vs stress | Meditate, Me Time, Exercise | Deep Work, Admin |
| Connection | Relationship health | Family Time, Socialize | Passive decay every step |

Key interactions:

  • Low Vitality reduces the effectiveness of ALL positive actions (global multiplier)
  • Connection decays passively; you must actively maintain relationships
  • Meters interact non-linearly: a crash in one often cascades to others

Action Space (10 Actions)

| Category | Action | Primary Effect |
|---|---|---|
| Productivity | DEEP_WORK | High Progress, drains Vitality + Cognition |
| Productivity | ADMIN_WORK | Moderate Progress, low drain |
| Productivity | LEARN | Progress + slight Cognition drain |
| Recovery | SLEEP | Strong Vitality + Cognition recovery |
| Recovery | EXERCISE | Vitality + Serenity boost |
| Recovery | MEDITATE | Strong Serenity + Cognition boost |
| Social | FAMILY_TIME | Strong Connection, costs Vitality |
| Social | SOCIALIZE | Connection + mild Serenity |
| Leisure | ME_TIME | Serenity + mild Cognition recovery |
| Leisure | BINGE_WATCH | Mild Serenity, drains Cognition (trap action) |

Episode Structure

  • 1 episode = 1 week = 7 days × 4 slots/day = 28 steps
  • Time slots: Morning (0), Afternoon (1), Evening (2), Night (3)
  • Time-of-day effects: Morning boosts cognitive gains (+20%), Night penalizes them (-40%)
  • Random events (~8% per step): Prod Crash, Family Emergency, Illness, Good News
  • Deterministic given seed: same seed → same episode trajectory (see the sketch below)
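For concreteness, here is a minimal sketch of a full 28-step episode using the client API from Quick Start, with a naive slot-aware policy (deep work in the morning, sleep at night). The policy itself and the use of result.done as an episode-end flag are illustrative assumptions, not part of the documented contract.

import asyncio
from rhythm_env import RhythmEnv, RhythmAction, ActionType

# Naive slot-aware policy: morning = deep work, afternoon = admin,
# evening = family time, night = sleep. Purely illustrative.
SLOT_POLICY = {0: ActionType.DEEP_WORK, 1: ActionType.ADMIN_WORK,
               2: ActionType.FAMILY_TIME, 3: ActionType.SLEEP}

async def play_one_week(seed: int = 0) -> float:
    total = 0.0
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        result = await env.reset(seed=seed)
        for _ in range(28):  # 7 days x 4 slots
            action = SLOT_POLICY[result.observation.slot]
            result = await env.step(RhythmAction(action_type=action))
            total += result.reward
            if getattr(result, "done", False):  # assumed episode-end flag
                break
    return total

# Same seed -> same trajectory for the same action sequence.
print(asyncio.run(play_one_week(seed=0)))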

The Meta-Learning Setup (Core Innovation)

What the Agent Sees Each Step

  • All 5 meter values + per-meter deltas from the last action
  • Current day, slot, timestep
  • Active random event (if any)
  • Rolling 7-step history of (action, reward, deltas, anomalies); see below
  • Total scalar reward

The anomaly signal (the cleanest inference channel)

For every past step in the rolling history, the agent sees both the actual per-meter delta and a per-meter anomaly = actual_delta - expected_delta_under_neutral_profile.

A neutral profile is the average person; the anomaly therefore tells the agent how much this specific user's response deviated from the average user's response to the same action. That deviation is what encodes the hidden personality.

Concrete example: the agent does SOCIALIZE and observes vitality_delta = -0.18. Under a neutral profile the expected change is -0.06, so vitality_anomaly = -0.12: the user lost three times as much energy as the average person would. Strong evidence the hidden social_vitality_multiplier is high (introvert: socializing is costly). The agent should down-weight social actions and reach for solo recovery instead.

This is why a tiny model can learn meta-RL inference here: the env hands it a clean, well-typed deviation-from-baseline signal at every step. See models.py (StepRecord) for the exact field layout.
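A tiny illustration of the arithmetic, reusing the SOCIALIZE example above. The dict representation and the connection numbers are made up for the illustration; the real field layout is the StepRecord in models.py.

# Per-meter deltas actually observed after SOCIALIZE ...
actual_delta = {"vitality": -0.18, "connection": +0.10}
# ... and what a neutral (average) profile would have produced for the same action.
expected_neutral_delta = {"vitality": -0.06, "connection": +0.07}

anomaly = {m: round(actual_delta[m] - expected_neutral_delta[m], 3)
           for m in actual_delta}
print(anomaly)  # {'vitality': -0.12, 'connection': 0.03}
# A vitality anomaly of -0.12 means this user pays far more energy for
# socializing than the average person: evidence of a high social_vitality_multiplier.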

What the Agent Does NOT See

  • The hidden personality vector: sampled per episode, it controls everything below
  • Reward weight decomposition: the same meter changes produce different rewards for different people
  • Action modifiers: social drain, cognitive bonuses, and shame spirals vary continuously

Continuous Personality Space

Each reset() samples a fresh personality from parameter distributions:

| Parameter | Distribution | What it controls | Concrete intuition |
|---|---|---|---|
| social_vitality_multiplier | U(0.2, 3.0) | how much vitality SOCIALIZE / FAMILY_TIME drain | 0.2 = "people energize me" (extrovert); 3.0 = "every party leaves me wrecked" (introvert) |
| morning_cognition_bonus | U(0.4, 2.0) or none | bonus on cognitive gains in morning slots | High = sharp at 8am, useless after 6pm (morning person) |
| evening_night_cognition_bonus | U(0.6, 1.8) or none | same, evening/night | High = comes alive after dark (night owl) |
| work_vitality_recovery | U(0, 0.06) | DEEP_WORK adds vitality instead of draining | High = work is fuel, not cost (workaholic) |
| progress_serenity_bonus | U(0, 0.10) | calmness gained from career progress | High = "I am at peace when I'm shipping" |
| solo_serenity_bonus | U(0, 0.10) | ME_TIME extra calm gain | High = recharges by being alone (introvert) |
| social_connection_multiplier | U(1, 2) | how strongly social actions build Connection | High = relationships compound fast (extrovert fluency) |
| social_serenity_bonus | U(0, 0.06) | extra calm from social actions | High = extrovert peace-from-people |
| binge_shame | Bernoulli(0.5) | does BINGE_WATCH cost extra serenity afterwards | True = guilt spiral, False = guilt-free |
| connection_decay_rate | U(0.005, 0.02) | passive Connection drop per step | High = relationships need active maintenance |
| vitality_decay_rate | U(0, 0.04) | passive Vitality drop per step | High = always low-energy |
| event_impact_multiplier | U(0.5, 1.0) | how hard random events (Prod Crash etc.) hit | High = brittle to setbacks |
| stress_tolerance | U(0.15, 0.30) | meter level where stress-spiral penalty fires | Low = falls apart easily |
| reward_weights | Dirichlet (biased to non-vit/cog) | which meter changes are valuable to this person | One person scores high on progress gains; another on connection |

This produces an effectively infinite personality space: memorization is impossible, so the agent must learn the skill of inference. The three named reference profiles below are concrete points in this space, useful as test anchors. For the exact sampling logic see server/rhythm_environment.py (sample_profile).
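As a rough mental model, here is a minimal sketch of the sampling step, using only the distributions in the table above. The "or none" handling, the Dirichlet concentration values, and the field names are assumptions; the authoritative implementation is sample_profile in server/rhythm_environment.py.

import numpy as np

def sample_profile_sketch(rng: np.random.Generator) -> dict:
    # Uniform ranges taken directly from the table above.
    profile = {
        "social_vitality_multiplier": rng.uniform(0.2, 3.0),
        "work_vitality_recovery": rng.uniform(0.0, 0.06),
        "progress_serenity_bonus": rng.uniform(0.0, 0.10),
        "solo_serenity_bonus": rng.uniform(0.0, 0.10),
        "social_connection_multiplier": rng.uniform(1.0, 2.0),
        "social_serenity_bonus": rng.uniform(0.0, 0.06),
        "binge_shame": bool(rng.random() < 0.5),          # Bernoulli(0.5)
        "connection_decay_rate": rng.uniform(0.005, 0.02),
        "vitality_decay_rate": rng.uniform(0.0, 0.04),
        "event_impact_multiplier": rng.uniform(0.5, 1.0),
        "stress_tolerance": rng.uniform(0.15, 0.30),
    }
    # "U(...) or none": assume a coin flip decides whether the bonus applies at all.
    profile["morning_cognition_bonus"] = rng.uniform(0.4, 2.0) if rng.random() < 0.5 else None
    profile["evening_night_cognition_bonus"] = rng.uniform(0.6, 1.8) if rng.random() < 0.5 else None
    # Dirichlet reward weights biased away from vitality/cognition
    # (alpha values are assumptions; order: vit, cog, progress, serenity, connection).
    profile["reward_weights"] = rng.dirichlet([0.5, 0.5, 1.5, 1.5, 1.5]).tolist()
    return profile

print(sample_profile_sketch(np.random.default_rng(0)))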

Three reference profiles

The env exposes 3 named personalities as anchor points in the continuous space. Useful for tests and reproducible eval. Reach them via profile=<name> on reset(); a usage sketch follows the list:

  • Introvert Morning Person → belief vector ≈ [0.0 social, 1.0 morning, 0.07 work]
  • Extrovert Night Owl → belief vector ≈ [1.0 social, 0.20 morning, 0.02 work]
  • Workaholic Stoic → belief vector ≈ [0.36 social, 0.50 morning, 1.0 work]
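A minimal usage sketch. The exact profile identifier string accepted by the server is an assumption here; check GET /schema or the server code for the canonical names.

import asyncio
from rhythm_env import RhythmEnv

async def main():
    async with RhythmEnv(base_url="https://InosLihka-rhythm-env.hf.space") as env:
        # Hypothetical identifier for the "Introvert Morning Person" anchor profile.
        result = await env.reset(profile="introvert_morning_person", seed=0)
        print(result.observation.day, result.observation.slot)

asyncio.run(main())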

The Action+Belief Output Format

Each step the agent outputs a brief reasoning block followed by an answer line:

<reasoning>
Last step's socialize gave V -0.12 (anomaly -0.06, much worse than neutral):
high social drain, suggests low S. Morning DEEP_WORK earlier gave bonus
cognition (+0.04) → high M. With low S + high M, MEDITATE is the recovery
play that fits.
</reasoning>
2 8 5 MEDITATE

S M W ACTION_NAME is the contract: three belief digits (0-9) representing the agent's current belief about the user, followed by the action name:

  • S = social preference (0=hates social, 9=loves social)
  • M = morning preference (0=night owl, 9=morning person)
  • W = work preference (0=avoids work, 9=workaholic)

Belief-first ordering matters: in causal-LM generation, tokens generated earlier condition tokens generated later, so the action is causally conditioned on the belief, making the belief functionally useful rather than a post-hoc afterthought. The reasoning block isn't required for parseability (the parser searches for the last S M W ACTION match), but the SFT-distilled student learns to emit it because the teacher did.
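A sketch of what "search for the last S M W ACTION match" can look like in practice. This is illustrative only; the project's actual parser lives in the training code and may differ in details.

import re

ACTIONS = {"DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE",
           "MEDITATE", "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}
PATTERN = re.compile(r"\b(\d)\s+(\d)\s+(\d)\s+([A-Z_]+)\b")

def parse_answer(text: str):
    """Return (belief_digits, action) from the last valid 'S M W ACTION' match, or None."""
    for s, m, w, action in reversed(PATTERN.findall(text)):
        if action in ACTIONS:
            return (int(s), int(m), int(w)), action
    return None

print(parse_answer("<reasoning>...low S, high M...</reasoning>\n2 8 5 MEDITATE"))
# ((2, 8, 5), 'MEDITATE')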

The Discovery Challenge

The agent must:

  1. Probe: try different actions in early steps to see how the person responds
  2. Infer: update its belief vector each step based on observed rewards
  3. Adapt: late in the episode, exploit the belief by choosing actions that match the inferred personality

Reward Architecture (4-layer training stack)

| Layer | Function | Range | Purpose |
|---|---|---|---|
| 1 | format_valid | -2 to +1 | parseable as ACTION + 3 belief digits |
| 2 | action_legal | -1 to +0.5 | action is one of 10 valid types |
| 3 | env_reward | -3 to ~+1.5 | actual env reward via seed-based replay |
| 4 | belief_accuracy | -0.5 to +0.5 | cosine-MAE vs true profile vector |

Per-step env reward = sum(meter_delta × hidden_weight) × 15; the weights are sampled per profile.

Critical threshold: any meter < 0.1 → -0.30 penalty.
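Putting those two rules together, a per-step reward sketch. Whether the crash penalty is applied inside or outside the ×15 scaling is an assumption here; the authoritative logic is in server/rhythm_environment.py.

def step_reward_sketch(meter_delta: dict, hidden_weights: dict, meters: dict) -> float:
    # Weighted sum of meter changes, scaled by 15 (weights are per-profile).
    reward = 15.0 * sum(meter_delta[m] * hidden_weights[m] for m in meter_delta)
    # Critical threshold: any meter below 0.1 costs a flat penalty (placement assumed).
    if any(v < 0.1 for v in meters.values()):
        reward -= 0.30
    return reward

print(step_reward_sketch(
    meter_delta={"vitality": -0.05, "progress": +0.08},
    hidden_weights={"vitality": 0.1, "progress": 0.4},
    meters={"vitality": 0.55, "cognition": 0.6, "progress": 0.4,
            "serenity": 0.5, "connection": 0.45},
))  # 15 * (-0.005 + 0.032) = 0.405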

Final grader (v2, which measures inference, not just reflex) yields final_score ∈ [0, 1]:

score = 0.15 × crash_free + 0.20 × progress + 0.10 × connection
      + 0.25 × adaptation_score + 0.10 × efficiency + 0.20 × belief_accuracy

belief_accuracy is 1 - MAE between the agent's last-emitted belief and the true profile vector. Heuristic / random baselines emit no belief and score 0 here by design; that's the point: the meta-RL skill is inference, and only agents that actually try get credit.

adaptation_score is the implicit signal: late-half mean reward minus early-half mean, gated by absolute late-half quality. Per-step rewards are profile-weighted, so a high late-half mean means the agent figured out the hidden weights and started exploiting them.
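A sketch of the two grader components described above plus the final weighted sum. The digit-to-[0, 1] mapping, the adaptation gating, and the clipping are assumptions; only the component weights are taken directly from the formula above.

import numpy as np

def belief_accuracy(last_belief_digits, true_vector) -> float:
    # Assumed mapping: digits 0-9 scaled to [0, 1]; belief_accuracy = 1 - MAE vs the true vector.
    belief = np.array(last_belief_digits, dtype=float) / 9.0
    return float(1.0 - np.abs(belief - np.asarray(true_vector)).mean())

def adaptation_score(step_rewards) -> float:
    # Late-half mean minus early-half mean, gated by absolute late-half quality (gating assumed).
    half = len(step_rewards) // 2
    early, late = np.mean(step_rewards[:half]), np.mean(step_rewards[half:])
    return float(np.clip(late - early, 0.0, 1.0)) if late > 0 else 0.0

def final_score(crash_free, progress, connection, adapt, efficiency, belief) -> float:
    # Weights match the v2 grader formula above.
    return (0.15 * crash_free + 0.20 * progress + 0.10 * connection
            + 0.25 * adapt + 0.10 * efficiency + 0.20 * belief)

# Belief "2 8 5" scored against the Introvert Morning Person vector [0.0, 1.0, 0.07].
print(belief_accuracy([2, 8, 5], [0.0, 1.0, 0.07]))  # ~0.73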

Why we changed the grader. Five GRPO iterations under the v1 grader kept tying with heuristic. Reading the model's reasoning showed it was doing real inference, but inference didn't lift the score because the v1 grader didn't measure inference. Adding belief_accuracy (Δ +0.20 weight) fixed the structural mismatch. See docs/iterations.md for the full journey.

Training: Algorithm Distillation

We train via Algorithm Distillation: a frontier teacher plays episodes, writes down its reasoning, and the student imitates the trajectories. Two core stages, plus an optional GRPO refine:

Stage 1: Teacher rollouts. gpt-5.4 (Azure AI Foundry) plays 30+ episodes of RhythmEnv. At each step it outputs a <reasoning> block + S M W ACTION answer line. Each episode produces 28 (state, response) tuples. ~$3 per 30 episodes via Azure pay-as-you-go.

python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3

Stage 2: SFT prime. Qwen 2.5-3B (Unsloth + 4-bit + LoRA r=16) is fine-tuned on the teacher's full trajectories. The student learns BOTH the output format and the reasoning pattern. About 25 min on an HF Jobs a10g-large ($2-3).

hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v1 \
    -d scripts/sft_on_hf.py

Stage 3 (optional): GRPO refine on top of SFT. Run GRPO with the existing 4-layer reward stack starting from the SFT'd checkpoint (lr 1e-5, beta 0.1 KL anchor). This lifts OOD generalization by another +0.023 and discrete-3 by +0.013 without regressing in-dist. The GRPO-refined model is uploaded to InosLihka/rhythm-env-meta-trained-sft-grpo-v1. The bulk of the improvement still comes from SFT (Stage 2); the GRPO refine is the polish.

hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e MODEL_NAME=InosLihka/rhythm-env-meta-trained-sft-v3 \
    -e MAX_STEPS=200 -e LEARNING_RATE=1e-5 -e BETA=0.1 \
    -e MODEL_REPO_SUFFIX=sft-grpo-v1 \
    -d scripts/train_on_hf.py

Why algorithm distillation, not GRPO from scratch

We tried 5 GRPO iterations from scratch on Qwen 2.5-3B before switching recipes. They all matched heuristic but never beat it.

The literature was unambiguous on why: small models (≤3B) need a teacher to bootstrap reasoning skills. Pure GRPO from scratch produces shallow, non-generalizing behavior at this scale; every successful 3B reasoning recipe (DeepSeek-R1-Distill, PRIME-RL, BREAD) uses SFT-prime or trajectory distillation.

Once we knew that, the answer was: use a strong teacher (gpt-5.4) we already have access to, distill its reasoning into Qwen, ship.

The training/train.py GRPO script is preserved for completeness and as the optional Stage 3, but it isn't on the critical path of the headline result. See docs/iterations.md for the full journey and what each GRPO iteration taught us.

Reproducing the headline numbers

There are two reproduction paths depending on how much time and budget you have. Both produce the numbers in the Headline result table above.

Fast path (~10-20 min, $0): re-run eval against the published checkpoint

This is the path most reviewers should take. The trained model is already on the Hub. Download it once, then run inference_eval.py (which expects a local path) against all three eval conditions.

# Prereqs: Python 3.10+, ~12 GB free disk, GPU strongly recommended (CPU works but is slow).
pip install -e .
export HF_TOKEN=...   # any read-scoped HF token; the model is public.

# 1. Snapshot the trained checkpoint locally.
hf download InosLihka/rhythm-env-meta-trained-sft-v3 \
    --local-dir ./models/sft-v3

# 2. Run all 3 eval conditions (discrete-3, continuous-in-dist, continuous-OOD).
python training/inference_eval.py \
    --model_path ./models/sft-v3 \
    --num_episodes 5 \
    --output_file eval_results_v2.json

Expected output: eval_results_v2.json whose per-condition averages match the published JSON (distilled student 0.574 / 0.536 / 0.507 across continuous-in-dist / continuous-OOD / discrete-3) within ±0.02 stochastic noise.

If you'd rather not run a GPU locally, scripts/eval_on_hf.py does the same thing as a HF Jobs run: it snapshots the model on a remote a10g/a100, runs inference_eval.py, and uploads the resulting JSON back to the model repo. See the docstring in that script for the submit command.

Full path (~1.5 hr, ~$5-6 in API + GPU credits): retrain from scratch

# 1. Teacher rollouts (Azure OpenAI gpt-5.4, ~$3 for 30 episodes).
cp .env.example .env   # fill in AZURE_OPENAI_* + HF_TOKEN
python scripts/generate_teacher_trajectories.py \
    --seeds 0-29 --output data/teacher_30ep.jsonl --concurrency 3

# 2. Push teacher data to a HF dataset repo of your choice.
python scripts/upload_teacher_data.py \
    --files data/teacher_30ep.jsonl \
    --repo <your_user>/rhythm-env-teacher-trajectories

# 3. SFT distill on HF Jobs (~25 min on a10g-large, ~$2-3).
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
    -e TEACHER_DATA_REPO=<your_user>/rhythm-env-teacher-trajectories \
    -e MODEL_REPO_SUFFIX=sft-v3-repro \
    -d scripts/sft_on_hf.py

# 4. Eval the new checkpoint via the Fast Path above:
hf download <your_user>/rhythm-env-meta-trained-sft-v3-repro \
    --local-dir ./models/sft-v3-repro
python training/inference_eval.py \
    --model_path ./models/sft-v3-repro \
    --num_episodes 5 \
    --output_file eval_results_v2.json

Stage 3 (optional GRPO refine on top of SFT) is the same scripts/train_on_hf.py command shown earlier in the Training: Algorithm Distillation section.

Setup Instructions

Local Development

cd rhythm_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000

Docker

docker build -t rhythm-env:latest .
docker run -p 8000:8000 rhythm-env:latest

Running the Baseline

# Heuristic only (no API key needed):
python inference.py

# With LLM:
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your-token"
python inference.py

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | /reset | Start a new episode. Optional kwargs: seed, profile=<name> (one of the 3 reference profiles). Default samples a fresh continuous profile. |
| POST | /step | Execute an action (action_type) |
| GET | /state | Get current state (includes hidden profile name for debugging) |
| GET | /health | Health check |
| GET | /metadata | Environment metadata |
| GET | /schema | Action/observation JSON schemas |
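For clients that don't use the openenv-core wrapper, the same endpoints can be hit directly over HTTP. The JSON body shapes below are assumptions for illustration; GET /schema returns the authoritative action/observation schemas.

import requests

BASE = "https://InosLihka-rhythm-env.hf.space"

# Start an episode (payload shape assumed; consult /schema for the real contract).
reset = requests.post(f"{BASE}/reset", json={"seed": 0}).json()
print(reset)

# Take one step (action_type values are the 10 actions listed above).
step = requests.post(f"{BASE}/step", json={"action": {"action_type": "DEEP_WORK"}}).json()
print(step)

# Health and metadata are simple GETs.
print(requests.get(f"{BASE}/health").json())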

Why It Matters

This environment is a training ground for truly personalized AI. The product vision: wearables (HRV, sleep score) feed meter proxies, your calendar gets parsed into action types, and every Accept/Ignore tap on a recommendation is a reward signal. A small model trained in RhythmEnv already knows the structure of personality inference, so it personalizes to a real user in 5-10 interactions instead of 50+.

User installs app → wearables feed meter proxies
Calendar events → mapped to action types
Accept/Ignore taps → reward signal
Agent learns who you are → recommendations adapt

No setup. No personality quiz. The agent figures you out. See sim-to-real architecture for the full deployment pipeline.

Validation

# Step 1: HF Space live  ✓  (returns HTTP 200 on /reset)
# Step 2: Docker build   →  docker build server/
# Step 3: openenv check  ✓  openenv validate passes

./scripts/validate-submission.sh https://InosLihka-rhythm-env.hf.space .

License

BSD 3-Clause License