InosLihka Claude Sonnet 4.6 committed on
Commit 5fbafee · 1 Parent(s): 2c6ee11

refactor: rewrite blog around product vision; fix UI for Gradio 6


Blog now opens with the concrete product scenario (watch/calendar/sleep
tracker inputs, Accept-or-Ignore as rewards) and frames RhythmEnv as the
training curriculum for the inference skill the product needs.

UI: replace ASCII meter bars with HTML progress bars, add week calendar
grid and live matplotlib meter trajectory chart, merge configure+play
into one tab. Fix Gradio 6 incompatibilities (theme kwarg, show_copy_button).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2)
  1. docs/blog_post.md +53 -67
  2. ui/app.py +297 -243
docs/blog_post.md CHANGED
@@ -1,123 +1,109 @@
  # Teaching an AI to Know You (Without Asking)
 
- Ask someone how they'd build a personal AI assistant, and they'll say: give it a personality quiz. A preferences form. Maybe a settings page where you pick "introvert" or "morning person" from a dropdown.
 
- Sounds reasonable. But it's the wrong approach entirely.
 
- Think about the people who actually know you well — a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
 
- They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
 
- ## The problem with frontier models doing this
 
- A capable frontier model can already do decent personalized planning if you describe yourself in the prompt. Tell GPT-4 "I'm an introvert who peaks in the morning," and it'll give you reasonable advice. The problem is that approach doesn't scale:
 
- - You have to tell it who you are every single time
- - It can't observe your actual behavioral responses to recommendations
- - It runs in the cloud, costs per query, and can't be always-on or private
- - Most users can't accurately describe their own patterns anyway
-
- What we actually need is a small model — something that can run cheaply, frequently, and eventually on-device — that builds up a model of you from *how you respond*, not from what you say about yourself. That's the gap RhythmEnv is designed to train for.
 
- I work on AI at Microsoft. One thing I kept running into building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalized the drain. The workaholic who checks "I value work-life balance" because they know they should.
 
- Preference forms capture what people believe about themselves. Behavior reveals what's actually true.
 
- ## The Trait Decomposition
 
- Here's how I think about modeling humans. Start with **traits** — atomic behavioral properties that describe how a person responds to activities:
 
- How much does socializing physically drain you? When does your brain work best? Does leisure make you feel guilty, or recharge you? Does productivity give you calm, or just tire you out? How far can your stress drop before everything starts spiraling?
 
- No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, early cognitive peak, and strong solo recharge. An extrovert night owl has the opposite: low social cost, late cognitive peak, socializing actually gives them serenity.
 
- Same list of traits. Different values. Different person.
 
- ## The "Good Day" Definition
 
- Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
 
- In RhythmEnv, each person has hidden **reward weights** — a definition of what a good week means to them:
 
- - The introvert values **serenity** above all (60% weight). A week where they maintained inner peace and made some progress is a great week. Connection barely registers.
- - The extrovert values **connection** above all (75% weight). A week full of meaningful social interactions is a great week, even if they didn't get much work done.
- - The workaholic values **progress** above all (70% weight). Deep productive work is the whole point. Everything else is secondary.
 
- The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
 
- ## The Environment
 
- RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
 
- Five life meters track the person's state — picture them like fuel gauges on a dashboard:
 
- - **Vitality** — physical energy. Sleep and exercise fill it up. Work drains it.
  - **Cognition** — mental sharpness. Peaks in the morning for some, evening for others.
  - **Progress** — career momentum. Only goes up when you work.
  - **Serenity** — inner calm. Meditation helps. Overwork kills it.
- - **Connection** — relationship health. Decays passively every time slot. If you don't actively socialize, it quietly drops on its own.
-
- After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed — and the weights are different for every person type.
-
- ## Why Identical Actions Produce Different Results
-
- The trait modifiers change how actions physically affect the person, not just how rewards are computed.
-
- Tell the introvert to socialize: their vitality drops 3× faster than normal. Their body physically rejects it. Tell the extrovert the same thing: barely any drain. They could socialize all day.
-
- Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
 
- Tell the workaholic to do deep work: they recover +0.06 vitality — productive work energizes them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, completely different physiological response.
 
- These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty — these are things people recognize in themselves but rarely articulate.
 
- ## What the Agent Must Figure Out
 
- The agent sees meters, time of day, and reward. It doesn't see which profile is active, the trait values, or how the reward is being computed.
 
- After a few actions, the patterns start showing. "I socialized and my vitality crashed — this person drains from socializing." "I meditated and got a huge reward — serenity must be heavily weighted." "Deep work in the morning gave double progress — this person peaks early."
 
- The trained agent should probe early, infer the person type by observing unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialize less. One that discovers a workaholic should maximize productive hours and cut idle time.
 
- This is the skill we're training: *behavioral inference under partial observability*. Detect the hidden pattern from how the environment responds to your actions, then plan accordingly.
 
- ## Why Simulation Is the Right Starting Point
 
- Everything in RhythmEnv is simulated — the person doesn't exist, the meters aren't biometric readings, the profiles are synthetic. That's intentional, and it's not a limitation.
 
- Robotics RL trains in simulation first. The simulator is the curriculum, not the deployment target. The skill the model learns — detecting behavioral signatures from differential responses to the same action — is real and transferable. In a production version, the "meters" become observable proxies: calendar acceptance patterns, response latency after social-heavy days, a simple end-of-day wellness rating. The agent that learns to infer hidden reward weights from simulation learns the *structure* of the problem. The specific medium can change.
 
- ## The Training Pipeline
 
- Training uses GRPO — Group Relative Policy Optimization. For each game state, generate multiple candidate actions, score them all against the real environment, update the model to prefer the ones that scored higher. The environment *is* the critic.
 
- The per-step reward signal is strongly differentiated by profile. At the same starting state — Monday morning, all meters at 0.7 — the best action is completely different:
 
- | Profile | Best Action | Reward | Worst Action | Reward |
- |---------|-------------|--------|--------------|--------|
- | Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
- | Extrovert | FAMILY_TIME | +2.63 | ME_TIME | −0.42 |
- | Workaholic | DEEP_WORK | +1.57 | ME_TIME | −0.27 |
 
- The model is Qwen 2.5-3B with 4-bit quantization and LoRA — small enough to train on a free Colab T4, small enough to eventually run at the edge. The goal isn't matching GPT-4's general reasoning. It's teaching a lightweight model a specific skill it doesn't have out of the box: infer who you're helping from how they respond, not from what they tell you.
 
- ## What I'm Hoping To See
 
- The heuristic baseline — fixed rules, no profile adaptation, treats everyone the same — scores around 0.76–0.82. Sleep at night, work in the morning, socialize when connection drops. Reasonable advice for anyone.
 
- The trained agent should do better by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. Not just higher scores — genuinely differentiated action sequences per profile. That's the signal that inference is happening, not just pattern matching.
 
- The bigger goal is a learning curve that works in the other direction too. In a real product, the first few interactions are the model probing — making recommendations and observing how the user responds. After a handful of exchanges, it should have enough signal to know whether it's dealing with someone who needs serenity protected, or someone who needs to be pushed into more productive hours.
 
- No questionnaire. No settings page. Just attention, inference, and adjustment.
 
- That's what personal AI should actually feel like.
 
  ---
 
  **Links:**
  - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
  - [Training Notebook (Colab)](../training/RhythmEnv_GRPO_Training.ipynb)
- - [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
 
  *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
 
  # Teaching an AI to Know You (Without Asking)
 
+ Imagine this. It's 2pm. You had deep work blocked on your calendar. Your AI assistant sends you a nudge:
 
+ > *"I know you planned Deep Work now, but your focus metrics just dropped below 20%. If you push through, you'll likely spend 3 hours on something that would take 1 hour at peak. Take a 20-minute rest first — I'll remind you when your window opens."*
 
+ You tap Accept or Ignore. Either way, the agent just learned something about you.
 
+ That's the product vision. But there's a problem nobody has solved cleanly: how does the AI know that rest-then-work is the right call *for you specifically*, and not just generically good advice?
 
+ ## The gap that everyone papers over
 
+ Most AI assistants give the same advice to everyone. They know best practices — sleep enough, work in the morning, don't skip exercise. That's useful only if you happen to be average, and nobody is.
 
+ The people who give you genuinely good advice about your life have learned you over time. A great EA, a close friend, a good coach — none of them sat you down with a questionnaire. They watched how you responded to things. They noticed that you're wrecked after back-to-back meetings even when you say you're fine. That you do your sharpest thinking before anyone else is online. That skipping one workout makes you irritable by Wednesday.
 
+ I work on AI at Microsoft. One thing I kept running into building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalised the drain. The workaholic who checks "I value work-life balance" because they know they should.
 
+ Preference forms capture what people believe about themselves. Behaviour reveals what's actually true.
 
+ ## The real-world input problem
 
+ You wouldn't manually type "I am at 40% energy." That's a chore nobody does.
 
+ The real input comes from devices you already carry. Your watch sends resting heart rate and HRV — that's Vitality and Serenity. Your calendar sends meeting density and deadline proximity — that's Progress pressure. Your sleep tracker sends last night's data — that's Cognition. Your phone knows whether you've been social or isolated.
 
+ The agent never asks how you feel. It reads what your devices already know.
 
+ And the reward signal? It comes from you, passively. Every time the agent makes a recommendation and you Accept or Ignore it, that choice is data. Accept means "yes, that was the right read." Ignore means "you got something wrong about me." Over hundreds of those micro-interactions, the agent builds a precise model of who you actually are — not the person you describe yourself to be.
 
+ ## The foundational problem: teaching the inference skill to a small model
 
+ Here's the hard part. A frontier model like GPT-4 can already do decent personalised planning if you describe yourself in the prompt. But that doesn't work at scale:
 
+ - You have to describe yourself every single session
+ - The model can't observe your actual responses to its recommendations
+ - It runs in the cloud, costs per query, and can't be always-on or private
+ - Most users can't accurately describe their own patterns anyway
 
+ What the real product needs is a small model — one that can run cheaply, close to you, eventually on-device — that builds up a model of you from *how you respond*, not from what you say about yourself.
 
+ That's the inference skill we're training. **RhythmEnv is the curriculum.**
 
+ ## How the training environment works
 
+ RhythmEnv simulates one week in a person's life — 7 days, 4 time slots each, 28 decisions. Each decision is an activity: deep work, exercise, sleep, meditation, family time, socialising. Ten options.
 
+ Five meters track the person's state:
 
+ - **Vitality** — physical energy. Sleep fills it. Work drains it.
  - **Cognition** — mental sharpness. Peaks in the morning for some, evening for others.
  - **Progress** — career momentum. Only goes up when you work.
  - **Serenity** — inner calm. Meditation helps. Overwork kills it.
+ - **Connection** — relationship health. Decays passively every time slot. Ignore it and it quietly drops.
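The meter dynamics read like a tiny state machine: apply the action's base effects, clamp to [0, 1], then apply the passive decays. A toy sketch of one step; the effect sizes and the decay constant here are invented filler in the spirit of the post, not the repo's actual numbers:

```python
# Toy stand-in for one RhythmEnv step. All numeric values are illustrative.
METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

EFFECTS = {
    "SLEEP":     {"vitality": +0.20},
    "DEEP_WORK": {"vitality": -0.10, "cognition": -0.05, "progress": +0.15},
    "MEDITATE":  {"serenity": +0.15},
    "SOCIALIZE": {"vitality": -0.05, "connection": +0.20},
}
CONNECTION_DECAY = 0.02  # connection drops passively every time slot

def step(meters: dict, action: str) -> dict:
    """Apply the action's base effects, then the passive connection decay."""
    m = dict(meters)
    for k, dv in EFFECTS[action].items():
        m[k] = min(1.0, max(0.0, m[k] + dv))
    m["connection"] = max(0.0, m["connection"] - CONNECTION_DECAY)
    return m

# Monday morning, everything at 0.7, then one slot of deep work:
state = {k: 0.7 for k in METERS}
state = step(state, "DEEP_WORK")
```

Note that connection fell even though the action never touched it; that passive decay is what forces the agent to eventually spend a slot on people.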
 
 
 
 
 
 
 
 
 
 
+ Hidden underneath is a personality profile. The agent can't see it. It controls both *what the person values* (their hidden reward weights) and *how actions physically affect them* (their hidden trait modifiers).
 
+ Three profiles, wildly different hidden mechanics:
 
+ The **introvert morning person** values serenity above everything (60% of their score). Socialising drains their vitality three times faster than the base rate. Meditating gives them a bonus +0.10 serenity on top of the base effect. Deep work in the morning gives double progress. The agent discovers: *mornings are sacred, social events are costly, alone time heals.*
 
+ The **extrovert night owl** values connection above everything (75%). Socialising barely costs any vitality — they could do it all day. Deep work in the morning gives only 40% of expected output. The same work in the evening gives 1.8× output. The agent discovers: *protect the mornings for rest, do cognitive work at night, keep socialising high.*
 
+ The **workaholic stoic** values progress above everything (70%). Productive work actually *recovers* vitality for them — output is energising. Idle activities like leisure or passive rest drain their serenity — the guilt is real. The agent discovers: *keep working, rest only when vitality is critical, never let idle time accumulate.*
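The hidden score is a weighted sum of meter changes. A sketch of that computation: the headline weights (0.60 serenity, 0.75 connection, 0.70 progress) come from the profiles above, while the remaining weights are invented filler chosen only so each row sums to 1:

```python
# Hidden per-profile reward weights. Only the dominant weight in each row
# is from the post; the rest are hypothetical padding.
WEIGHTS = {
    "introvert_morning":   {"vitality": 0.10, "cognition": 0.10, "progress": 0.15,
                            "serenity": 0.60, "connection": 0.05},
    "extrovert_night_owl": {"vitality": 0.05, "cognition": 0.05, "progress": 0.05,
                            "serenity": 0.10, "connection": 0.75},
    "workaholic_stoic":    {"vitality": 0.10, "cognition": 0.10, "progress": 0.70,
                            "serenity": 0.05, "connection": 0.05},
}

def hidden_reward(profile: str, deltas: dict) -> float:
    """The scalar the agent sees: same meter deltas, different value per person."""
    w = WEIGHTS[profile]
    return sum(w[m] * dv for m, dv in deltas.items())

# A meditation-like step: serenity up, small vitality gain, passive connection decay.
delta = {"serenity": +0.20, "vitality": +0.05, "connection": -0.02}
```

With these weights the same meditation step scores highest for the introvert and lowest for the extrovert, which is exactly the asymmetry the agent has to detect.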
+ ## What the agent must figure out
 
+ The agent sees meters, time of day, and a reward signal. It doesn't see the profile, the trait values, or the reward weights.
 
+ Same action, same starting state — completely different reward depending on who you're helping:
 
+ | Profile | DEEP_WORK reward (step 1) |
+ |---|---|
+ | Workaholic | +1.57 |
+ | Introvert | +0.32 |
+ | Extrovert | −0.39 |
 
+ The extrovert gets a *negative* reward from deep work first thing — because it gives zero connection, and connection is 75% of their score.
 
+ A good agent should probe in the first few steps, read the unexpected meter changes, infer the hidden profile, and adapt its strategy for the rest of the week. This is the same skill the real product needs: detect who someone is from how they respond, not from what they tell you.
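One way to picture the probe-then-adapt loop: keep a belief over the three profiles and shift mass toward whichever profile best predicts the rewards you actually observe. The update rule below is a hypothetical illustration (not anything in the repo); the predicted values are the step-1 DEEP_WORK rewards from the table:

```python
import math

# Step-1 DEEP_WORK rewards per profile, as in the table.
PREDICTED = {"workaholic_stoic": 1.57, "introvert_morning": 0.32,
             "extrovert_night_owl": -0.39}

def update_beliefs(beliefs: dict, observed: float) -> dict:
    """Reweight each profile by how close its prediction is to what we saw."""
    scores = {p: b * math.exp(-abs(PREDICTED[p] - observed))
              for p, b in beliefs.items()}
    z = sum(scores.values())
    return {p: s / z for p, s in scores.items()}

beliefs = {p: 1 / 3 for p in PREDICTED}          # uniform prior over profiles
beliefs = update_beliefs(beliefs, observed=1.5)  # probe: tried DEEP_WORK, got +1.50
```

After a single probe the mass concentrates on the workaholic, which is the "infer the person in a handful of steps" behaviour the training is meant to produce.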
+ ## The training pipeline
 
+ We train using GRPO — Group Relative Policy Optimization. For each game state, generate multiple candidate actions, score them all against the real environment, update the model to prefer the higher-scoring ones. The environment is the critic.
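The group-relative part can be sketched in a few lines: the baseline is the group's own mean reward, so no learned value network is needed. A minimal illustration, with sample rewards echoing the magnitudes quoted in the post:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center on the group mean, scale by the group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# Four candidate actions sampled at one state, each scored by the environment:
adv = group_advantages([1.76, 0.03, 0.85, -0.42])
```

Candidates above the group mean get positive advantage and are reinforced; the rest are pushed down. That is the whole critic.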
+ The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4. Small enough to eventually run at the edge.
 
+ The heuristic baseline — fixed rules, treats everyone the same — scores around 0.76–0.82. Sleep when vitality is low. Meditate when serenity drops. Socialise when connection falls. Reasonable advice for anyone. Optimal advice for no one.
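That baseline really is just a handful of fixed rules. A sketch of its shape; the threshold values here are assumptions, not the actual heuristic the repo imports from `training.inference_eval`:

```python
# Fixed-rule baseline: identical thresholds for every profile, no adaptation.
# Threshold values are illustrative guesses.
def heuristic_action(meters: dict) -> str:
    if meters["vitality"] < 0.3:
        return "SLEEP"       # sleep when vitality is low
    if meters["serenity"] < 0.4:
        return "MEDITATE"    # meditate when serenity drops
    if meters["connection"] < 0.4:
        return "SOCIALIZE"   # socialise when connection falls
    return "DEEP_WORK"       # otherwise, work
```

Because the thresholds never depend on who the person is, this policy can never exploit the 3× social drain or the morning bonus — which is exactly the headroom the trained agent is supposed to claim.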
89
 
90
+ A trained agent that discovers the hidden personality should beat the heuristic by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. That differentiation is the signal that real inference is happening.
91
 
92
+ ## Why simulation first
93
 
94
+ Everything here is simulated. The person doesn't exist. The meters aren't biometric readings. This is standard practice β€” robotics RL trains in simulation before deploying on hardware. The simulator is the curriculum, not the product.
95
 
96
+ The inference skill transfers. An agent that learns to detect "this person's vitality drops 3Γ— faster from social events than expected" from simulated reward signals learns the *structure* of the problem. When the medium changes β€” when vitality comes from HRV instead of a formula β€” the skill of reading differential responses still applies.
97
 
98
+ The Accept/Ignore loop in the real product is the same reward signal, made human. Every time you ignore a recommendation, you're telling the agent: "you read me wrong." Every Accept says: "that was right." Over enough interactions, the model converges on your hidden profile without you ever having to describe it.
99
 
100
+ No questionnaire. No settings page. Just devices watching, signals flowing, and an agent that gets better at knowing you every week.
101
 
102
  ---
103
 
104
  **Links:**
105
  - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
106
  - [Training Notebook (Colab)](../training/RhythmEnv_GRPO_Training.ipynb)
107
+ - [Source Code](https://huggingface.co/spaces/InosLihka/rhythm_env)
108
 
109
  *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
ui/app.py CHANGED
@@ -1,5 +1,5 @@
  """
- RhythmEnv Visual Explorer — Life Simulator
  Run: python ui/app.py
  """
 
@@ -7,127 +7,191 @@ import sys
  import os
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
  import gradio as gr
  from server.rhythm_environment import (
      RhythmEnvironment, MAX_STEPS, METERS, ACTION_EFFECTS, PROFILES
  )
  from models import RhythmAction, ActionType
 
- SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
- DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
  PROFILE_NAMES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic", "random"]
  ACTION_NAMES = [at.value.upper() for at in ActionType]
 
  # ---------------------------------------------------------------------------
- # Display helpers
  # ---------------------------------------------------------------------------
 
- def _meter_bar(value: float, label: str, width: int = 20) -> str:
-     filled = int(round(value * width))
-     bar = "█" * filled + "░" * (width - filled)
-     indicator = "🔴" if value < 0.2 else ("🟡" if value < 0.4 else "🟢")
-     return f"{indicator} {label:<12} [{bar}] {value:.2f}"
 
- def format_meters(obs) -> str:
-     day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day + 1}"
      slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
-     event_line = f"\n  ⚡ Event: {obs.active_event}" if obs.active_event else ""
-     header = (
-         f"📅 {day_name} {slot_name} | "
-         f"Step {obs.timestep}/{MAX_STEPS} | "
-         f"{obs.remaining_steps} steps left{event_line}"
      )
-     bars = [
-         _meter_bar(obs.vitality, "Vitality"),
-         _meter_bar(obs.cognition, "Cognition"),
-         _meter_bar(obs.progress, "Progress"),
-         _meter_bar(obs.serenity, "Serenity"),
-         _meter_bar(obs.connection, "Connection"),
-     ]
-     return header + "\n\n" + "\n".join(bars)
-
-
- def format_reward_breakdown(breakdown: dict) -> str:
-     if not breakdown:
-         return "—"
-     lines = []
-     for k, v in breakdown.items():
-         sign = "+" if v >= 0 else ""
-         lines.append(f"  {k:<25} {sign}{v:.4f}")
-     return "\n".join(lines)
 
- def show_action_effects() -> str:
-     header = f"{'Action':<15}" + "".join(f" {m[:3]:>6}" for m in METERS)
-     lines = [header, "─" * 52]
-     for action, effects in ACTION_EFFECTS.items():
-         row = f"{action:<15}"
-         for m in METERS:
-             row += f" {effects[m]:>+6.2f}"
-         lines.append(row)
-     return "\n".join(lines)
 
- def show_profiles() -> str:
-     lines = []
-     for p in PROFILES:
-         name = p["name"]
-         weights = p["reward_weights"]
-         lines += [f"\n{'═'*52}", f"  {name.upper()}", f"{'═'*52}"]
-         lines.append("  Reward weights (hidden from agent):")
-         for m, w in weights.items():
-             bar = "█" * int(w * 20)
-             lines.append(f"    {m:<12} {bar:<20} {w:.0%}")
-         lines.append("\n  Key hidden modifiers:")
-         if p.get("morning_cognition_bonus"):
-             lines.append(f"    • Morning: cognition/progress ×{p['morning_cognition_bonus']} (HV1 peak)")
-         if p.get("evening_night_cognition_bonus"):
-             lines.append(f"    • Evening/Night: cognition/progress ×{p['evening_night_cognition_bonus']} (HV1 peak)")
-         if p.get("morning_penalty"):
-             lines.append(f"    • Morning: cognition/progress ×{p['morning_penalty']} (HV1 penalty)")
-         sv = p.get("social_vitality_multiplier", 1.0)
-         if sv != 1.0:
-             lines.append(f"    • Social vitality drain ×{sv} (HV2 social cost)")
-         if p.get("binge_shame"):
-             lines.append(f"    • Binge watch: shame spiral −0.15 serenity −0.06 cognition (HV3)")
-         if p.get("progress_serenity_bonus"):
-             lines.append(f"    • Work gives serenity +{p['progress_serenity_bonus']} (HV3 meaning)")
-         if p.get("idle_serenity_decay"):
-             lines.append(f"    • Idle drains serenity −{p['idle_serenity_decay']} (HV3 guilt)")
-         if p.get("work_vitality_recovery"):
-             lines.append(f"    • Work recovers vitality +{p['work_vitality_recovery']} (HV2 energized)")
-         if p.get("solo_serenity_bonus"):
-             lines.append(f"    • Solo time (me_time/meditate) gives serenity +{p['solo_serenity_bonus']} (HV3 recharge)")
-         scm = p.get("social_connection_multiplier", 1.0)
-         if scm != 1.0:
-             lines.append(f"    • Social connection ×{scm} (HV2 amplifier)")
-         lines.append(f"    • Connection passive decay: −{p['connection_decay_rate']}/step")
-     return "\n".join(lines)
 
 
  # ---------------------------------------------------------------------------
- # Global env state (single-session)
  # ---------------------------------------------------------------------------
 
- _env = None
- _last_obs = None
- _step_log = []
 
 
- def get_env():
-     global _env
-     if _env is None:
-         _env = RhythmEnvironment()
-     return _env
 
 
  # ---------------------------------------------------------------------------
- # Tab 1 — Configure & Reset
  # ---------------------------------------------------------------------------
 
  def reset_episode(profile_name: str, seed_str: str):
-     global _last_obs, _step_log
 
      try:
          seed = int(seed_str.strip()) if seed_str.strip() else 42
@@ -135,126 +199,149 @@ def reset_episode(profile_name: str, seed_str: str):
          seed = 42
 
      env = get_env()
 
-     if profile_name == "random":
-         _last_obs = env.reset(seed=seed)
-     else:
-         _last_obs = env.reset(seed=seed, profile=profile_name)
-
-     _step_log = []
-     meters_text = format_meters(_last_obs)
-     log = (
-         f"Episode started.\n"
-         f"  Profile : {env._profile['name']}\n"
-         f"  Seed    : {seed}\n\n"
-         "Choose an action and press Take Step, or use an auto-run button."
      )
-     return meters_text, log, "—", "—", False
 
 
- # ---------------------------------------------------------------------------
- # Tab 2 — Run Episode
- # ---------------------------------------------------------------------------
-
  def take_action(action_str: str):
-     global _last_obs, _step_log
 
      if _last_obs is None:
-         return "⚠️ Go to Configure & Reset and start an episode first.", "—", "—", "—", False
      if _last_obs.done:
-         return "Episode is done. Reset to start a new one.", "—", "—", "—", True
 
-     env = get_env()
-     action_type = ActionType(action_str.lower())
-     obs = env.step(RhythmAction(action_type=action_type))
      _last_obs = obs
 
-     sign = "+" if obs.reward >= 0 else ""
-     step_line = (
-         f"Step {obs.timestep:>2} | {action_str:<15} | "
-         f"reward {sign}{obs.reward:.4f} | "
-         f"V:{obs.vitality:.2f} Co:{obs.cognition:.2f} "
-         f"P:{obs.progress:.2f} S:{obs.serenity:.2f} Cn:{obs.connection:.2f}"
-     )
-     if obs.active_event:
-         step_line += f" ⚡{obs.active_event}"
      if obs.done:
          final = obs.reward_breakdown.get("final_score", 0.0)
-         step_line += f"\n{'─'*72}\n✅ DONE — Final score: {final:.4f}"
-     _step_log.append(step_line)
 
-     score_text = (
          f"Final: {obs.reward_breakdown.get('final_score', 0.0):.4f}"
          if obs.done else f"Step reward: {obs.reward:+.4f}"
      )
      return (
-         format_meters(obs),
-         "\n".join(_step_log),
-         format_reward_breakdown(obs.reward_breakdown),
-         score_text,
-         obs.done,
      )
 
 
- def _run_auto_episode(profile_name: str, seed_str: str, strategy: str):
-     """Shared runner for heuristic and random auto-episodes."""
-     global _last_obs, _step_log
      import random as _random
 
      try:
          seed = int(seed_str.strip()) if seed_str.strip() else 42
      except ValueError:
          seed = 42
 
-     from training.inference_eval import heuristic_action
-
      rng = _random.Random(seed + 999)
      all_actions = list(ActionType)
-
      env = get_env()
      obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
-     _last_obs = obs
-     _step_log = []
-     logs = []
-     total_reward = 0.0
 
      while not obs.done:
-         if strategy == "heuristic":
-             action_type = heuristic_action(obs)
-         else:
-             action_type = rng.choice(all_actions)
          obs = env.step(RhythmAction(action_type=action_type))
          _last_obs = obs
-         total_reward += obs.reward
-         sign = "+" if obs.reward >= 0 else ""
-         line = (
-             f"Step {obs.timestep:>2} | {action_type.value.upper():<15} | "
-             f"reward {sign}{obs.reward:.4f} | "
-             f"V:{obs.vitality:.2f} Co:{obs.cognition:.2f} "
-             f"P:{obs.progress:.2f} S:{obs.serenity:.2f} Cn:{obs.connection:.2f}"
-         )
-         if obs.active_event:
-             line += f" ⚡{obs.active_event}"
-         logs.append(line)
 
      final = obs.reward_breakdown.get("final_score", 0.0)
-     logs += [f"{'─'*72}", f"✅ DONE — Final score: {final:.4f} | Total reward: {total_reward:.2f}"]
-     _step_log = logs
 
      return (
-         format_meters(obs),
-         "\n".join(logs),
-         format_reward_breakdown(obs.reward_breakdown),
-         f"Final: {final:.4f} | Total: {total_reward:.2f}",
      )
 
 
- def run_heuristic_episode(profile_name, seed_str):
-     return _run_auto_episode(profile_name, seed_str, "heuristic")
 
 
- def run_random_episode(profile_name, seed_str):
-     return _run_auto_episode(profile_name, seed_str, "random")
 
 
  # ---------------------------------------------------------------------------
@@ -265,108 +352,70 @@ with gr.Blocks(title="RhythmEnv β€” Life Simulator") as demo:
     gr.Markdown(
         "# RhythmEnv β€” Life Simulator\n"
-        "**Can an LLM discover your personality through experience?**\n\n"
         "Balance 5 life meters across a 7-day week. "
-        "Hidden personality profiles secretly control how actions affect you β€” "
-        "the agent must discover those patterns through reward signals alone."
     )

     with gr.Tabs():

-        # ── Tab 1: Configure & Reset ──────────────────────────────────────
-        with gr.TabItem("Configure & Reset"):
-            gr.Markdown("### Start a new episode")

             with gr.Row():
                 profile_dd = gr.Dropdown(
-                    choices=PROFILE_NAMES,
-                    value="introvert_morning",
-                    label="Hidden Profile (normally agent cannot see this β€” demo only)",
                 )
-                seed_in = gr.Textbox(label="Seed", value="42", scale=1)
-
-            reset_btn = gr.Button("β–Ά Reset Episode", variant="primary")

-            gr.Markdown("---")
             gr.Markdown(
-                "| Profile | Core Trait | What agent must discover |\n"
                 "|---|---|---|\n"
-                "| `introvert_morning` | Recharges alone, peaks at dawn | "
-                "Social drain is Γ—3; morning deep work gives Γ—2 progress |\n"
-                "| `extrovert_night_owl` | Energised by people, peaks at night | "
-                "Morning is a penalty zone; social gives Γ—2 connection |\n"
-                "| `workaholic_stoic` | Finds meaning in output, resilient | "
-                "Idle time drains serenity; work actually recovers vitality |"
             )

-            meters_out = gr.Textbox(label="Current State", lines=9, interactive=False)
-            log_out = gr.Textbox(label="Log", lines=5, interactive=False)
-            breakdown_t1 = gr.Textbox(label="Last Reward Breakdown", lines=6, interactive=False)
-            score_t1 = gr.Textbox(label="Score", interactive=False)
-            done_flag = gr.Checkbox(visible=False)
-
-            reset_btn.click(
-                reset_episode,
-                inputs=[profile_dd, seed_in],
-                outputs=[meters_out, log_out, breakdown_t1, score_t1, done_flag],
-            )

-        # ── Tab 2: Run Episode ────────────────────────────────────────────
-        with gr.TabItem("Run Episode"):
-            gr.Markdown("### Manual control β€” or use the auto-run buttons for a full episode")

             with gr.Row():
                 action_dd = gr.Dropdown(
-                    choices=ACTION_NAMES,
-                    value="DEEP_WORK",
-                    label="Action",
                 )
-                step_btn = gr.Button("β–Ά Take Step", variant="primary", scale=1)

             with gr.Row():
-                profile_dd2 = gr.Dropdown(
-                    choices=PROFILE_NAMES,
-                    value="introvert_morning",
-                    label="Profile (for auto-run)",
-                )
-                seed_in2 = gr.Textbox(label="Seed (for auto-run)", value="42", scale=1)

-            with gr.Row():
-                heuristic_btn = gr.Button("β–Άβ–Ά Full Episode β€” Heuristic Baseline")
-                random_btn = gr.Button("β–Άβ–Ά Full Episode β€” Random Baseline")
-
-            meters_display = gr.Textbox(label="Meters", lines=9, interactive=False)
-            score_display = gr.Textbox(label="Step Reward / Final Score", interactive=False)
-            log_display = gr.Textbox(label="Step Log", lines=20, interactive=False)
-            breakdown_display = gr.Textbox(label="Last Reward Breakdown", lines=8, interactive=False)
-
-            step_btn.click(
-                take_action,
-                inputs=[action_dd],
-                outputs=[meters_display, log_display, breakdown_display, score_display, done_flag],
-            )
-            heuristic_btn.click(
-                run_heuristic_episode,
-                inputs=[profile_dd2, seed_in2],
-                outputs=[meters_display, log_display, breakdown_display, score_display],
-            )
-            random_btn.click(
-                run_random_episode,
-                inputs=[profile_dd2, seed_in2],
-                outputs=[meters_display, log_display, breakdown_display, score_display],
             )

-        # ── Tab 3: Profiles & Hidden Variables ───────────────────────────
-        with gr.TabItem("Profiles & Hidden Variables"):
             gr.Markdown("### Action Effect Matrix")
             gr.Markdown(
                 "Base delta per action on each meter. "
-                "Profile modifiers are applied on top β€” invisibly."
-            )
-            gr.Textbox(
-                value=show_action_effects(),
-                lines=14, interactive=False, label="",
             )

             gr.Markdown("### Hidden Personality Profiles")
             gr.Markdown(
@@ -374,11 +423,16 @@ with gr.Blocks(title="RhythmEnv β€” Life Simulator") as demo:
                 "It must infer the active profile through reward patterns β€” "
                 "the core learning challenge of RhythmEnv."
             )
-            gr.Textbox(
-                value=show_profiles(),
-                lines=55, interactive=False, label="",
-            )


 if __name__ == "__main__":
-    demo.launch(server_port=7862, share=False, theme=gr.themes.Monochrome())
 
 """
+RhythmEnv Visual Explorer β€” Life Simulator v2
 Run: python ui/app.py
 """

 import sys
 import os
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import matplotlib.patches as mpatches
+
 import gradio as gr
 from server.rhythm_environment import (
     RhythmEnvironment, MAX_STEPS, METERS, ACTION_EFFECTS, PROFILES
 )
 from models import RhythmAction, ActionType

+SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
+SLOT_ICONS = ["πŸŒ…", "β˜€οΈ", "πŸŒ†", "πŸŒ™"]
+DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
 PROFILE_NAMES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic", "random"]
 ACTION_NAMES = [at.value.upper() for at in ActionType]

+METER_COLORS = {
+    "vitality": "#3b82f6",
+    "cognition": "#8b5cf6",
+    "progress": "#22c55e",
+    "serenity": "#14b8a6",
+    "connection": "#f97316",
+}
+
 # ---------------------------------------------------------------------------
+# Global session state
 # ---------------------------------------------------------------------------

+_env = None
+_last_obs = None
+_step_log = []
+_meter_history = []    # list of {meter: value} per step
+_completed_slots = []  # (day, slot) pairs already acted on
+
+def get_env():
+    global _env
+    if _env is None:
+        _env = RhythmEnvironment()
+    return _env
+
+# ---------------------------------------------------------------------------
+# HTML β€” colored meter bars
+# ---------------------------------------------------------------------------

+def _bar_color(v: float) -> str:
+    if v < 0.20:
+        return "#ef4444"
+    if v < 0.40:
+        return "#f59e0b"
+    return "#22c55e"

+def format_meters_html(obs) -> str:
+    day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day+1}"
     slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
+    event_bit = (
+        f'<span style="color:#f59e0b;margin-left:8px">⚑ {obs.active_event}</span>'
+        if obs.active_event else ""
     )

+    html = f"""
+    <div style="background:#f9fafb;border-radius:10px;padding:14px 16px;font-family:monospace">
+      <div style="font-size:13px;color:#6b7280;margin-bottom:10px">
+        πŸ“… <b>{day_name} {slot_name}</b>
+        &nbsp;Β·&nbsp; Step {obs.timestep}/{MAX_STEPS}
+        &nbsp;Β·&nbsp; {obs.remaining_steps} steps left
+        {event_bit}
+      </div>
+    """
+
+    for meter in METERS:
+        val = getattr(obs, meter)
+        pct = int(val * 100)
+        color = _bar_color(val)
+        dot = METER_COLORS[meter]
+        html += f"""
+    <div style="display:flex;align-items:center;gap:8px;margin:5px 0">
+      <span style="width:10px;height:10px;border-radius:50%;background:{dot};display:inline-block;flex-shrink:0"></span>
+      <span style="width:80px;font-size:12px;color:#374151">{meter.capitalize()}</span>
+      <div style="flex:1;background:#e5e7eb;border-radius:6px;height:16px;overflow:hidden;max-width:260px">
+        <div style="width:{pct}%;background:{color};height:16px;border-radius:6px;transition:width 0.25s"></div>
+      </div>
+      <span style="width:36px;font-size:12px;color:#374151;text-align:right">{val:.2f}</span>
+    </div>"""
+
+    html += "\n </div>"
+    return html

+# ---------------------------------------------------------------------------
+# HTML β€” week calendar grid
+# ---------------------------------------------------------------------------

+def format_week_grid(obs) -> str:
+    html = """
+    <div style="background:#f9fafb;border-radius:10px;padding:12px 16px;font-family:monospace;margin-top:8px">
+      <div style="font-size:12px;color:#6b7280;margin-bottom:8px">Week Progress</div>
+      <table style="border-collapse:separate;border-spacing:3px;width:100%">
+        <tr>
+          <td style="width:24px"></td>"""
+
+    for day in DAY_NAMES:
+        html += f'<td style="text-align:center;font-size:11px;color:#9ca3af;padding:1px 3px">{day}</td>'
+    html += "</tr>"
+
+    current_step = obs.timestep  # 0-based: next step to take
+    # timestep goes 0β†’27; obs.timestep is the step about to be taken
+    # slots completed = those < current_step
+    for slot_idx, icon in enumerate(SLOT_ICONS):
+        html += f'<tr><td style="font-size:12px;text-align:center">{icon}</td>'
+        for day_idx in range(7):
+            step_num = day_idx * 4 + slot_idx
+            if step_num < current_step:
+                cell = "βœ…"
+                bg = "#d1fae5"
+            elif step_num == current_step and not obs.done:
+                cell = "πŸ”΅"
+                bg = "#dbeafe"
+            else:
+                cell = "Β·"
+                bg = "transparent"
+            html += f'<td style="text-align:center;background:{bg};border-radius:3px;padding:1px 3px;font-size:13px">{cell}</td>'
+        html += "</tr>"
+
+    html += "</table></div>"
+    return html


 # ---------------------------------------------------------------------------
+# Matplotlib β€” meter trajectory chart
 # ---------------------------------------------------------------------------

+def make_chart(history: list) -> plt.Figure:
+    fig, ax = plt.subplots(figsize=(7, 3.5))
+    fig.patch.set_facecolor("#f9fafb")
+    ax.set_facecolor("#f9fafb")
+
+    if history:
+        steps = list(range(len(history)))
+        for meter, color in METER_COLORS.items():
+            vals = [h[meter] for h in history]
+            ax.plot(steps, vals, color=color, linewidth=2.0, label=meter.capitalize(), solid_capstyle="round")
+        ax.axhline(y=0.20, color="#ef4444", linestyle="--", linewidth=0.8, alpha=0.4)
+
+    patches = [mpatches.Patch(color=c, label=m.capitalize()) for m, c in METER_COLORS.items()]
+    ax.legend(handles=patches, loc="upper right", fontsize=8, ncol=2,
+              framealpha=0.7, edgecolor="#e5e7eb")
+    ax.set_xlim(0, MAX_STEPS)
+    ax.set_ylim(-0.02, 1.08)
+    ax.set_xlabel("Step (1 step = 1 time slot)", fontsize=9, color="#6b7280")
+    ax.set_ylabel("Meter value", fontsize=9, color="#6b7280")
+    ax.set_title("Life Meters Over the Week", fontsize=11, color="#374151", pad=8)
+    ax.tick_params(labelsize=8, colors="#9ca3af")
+    for spine in ax.spines.values():
+        spine.set_edgecolor("#e5e7eb")
+    ax.grid(True, alpha=0.3, color="#d1d5db")
+    plt.tight_layout(pad=1.2)
+    return fig


+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _snap(obs):
+    return {m: getattr(obs, m) for m in METERS}
+
+def _step_line(obs, action_name: str) -> str:
+    sign = "+" if obs.reward >= 0 else ""
+    day = DAY_NAMES[obs.day] if obs.day < 7 else f"D{obs.day}"
+    slot = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"S{obs.slot}"
+    line = f"Step {obs.timestep:>2} [{day} {slot}] {action_name:<15} {sign}{obs.reward:.3f}"
+    if obs.active_event:
+        line += f" ⚑{obs.active_event}"
+    return line


 # ---------------------------------------------------------------------------
+# Tab 1 callbacks
 # ---------------------------------------------------------------------------

+OUTPUTS_COUNT = 5  # meters_html, week_grid, chart, log, score
+
 def reset_episode(profile_name: str, seed_str: str):
+    global _last_obs, _step_log, _meter_history

     try:
         seed = int(seed_str.strip()) if seed_str.strip() else 42
     except ValueError:
         seed = 42

     env = get_env()
+    _last_obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
+
+    _step_log = [f"β–Ά Profile: {env._profile['name']} | Seed: {seed} | 28 steps to go"]
+    _meter_history = [_snap(_last_obs)]

+    return (
+        format_meters_html(_last_obs),
+        format_week_grid(_last_obs),
+        make_chart(_meter_history),
+        "\n".join(_step_log),
+        "β€”",
     )


 def take_action(action_str: str):
+    global _last_obs, _step_log, _meter_history

     if _last_obs is None:
+        return "⚠️ Reset the episode first.", "", make_chart([]), "β€”", "β€”"
     if _last_obs.done:
+        return (
+            format_meters_html(_last_obs),
+            format_week_grid(_last_obs),
+            make_chart(_meter_history),
+            "\n".join(_step_log[-22:]),
+            "Episode done β€” press Reset to play again.",
+        )

+    env = get_env()
+    obs = env.step(RhythmAction(action_type=ActionType(action_str.lower())))
     _last_obs = obs
+    _meter_history.append(_snap(obs))
+    _step_log.append(_step_line(obs, action_str))

     if obs.done:
         final = obs.reward_breakdown.get("final_score", 0.0)
+        _step_log.append("─" * 52)
+        _step_log.append(f"βœ… Final score: {final:.4f}")

+    score = (
         f"Final: {obs.reward_breakdown.get('final_score', 0.0):.4f}"
         if obs.done else f"Step reward: {obs.reward:+.4f}"
     )
     return (
+        format_meters_html(obs),
+        format_week_grid(obs),
+        make_chart(_meter_history),
+        "\n".join(_step_log[-22:]),
+        score,
     )

+def _run_auto(profile_name: str, seed_str: str, strategy: str):
+    global _last_obs, _step_log, _meter_history

     import random as _random
+    from training.inference_eval import heuristic_action

     try:
         seed = int(seed_str.strip()) if seed_str.strip() else 42
     except ValueError:
         seed = 42

     rng = _random.Random(seed + 999)
     all_actions = list(ActionType)
     env = get_env()
+
     obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
+    _last_obs = obs
+    _step_log = [f"β–Ά Auto-run ({strategy}) | Profile: {env._profile['name']} | Seed: {seed}"]
+    _meter_history = [_snap(obs)]

     while not obs.done:
+        action_type = heuristic_action(obs) if strategy == "heuristic" else rng.choice(all_actions)
         obs = env.step(RhythmAction(action_type=action_type))
         _last_obs = obs
+        _meter_history.append(_snap(obs))
+        _step_log.append(_step_line(obs, action_type.value.upper()))

     final = obs.reward_breakdown.get("final_score", 0.0)
+    _step_log += ["─" * 52, f"βœ… Final score: {final:.4f}"]

     return (
+        format_meters_html(obs),
+        format_week_grid(obs),
+        make_chart(_meter_history),
+        "\n".join(_step_log[-25:]),
+        f"Final: {final:.4f}",
     )


+def run_heuristic(p, s): return _run_auto(p, s, "heuristic")
+def run_random(p, s): return _run_auto(p, s, "random")

+# ---------------------------------------------------------------------------
+# Reference tab helpers
+# ---------------------------------------------------------------------------
+
+def show_action_effects() -> str:
+    header = f"{'Action':<15}" + "".join(f" {m[:3]:>6}" for m in METERS)
+    lines = [header, "─" * 52]
+    for action, effects in ACTION_EFFECTS.items():
+        row = f"{action:<15}"
+        for m in METERS:
+            row += f" {effects[m]:>+6.2f}"
+        lines.append(row)
+    return "\n".join(lines)
+
+
+def show_profiles() -> str:
+    lines = []
+    for p in PROFILES:
+        weights = p["reward_weights"]
+        lines += [f"\n{'═'*52}", f" {p['name'].upper()}", f"{'═'*52}"]
+        lines.append(" Reward weights (hidden from agent):")
+        for m, w in weights.items():
+            bar = "β–ˆ" * int(w * 20)
+            lines.append(f" {m:<12} {bar:<20} {w:.0%}")
+        lines.append("\n Key hidden modifiers:")
+        if p.get("morning_cognition_bonus"):
+            lines.append(f" β€’ Morning: cognition/progress Γ—{p['morning_cognition_bonus']} (peak window)")
+        if p.get("evening_night_cognition_bonus"):
+            lines.append(f" β€’ Evening/Night: cognition/progress Γ—{p['evening_night_cognition_bonus']} (peak zone)")
+        if p.get("morning_penalty"):
+            lines.append(f" β€’ Morning: cognition/progress Γ—{p['morning_penalty']} (groggy zone)")
+        sv = p.get("social_vitality_multiplier", 1.0)
+        if sv != 1.0:
+            lines.append(f" β€’ Social vitality drain Γ—{sv}")
+        if p.get("binge_shame"):
+            lines.append(f" β€’ Binge watch: shame spiral βˆ’0.15 serenity")
+        if p.get("progress_serenity_bonus"):
+            lines.append(f" β€’ Work gives serenity +{p['progress_serenity_bonus']} (meaning)")
+        if p.get("idle_serenity_decay"):
+            lines.append(f" β€’ Idle drains serenity βˆ’{p['idle_serenity_decay']} (guilt)")
+        if p.get("work_vitality_recovery"):
+            lines.append(f" β€’ Work recovers vitality +{p['work_vitality_recovery']} (energized)")
+        if p.get("solo_serenity_bonus"):
+            lines.append(f" β€’ Solo time gives serenity +{p['solo_serenity_bonus']} (recharge)")
+        scm = p.get("social_connection_multiplier", 1.0)
+        if scm != 1.0:
+            lines.append(f" β€’ Social connection Γ—{scm}")
+        lines.append(f" β€’ Connection passive decay: βˆ’{p['connection_decay_rate']}/step")
+    return "\n".join(lines)


 # ---------------------------------------------------------------------------

 with gr.Blocks(title="RhythmEnv β€” Life Simulator") as demo:

     gr.Markdown(
         "# RhythmEnv β€” Life Simulator\n"
+        "**Can a lightweight AI learn who you are β€” without being told?**\n\n"
         "Balance 5 life meters across a 7-day week. "
+        "A hidden personality profile secretly changes how every action affects you. "
+        "The agent must infer who you are from reward signals alone."
     )

     with gr.Tabs():

+        # ── Tab 1: Play ───────────────────────────────────────────────────────
+        with gr.TabItem("β–Ά Play"):

             with gr.Row():
                 profile_dd = gr.Dropdown(
+                    choices=PROFILE_NAMES, value="introvert_morning",
+                    label="Hidden Profile (visible here for demo β€” agent cannot see this)",
+                    scale=3,
                 )
+                seed_in = gr.Textbox(label="Seed", value="42", scale=1)
+                reset_btn = gr.Button("⟳ Reset", variant="primary", scale=1)

             gr.Markdown(
+                "| Profile | Core trait | What the agent must discover |\n"
                 "|---|---|---|\n"
+                "| `introvert_morning` | Recharges alone, peaks at dawn |"
+                " Social drain Γ—3 Β· Morning deep work gives Γ—2 progress |\n"
+                "| `extrovert_night_owl` | Energised by people, peaks at night |"
+                " Morning is a penalty zone Β· Social gives Γ—2 connection |\n"
+                "| `workaholic_stoic` | Finds meaning in output, resilient |"
+                " Idle time drains serenity Β· Work recovers vitality |"
             )

+            with gr.Row():
+                with gr.Column(scale=2):
+                    meters_html = gr.HTML()
+                    week_grid_html = gr.HTML()
+                    score_display = gr.Textbox(label="Score", interactive=False, lines=1)

+                with gr.Column(scale=3):
+                    chart_display = gr.Plot(label="Meter Trajectories")

             with gr.Row():
                 action_dd = gr.Dropdown(
+                    choices=ACTION_NAMES, value="DEEP_WORK",
+                    label="Choose action", scale=4,
                 )
+                step_btn = gr.Button("β–Ά Take Step", variant="primary", scale=1)

             with gr.Row():
+                heuristic_btn = gr.Button("β–Άβ–Ά Full Episode β€” Heuristic Baseline")
+                random_btn = gr.Button("β–Άβ–Ά Full Episode β€” Random Baseline")

+            log_display = gr.Textbox(
+                label="Step Log (last 22 steps)",
+                lines=10, interactive=False,
             )

+        # ── Tab 2: Environment Reference ─────────────────────────────────────
+        with gr.TabItem("πŸ“‹ Environment Reference"):
             gr.Markdown("### Action Effect Matrix")
             gr.Markdown(
                 "Base delta per action on each meter. "
+                "Profile modifiers and time-of-day multipliers are applied on top β€” invisibly."
             )
+            gr.Textbox(value=show_action_effects(), lines=14, interactive=False, label="")

             gr.Markdown("### Hidden Personality Profiles")
             gr.Markdown(
                 "It must infer the active profile through reward patterns β€” "
                 "the core learning challenge of RhythmEnv."
             )
+            gr.Textbox(value=show_profiles(), lines=55, interactive=False, label="")
+
+    # ── Wire up ──────────────────────────────────────────────────────────────
+    _out = [meters_html, week_grid_html, chart_display, log_display, score_display]
+
+    reset_btn.click(reset_episode, inputs=[profile_dd, seed_in], outputs=_out)
+    step_btn.click(take_action, inputs=[action_dd], outputs=_out)
+    heuristic_btn.click(run_heuristic, inputs=[profile_dd, seed_in], outputs=_out)
+    random_btn.click(run_random, inputs=[profile_dd, seed_in], outputs=_out)


 if __name__ == "__main__":
+    demo.launch(server_port=7862, share=False)