InosLihka Claude Sonnet 4.6 committed on
Commit 5fbafee · 1 Parent(s): 2c6ee11

refactor: rewrite blog around product vision; fix UI for Gradio 6


Blog now opens with the concrete product scenario (watch/calendar/sleep
tracker inputs, Accept-or-Ignore as rewards) and frames RhythmEnv as the
training curriculum for the inference skill the product needs.

UI: replace ASCII meter bars with HTML progress bars, add week calendar
grid and live matplotlib meter trajectory chart, merge configure+play
into one tab. Fix Gradio 6 incompatibilities (theme kwarg, show_copy_button).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2)
  1. docs/blog_post.md +53 -67
  2. ui/app.py +297 -243
docs/blog_post.md CHANGED
@@ -1,123 +1,109 @@
  # Teaching an AI to Know You (Without Asking)
 
- Ask someone how they'd build a personal AI assistant, and they'll say: give it a personality quiz. A preferences form. Maybe a settings page where you pick "introvert" or "morning person" from a dropdown.
 
- Sounds reasonable. But it's the wrong approach entirely.
 
- Think about the people who actually know you well — a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
 
- They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
 
- ## The problem with frontier models doing this
 
- A capable frontier model can already do decent personalized planning if you describe yourself in the prompt. Tell GPT-4 "I'm an introvert who peaks in the morning," and it'll give you reasonable advice. The problem is that approach doesn't scale:
 
- - You have to tell it who you are every single time
- - It can't observe your actual behavioral responses to recommendations
- - It runs in the cloud, costs per query, and can't be always-on or private
- - Most users can't accurately describe their own patterns anyway
-
- What we actually need is a small model — something that can run cheaply, frequently, and eventually on-device — that builds up a model of you from *how you respond*, not from what you say about yourself. That's the gap RhythmEnv is designed to train for.
 
- I work on AI at Microsoft. One thing I kept running into building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalized the drain. The workaholic who checks "I value work-life balance" because they know they should.
 
- Preference forms capture what people believe about themselves. Behavior reveals what's actually true.
 
- ## The Trait Decomposition
 
- Here's how I think about modeling humans. Start with **traits** — atomic behavioral properties that describe how a person responds to activities:
 
- How much does socializing physically drain you? When does your brain work best? Does leisure make you feel guilty, or recharge you? Does productivity give you calm, or just tire you out? How far can your stress drop before everything starts spiraling?
 
- No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, early cognitive peak, and strong solo recharge. An extrovert night owl has the opposite: low social cost, late cognitive peak, socializing actually gives them serenity.
 
- Same list of traits. Different values. Different person.
 
- ## The "Good Day" Definition
 
- Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
 
- In RhythmEnv, each person has hidden **reward weights** — a definition of what a good week means to them:
 
- - The introvert values **serenity** above all (60% weight). A week where they maintained inner peace and made some progress is a great week. Connection barely registers.
- - The extrovert values **connection** above all (75% weight). A week full of meaningful social interactions is a great week, even if they didn't get much work done.
- - The workaholic values **progress** above all (70% weight). Deep productive work is the whole point. Everything else is secondary.
 
- The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
 
- ## The Environment
 
- RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
 
- Five life meters track the person's state — picture them like fuel gauges on a dashboard:
 
- - **Vitality** — physical energy. Sleep and exercise fill it up. Work drains it.
  - **Cognition** — mental sharpness. Peaks in the morning for some, evening for others.
  - **Progress** — career momentum. Only goes up when you work.
  - **Serenity** — inner calm. Meditation helps. Overwork kills it.
- - **Connection** — relationship health. Decays passively every time slot. If you don't actively socialize, it quietly drops on its own.
-
- After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed — and the weights are different for every person type.
-
- ## Why Identical Actions Produce Different Results
-
- The trait modifiers change how actions physically affect the person, not just how rewards are computed.
-
- Tell the introvert to socialize: their vitality drops 3× faster than normal. Their body physically rejects it. Tell the extrovert the same thing: barely any drain. They could socialize all day.
-
- Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
 
- Tell the workaholic to do deep work: they recover +0.06 vitality — productive work energizes them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, completely different physiological response.
 
- These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty — these are things people recognize in themselves but rarely articulate.
 
- ## What the Agent Must Figure Out
 
- The agent sees meters, time of day, and reward. It doesn't see which profile is active, the trait values, or how the reward is being computed.
 
- After a few actions, the patterns start showing. "I socialized and my vitality crashed — this person drains from socializing." "I meditated and got a huge reward — serenity must be heavily weighted." "Deep work in the morning gave double progress — this person peaks early."
 
- The trained agent should probe early, infer the person type by observing unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialize less. One that discovers a workaholic should maximize productive hours and cut idle time.
 
- This is the skill we're training: *behavioral inference under partial observability*. Detect the hidden pattern from how the environment responds to your actions, then plan accordingly.
 
- ## Why Simulation Is the Right Starting Point
 
- Everything in RhythmEnv is simulated — the person doesn't exist, the meters aren't biometric readings, the profiles are synthetic. That's intentional, and it's not a limitation.
 
- Robotics RL trains in simulation first. The simulator is the curriculum, not the deployment target. The skill the model learns — detecting behavioral signatures from differential responses to the same action — is real and transferable. In a production version, the "meters" become observable proxies: calendar acceptance patterns, response latency after social-heavy days, a simple end-of-day wellness rating. The agent that learns to infer hidden reward weights from simulation learns the *structure* of the problem. The specific medium can change.
 
- ## The Training Pipeline
 
- Training uses GRPO — Group Relative Policy Optimization. For each game state, generate multiple candidate actions, score them all against the real environment, update the model to prefer the ones that scored higher. The environment *is* the critic.
 
- The per-step reward signal is strongly differentiated by profile. At the same starting state — Monday morning, all meters at 0.7 — the best action is completely different:
 
- | Profile | Best Action | Reward | Worst Action | Reward |
- |---------|-------------|--------|--------------|--------|
- | Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
- | Extrovert | FAMILY_TIME | +2.63 | ME_TIME | −0.42 |
- | Workaholic | DEEP_WORK | +1.57 | ME_TIME | −0.27 |
 
- The model is Qwen 2.5-3B with 4-bit quantization and LoRA — small enough to train on a free Colab T4, small enough to eventually run at the edge. The goal isn't matching GPT-4's general reasoning. It's teaching a lightweight model a specific skill it doesn't have out of the box: infer who you're helping from how they respond, not from what they tell you.
 
- ## What I'm Hoping To See
 
- The heuristic baseline — fixed rules, no profile adaptation, treats everyone the same — scores around 0.76–0.82. Sleep at night, work in the morning, socialize when connection drops. Reasonable advice for anyone.
 
- The trained agent should do better by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. Not just higher scores — genuinely differentiated action sequences per profile. That's the signal that inference is happening, not just pattern matching.
 
- The bigger goal is a learning curve that works in the other direction too. In a real product, the first few interactions are the model probing — making recommendations and observing how the user responds. After a handful of exchanges, it should have enough signal to know whether it's dealing with someone who needs serenity protected, or someone who needs to be pushed into more productive hours.
 
- No questionnaire. No settings page. Just attention, inference, and adjustment.
 
- That's what personal AI should actually feel like.
 
  ---
 
  **Links:**
  - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
  - [Training Notebook (Colab)](../training/RhythmEnv_GRPO_Training.ipynb)
- - [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
 
  *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
 
  # Teaching an AI to Know You (Without Asking)
 
+ Imagine this. It's 2pm. You had deep work blocked on your calendar. Your AI assistant sends you a nudge:
 
+ > *"I know you planned Deep Work now, but your focus metrics just dropped below 20%. If you push through, you'll likely spend 3 hours on something that would take 1 hour at peak. Take a 20-minute rest first — I'll remind you when your window opens."*
 
+ You tap Accept or Ignore. Either way, the agent just learned something about you.
 
+ That's the product vision. But there's a problem nobody has solved cleanly: how does the AI know that rest-then-work is the right call *for you specifically*, and not just generically good advice?
 
+ ## The gap that everyone papers over
 
+ Most AI assistants give the same advice to everyone. They know best practices — sleep enough, work in the morning, don't skip exercise. That's useful only if you happen to be average, and nobody is.
 
+ The people who give you genuinely good advice about your life have learned you over time. A great EA, a close friend, a good coach — none of them sat you down with a questionnaire. They watched how you responded to things. They noticed that you're wrecked after back-to-back meetings even when you say you're fine. That you do your sharpest thinking before anyone else is online. That skipping one workout makes you irritable by Wednesday.
 
+ I work on AI at Microsoft. One thing I kept running into building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalised the drain. The workaholic who checks "I value work-life balance" because they know they should.
 
+ Preference forms capture what people believe about themselves. Behaviour reveals what's actually true.
 
+ ## The real-world input problem
 
+ You wouldn't manually type "I am at 40% energy." That's a chore nobody does.
 
+ The real input comes from devices you already carry. Your watch sends resting heart rate and HRV — that's Vitality and Serenity. Your calendar sends meeting density and deadline proximity — that's Progress pressure. Your sleep tracker sends last night's data — that's Cognition. Your phone knows whether you've been social or isolated.
 
+ The agent never asks how you feel. It reads what your devices already know.
 
+ And the reward signal? It comes from you, passively. Every time the agent makes a recommendation and you Accept or Ignore it, that choice is data. Accept means "yes, that was the right read." Ignore means "you got something wrong about me." Over hundreds of those micro-interactions, the agent builds a precise model of who you actually are — not the person you describe yourself to be.
 
+ ## The foundational problem: teaching the inference skill to a small model
 
+ Here's the hard part. A frontier model like GPT-4 can already do decent personalised planning if you describe yourself in the prompt. But that doesn't work at scale:
 
+ - You have to describe yourself every single session
+ - The model can't observe your actual responses to its recommendations
+ - It runs in the cloud, costs per query, and can't be always-on or private
+ - Most users can't accurately describe their own patterns anyway
 
+ What the real product needs is a small model — one that can run cheaply, close to you, eventually on-device — that builds up a model of you from *how you respond*, not from what you say about yourself.
 
+ That's the inference skill we're training. **RhythmEnv is the curriculum.**
 
+ ## How the training environment works
 
+ RhythmEnv simulates one week in a person's life — 7 days, 4 time slots each, 28 decisions. Each decision is an activity: deep work, exercise, sleep, meditation, family time, socialising. Ten options.
 
+ Five meters track the person's state:
 
+ - **Vitality** — physical energy. Sleep fills it. Work drains it.
  - **Cognition** — mental sharpness. Peaks in the morning for some, evening for others.
  - **Progress** — career momentum. Only goes up when you work.
  - **Serenity** — inner calm. Meditation helps. Overwork kills it.
+ - **Connection** — relationship health. Decays passively every time slot. Ignore it and it quietly drops.
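The meter dynamics read like a tiny state machine: apply the action's base effects, clamp to [0, 1], then apply the passive decays. A toy sketch of one step; the effect sizes and the decay constant here are invented filler in the spirit of the post, not the repo's actual numbers:

```python
# Toy stand-in for one RhythmEnv step. All numeric values are illustrative.
METERS = ["vitality", "cognition", "progress", "serenity", "connection"]

EFFECTS = {
    "SLEEP":     {"vitality": +0.20},
    "DEEP_WORK": {"vitality": -0.10, "cognition": -0.05, "progress": +0.15},
    "MEDITATE":  {"serenity": +0.15},
    "SOCIALIZE": {"vitality": -0.05, "connection": +0.20},
}
CONNECTION_DECAY = 0.02  # connection drops passively every time slot

def step(meters: dict, action: str) -> dict:
    """Apply the action's base effects, then the passive connection decay."""
    m = dict(meters)
    for k, dv in EFFECTS[action].items():
        m[k] = min(1.0, max(0.0, m[k] + dv))
    m["connection"] = max(0.0, m["connection"] - CONNECTION_DECAY)
    return m

# Monday morning, everything at 0.7, then one slot of deep work:
state = {k: 0.7 for k in METERS}
state = step(state, "DEEP_WORK")
```

Note that connection fell even though the action never touched it; that passive decay is what forces the agent to eventually spend a slot on people.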
 
 
 
 
 
 
 
 
 
 
+ Hidden underneath is a personality profile. The agent can't see it. It controls both *what the person values* (their hidden reward weights) and *how actions physically affect them* (their hidden trait modifiers).
 
+ Three profiles, wildly different hidden mechanics:
 
+ The **introvert morning person** values serenity above everything (60% of their score). Socialising drains their vitality three times faster than the base rate. Meditating gives them a bonus +0.10 serenity on top of the base effect. Deep work in the morning gives double progress. The agent discovers: *mornings are sacred, social events are costly, alone time heals.*
 
+ The **extrovert night owl** values connection above everything (75%). Socialising barely costs any vitality — they could do it all day. Deep work in the morning gives only 40% of expected output. The same work in the evening gives 1.8× output. The agent discovers: *protect the mornings for rest, do cognitive work at night, keep socialising high.*
 
+ The **workaholic stoic** values progress above everything (70%). Productive work actually *recovers* vitality for them — output is energising. Idle activities like leisure or passive rest drain their serenity — the guilt is real. The agent discovers: *keep working, rest only when vitality is critical, never let idle time accumulate.*
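The hidden score is a weighted sum of meter changes. A sketch of that computation: the headline weights (0.60 serenity, 0.75 connection, 0.70 progress) come from the profiles above, while the remaining weights are invented filler chosen only so each row sums to 1:

```python
# Hidden per-profile reward weights. Only the dominant weight in each row
# is from the post; the rest are hypothetical padding.
WEIGHTS = {
    "introvert_morning":   {"vitality": 0.10, "cognition": 0.10, "progress": 0.15,
                            "serenity": 0.60, "connection": 0.05},
    "extrovert_night_owl": {"vitality": 0.05, "cognition": 0.05, "progress": 0.05,
                            "serenity": 0.10, "connection": 0.75},
    "workaholic_stoic":    {"vitality": 0.10, "cognition": 0.10, "progress": 0.70,
                            "serenity": 0.05, "connection": 0.05},
}

def hidden_reward(profile: str, deltas: dict) -> float:
    """The scalar the agent sees: same meter deltas, different value per person."""
    w = WEIGHTS[profile]
    return sum(w[m] * dv for m, dv in deltas.items())

# A meditation-like step: serenity up, small vitality gain, passive connection decay.
delta = {"serenity": +0.20, "vitality": +0.05, "connection": -0.02}
```

With these weights the same meditation step scores highest for the introvert and lowest for the extrovert, which is exactly the asymmetry the agent has to detect.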
+ ## What the agent must figure out
 
+ The agent sees meters, time of day, and a reward signal. It doesn't see the profile, the trait values, or the reward weights.
 
+ Same action, same starting state — completely different reward depending on who you're helping:
 
+ | Profile | DEEP_WORK reward (step 1) |
+ |---|---|
+ | Workaholic | +1.57 |
+ | Introvert | +0.32 |
+ | Extrovert | −0.39 |
 
+ The extrovert gets a *negative* reward from deep work first thing — because it gives zero connection, and connection is 75% of their score.
 
+ A good agent should probe in the first few steps, read the unexpected meter changes, infer the hidden profile, and adapt its strategy for the rest of the week. This is the same skill the real product needs: detect who someone is from how they respond, not from what they tell you.
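One way to picture the probe-then-adapt loop: keep a belief over the three profiles and shift mass toward whichever profile best predicts the rewards you actually observe. The update rule below is a hypothetical illustration (not anything in the repo); the predicted values are the step-1 DEEP_WORK rewards from the table:

```python
import math

# Step-1 DEEP_WORK rewards per profile, as in the table.
PREDICTED = {"workaholic_stoic": 1.57, "introvert_morning": 0.32,
             "extrovert_night_owl": -0.39}

def update_beliefs(beliefs: dict, observed: float) -> dict:
    """Reweight each profile by how close its prediction is to what we saw."""
    scores = {p: b * math.exp(-abs(PREDICTED[p] - observed))
              for p, b in beliefs.items()}
    z = sum(scores.values())
    return {p: s / z for p, s in scores.items()}

beliefs = {p: 1 / 3 for p in PREDICTED}          # uniform prior over profiles
beliefs = update_beliefs(beliefs, observed=1.5)  # probe: tried DEEP_WORK, got +1.50
```

After a single probe the mass concentrates on the workaholic, which is the "infer the person in a handful of steps" behaviour the training is meant to produce.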
+ ## The training pipeline
 
+ We train using GRPO — Group Relative Policy Optimization. For each game state, generate multiple candidate actions, score them all against the real environment, update the model to prefer the higher-scoring ones. The environment is the critic.
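The group-relative part can be sketched in a few lines: the baseline is the group's own mean reward, so no learned value network is needed. A minimal illustration, with sample rewards echoing the magnitudes quoted in the post:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: center on the group mean, scale by the group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# Four candidate actions sampled at one state, each scored by the environment:
adv = group_advantages([1.76, 0.03, 0.85, -0.42])
```

Candidates above the group mean get positive advantage and are reinforced; the rest are pushed down. That is the whole critic.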
+ The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4. Small enough to eventually run at the edge.
 
+ The heuristic baseline — fixed rules, treats everyone the same — scores around 0.76–0.82. Sleep when vitality is low. Meditate when serenity drops. Socialise when connection falls. Reasonable advice for anyone. Optimal advice for no one.
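That baseline really is just a handful of fixed rules. A sketch of its shape; the threshold values here are assumptions, not the actual heuristic the repo imports from `training.inference_eval`:

```python
# Fixed-rule baseline: identical thresholds for every profile, no adaptation.
# Threshold values are illustrative guesses.
def heuristic_action(meters: dict) -> str:
    if meters["vitality"] < 0.3:
        return "SLEEP"       # sleep when vitality is low
    if meters["serenity"] < 0.4:
        return "MEDITATE"    # meditate when serenity drops
    if meters["connection"] < 0.4:
        return "SOCIALIZE"   # socialise when connection falls
    return "DEEP_WORK"       # otherwise, work
```

Because the thresholds never depend on who the person is, this policy can never exploit the 3× social drain or the morning bonus — which is exactly the headroom the trained agent is supposed to claim.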
89
 
90
+ A trained agent that discovers the hidden personality should beat the heuristic by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. That differentiation is the signal that real inference is happening.
91
 
92
+ ## Why simulation first
93
 
94
+ Everything here is simulated. The person doesn't exist. The meters aren't biometric readings. This is standard practice β€” robotics RL trains in simulation before deploying on hardware. The simulator is the curriculum, not the product.
95
 
96
+ The inference skill transfers. An agent that learns to detect "this person's vitality drops 3Γ— faster from social events than expected" from simulated reward signals learns the *structure* of the problem. When the medium changes β€” when vitality comes from HRV instead of a formula β€” the skill of reading differential responses still applies.
97
 
98
+ The Accept/Ignore loop in the real product is the same reward signal, made human. Every time you ignore a recommendation, you're telling the agent: "you read me wrong." Every Accept says: "that was right." Over enough interactions, the model converges on your hidden profile without you ever having to describe it.
99
 
100
+ No questionnaire. No settings page. Just devices watching, signals flowing, and an agent that gets better at knowing you every week.
101
 
102
  ---
103
 
104
  **Links:**
105
  - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
106
  - [Training Notebook (Colab)](../training/RhythmEnv_GRPO_Training.ipynb)
107
+ - [Source Code](https://huggingface.co/spaces/InosLihka/rhythm_env)
108
 
109
  *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
ui/app.py CHANGED
@@ -1,5 +1,5 @@
  """
- RhythmEnv Visual Explorer — Life Simulator
  Run: python ui/app.py
  """
 
@@ -7,127 +7,191 @@ import sys
  import os
  sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
  import gradio as gr
  from server.rhythm_environment import (
      RhythmEnvironment, MAX_STEPS, METERS, ACTION_EFFECTS, PROFILES
  )
  from models import RhythmAction, ActionType
 
- SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
- DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
  PROFILE_NAMES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic", "random"]
  ACTION_NAMES = [at.value.upper() for at in ActionType]
 
  # ---------------------------------------------------------------------------
- # Display helpers
  # ---------------------------------------------------------------------------
 
- def _meter_bar(value: float, label: str, width: int = 20) -> str:
-     filled = int(round(value * width))
-     bar = "█" * filled + "░" * (width - filled)
-     indicator = "🔴" if value < 0.2 else ("🟡" if value < 0.4 else "🟢")
-     return f"{indicator} {label:<12} [{bar}] {value:.2f}"
 
- def format_meters(obs) -> str:
-     day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day + 1}"
      slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
-     event_line = f"\n  ⚡ Event: {obs.active_event}" if obs.active_event else ""
-     header = (
-         f"📅 {day_name} {slot_name} | "
-         f"Step {obs.timestep}/{MAX_STEPS} | "
-         f"{obs.remaining_steps} steps left{event_line}"
      )
-     bars = [
-         _meter_bar(obs.vitality, "Vitality"),
-         _meter_bar(obs.cognition, "Cognition"),
-         _meter_bar(obs.progress, "Progress"),
-         _meter_bar(obs.serenity, "Serenity"),
-         _meter_bar(obs.connection, "Connection"),
-     ]
-     return header + "\n\n" + "\n".join(bars)
-
-
- def format_reward_breakdown(breakdown: dict) -> str:
-     if not breakdown:
-         return "—"
-     lines = []
-     for k, v in breakdown.items():
-         sign = "+" if v >= 0 else ""
-         lines.append(f"  {k:<25} {sign}{v:.4f}")
-     return "\n".join(lines)
 
- def show_action_effects() -> str:
-     header = f"{'Action':<15}" + "".join(f" {m[:3]:>6}" for m in METERS)
-     lines = [header, "─" * 52]
-     for action, effects in ACTION_EFFECTS.items():
-         row = f"{action:<15}"
-         for m in METERS:
-             row += f" {effects[m]:>+6.2f}"
-         lines.append(row)
-     return "\n".join(lines)
 
- def show_profiles() -> str:
-     lines = []
-     for p in PROFILES:
-         name = p["name"]
-         weights = p["reward_weights"]
-         lines += [f"\n{'═'*52}", f"  {name.upper()}", f"{'═'*52}"]
-         lines.append("  Reward weights (hidden from agent):")
-         for m, w in weights.items():
-             bar = "█" * int(w * 20)
-             lines.append(f"    {m:<12} {bar:<20} {w:.0%}")
-         lines.append("\n  Key hidden modifiers:")
-         if p.get("morning_cognition_bonus"):
-             lines.append(f"    • Morning: cognition/progress ×{p['morning_cognition_bonus']} (HV1 peak)")
-         if p.get("evening_night_cognition_bonus"):
-             lines.append(f"    • Evening/Night: cognition/progress ×{p['evening_night_cognition_bonus']} (HV1 peak)")
-         if p.get("morning_penalty"):
-             lines.append(f"    • Morning: cognition/progress ×{p['morning_penalty']} (HV1 penalty)")
-         sv = p.get("social_vitality_multiplier", 1.0)
-         if sv != 1.0:
-             lines.append(f"    • Social vitality drain ×{sv} (HV2 social cost)")
-         if p.get("binge_shame"):
-             lines.append(f"    • Binge watch: shame spiral −0.15 serenity −0.06 cognition (HV3)")
-         if p.get("progress_serenity_bonus"):
-             lines.append(f"    • Work gives serenity +{p['progress_serenity_bonus']} (HV3 meaning)")
-         if p.get("idle_serenity_decay"):
-             lines.append(f"    • Idle drains serenity −{p['idle_serenity_decay']} (HV3 guilt)")
-         if p.get("work_vitality_recovery"):
-             lines.append(f"    • Work recovers vitality +{p['work_vitality_recovery']} (HV2 energized)")
-         if p.get("solo_serenity_bonus"):
-             lines.append(f"    • Solo time (me_time/meditate) gives serenity +{p['solo_serenity_bonus']} (HV3 recharge)")
-         scm = p.get("social_connection_multiplier", 1.0)
-         if scm != 1.0:
-             lines.append(f"    • Social connection ×{scm} (HV2 amplifier)")
-         lines.append(f"    • Connection passive decay: −{p['connection_decay_rate']}/step")
-     return "\n".join(lines)
 
 
  # ---------------------------------------------------------------------------
- # Global env state (single-session)
  # ---------------------------------------------------------------------------
 
- _env = None
- _last_obs = None
- _step_log = []
 
 
- def get_env():
-     global _env
-     if _env is None:
-         _env = RhythmEnvironment()
-     return _env
 
 
  # ---------------------------------------------------------------------------
- # Tab 1 — Configure & Reset
  # ---------------------------------------------------------------------------
 
  def reset_episode(profile_name: str, seed_str: str):
-     global _last_obs, _step_log
 
      try:
          seed = int(seed_str.strip()) if seed_str.strip() else 42
@@ -135,126 +199,149 @@ def reset_episode(profile_name: str, seed_str: str):
          seed = 42
 
      env = get_env()
 
-     if profile_name == "random":
-         _last_obs = env.reset(seed=seed)
-     else:
-         _last_obs = env.reset(seed=seed, profile=profile_name)
-
-     _step_log = []
-     meters_text = format_meters(_last_obs)
-     log = (
-         f"Episode started.\n"
-         f"  Profile : {env._profile['name']}\n"
-         f"  Seed    : {seed}\n\n"
-         "Choose an action and press Take Step, or use an auto-run button."
      )
-     return meters_text, log, "—", "—", False
 
 
- # ---------------------------------------------------------------------------
- # Tab 2 — Run Episode
- # ---------------------------------------------------------------------------
-
  def take_action(action_str: str):
-     global _last_obs, _step_log
 
      if _last_obs is None:
-         return "⚠️ Go to Configure & Reset and start an episode first.", "—", "—", "—", False
      if _last_obs.done:
-         return "Episode is done. Reset to start a new one.", "—", "—", "—", True
 
-     env = get_env()
-     action_type = ActionType(action_str.lower())
-     obs = env.step(RhythmAction(action_type=action_type))
      _last_obs = obs
 
-     sign = "+" if obs.reward >= 0 else ""
-     step_line = (
-         f"Step {obs.timestep:>2} | {action_str:<15} | "
-         f"reward {sign}{obs.reward:.4f} | "
-         f"V:{obs.vitality:.2f} Co:{obs.cognition:.2f} "
-         f"P:{obs.progress:.2f} S:{obs.serenity:.2f} Cn:{obs.connection:.2f}"
-     )
-     if obs.active_event:
-         step_line += f" ⚡{obs.active_event}"
      if obs.done:
          final = obs.reward_breakdown.get("final_score", 0.0)
-         step_line += f"\n{'─'*72}\n✅ DONE — Final score: {final:.4f}"
-     _step_log.append(step_line)
 
-     score_text = (
          f"Final: {obs.reward_breakdown.get('final_score', 0.0):.4f}"
          if obs.done else f"Step reward: {obs.reward:+.4f}"
      )
      return (
-         format_meters(obs),
-         "\n".join(_step_log),
-         format_reward_breakdown(obs.reward_breakdown),
-         score_text,
-         obs.done,
      )
 
 
- def _run_auto_episode(profile_name: str, seed_str: str, strategy: str):
-     """Shared runner for heuristic and random auto-episodes."""
-     global _last_obs, _step_log
      import random as _random
 
      try:
          seed = int(seed_str.strip()) if seed_str.strip() else 42
      except ValueError:
          seed = 42
 
-     from training.inference_eval import heuristic_action
-
      rng = _random.Random(seed + 999)
      all_actions = list(ActionType)
-
      env = get_env()
      obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
-     _last_obs = obs
-     _step_log = []
-     logs = []
-     total_reward = 0.0
 
      while not obs.done:
-         if strategy == "heuristic":
-             action_type = heuristic_action(obs)
-         else:
-             action_type = rng.choice(all_actions)
          obs = env.step(RhythmAction(action_type=action_type))
          _last_obs = obs
-         total_reward += obs.reward
-         sign = "+" if obs.reward >= 0 else ""
-         line = (
-             f"Step {obs.timestep:>2} | {action_type.value.upper():<15} | "
-             f"reward {sign}{obs.reward:.4f} | "
-             f"V:{obs.vitality:.2f} Co:{obs.cognition:.2f} "
-             f"P:{obs.progress:.2f} S:{obs.serenity:.2f} Cn:{obs.connection:.2f}"
-         )
-         if obs.active_event:
-             line += f" ⚡{obs.active_event}"
-         logs.append(line)
 
      final = obs.reward_breakdown.get("final_score", 0.0)
-     logs += [f"{'─'*72}", f"✅ DONE — Final score: {final:.4f} | Total reward: {total_reward:.2f}"]
-     _step_log = logs
 
      return (
-         format_meters(obs),
-         "\n".join(logs),
-         format_reward_breakdown(obs.reward_breakdown),
-         f"Final: {final:.4f} | Total: {total_reward:.2f}",
      )
 
 
- def run_heuristic_episode(profile_name, seed_str):
-     return _run_auto_episode(profile_name, seed_str, "heuristic")
 
 
- def run_random_episode(profile_name, seed_str):
-     return _run_auto_episode(profile_name, seed_str, "random")
 
 
  # ---------------------------------------------------------------------------
@@ -265,108 +352,70 @@ with gr.Blocks(title="RhythmEnv β€” Life Simulator") as demo:
     gr.Markdown(
         "# RhythmEnv β€” Life Simulator\n"
-        "**Can an LLM discover your personality through experience?**\n\n"
         "Balance 5 life meters across a 7-day week. "
-        "Hidden personality profiles secretly control how actions affect you β€” "
-        "the agent must discover those patterns through reward signals alone."
     )

     with gr.Tabs():

-        # ── Tab 1: Configure & Reset ──────────────────────────────────────
-        with gr.TabItem("Configure & Reset"):
-            gr.Markdown("### Start a new episode")

             with gr.Row():
                 profile_dd = gr.Dropdown(
-                    choices=PROFILE_NAMES,
-                    value="introvert_morning",
-                    label="Hidden Profile (normally agent cannot see this β€” demo only)",
                 )
-                seed_in = gr.Textbox(label="Seed", value="42", scale=1)
-
-            reset_btn = gr.Button("β–Ά Reset Episode", variant="primary")

-            gr.Markdown("---")
             gr.Markdown(
-                "| Profile | Core Trait | What agent must discover |\n"
                 "|---|---|---|\n"
-                "| `introvert_morning` | Recharges alone, peaks at dawn | "
-                "Social drain is Γ—3; morning deep work gives Γ—2 progress |\n"
-                "| `extrovert_night_owl` | Energised by people, peaks at night | "
-                "Morning is a penalty zone; social gives Γ—2 connection |\n"
-                "| `workaholic_stoic` | Finds meaning in output, resilient | "
-                "Idle time drains serenity; work actually recovers vitality |"
             )

-            meters_out = gr.Textbox(label="Current State", lines=9, interactive=False)
-            log_out = gr.Textbox(label="Log", lines=5, interactive=False)
-            breakdown_t1 = gr.Textbox(label="Last Reward Breakdown", lines=6, interactive=False)
-            score_t1 = gr.Textbox(label="Score", interactive=False)
-            done_flag = gr.Checkbox(visible=False)
-
-            reset_btn.click(
-                reset_episode,
-                inputs=[profile_dd, seed_in],
-                outputs=[meters_out, log_out, breakdown_t1, score_t1, done_flag],
-            )

-        # ── Tab 2: Run Episode ────────────────────────────────────────────
-        with gr.TabItem("Run Episode"):
-            gr.Markdown("### Manual control β€” or use the auto-run buttons for a full episode")

             with gr.Row():
                 action_dd = gr.Dropdown(
-                    choices=ACTION_NAMES,
-                    value="DEEP_WORK",
-                    label="Action",
                 )
-                step_btn = gr.Button("β–Ά Take Step", variant="primary", scale=1)

             with gr.Row():
-                profile_dd2 = gr.Dropdown(
-                    choices=PROFILE_NAMES,
-                    value="introvert_morning",
-                    label="Profile (for auto-run)",
-                )
-                seed_in2 = gr.Textbox(label="Seed (for auto-run)", value="42", scale=1)

-            with gr.Row():
-                heuristic_btn = gr.Button("β–Άβ–Ά Full Episode β€” Heuristic Baseline")
-                random_btn = gr.Button("β–Άβ–Ά Full Episode β€” Random Baseline")
-
-            meters_display = gr.Textbox(label="Meters", lines=9, interactive=False)
-            score_display = gr.Textbox(label="Step Reward / Final Score", interactive=False)
-            log_display = gr.Textbox(label="Step Log", lines=20, interactive=False)
-            breakdown_display = gr.Textbox(label="Last Reward Breakdown", lines=8, interactive=False)
-
-            step_btn.click(
-                take_action,
-                inputs=[action_dd],
-                outputs=[meters_display, log_display, breakdown_display, score_display, done_flag],
-            )
-            heuristic_btn.click(
-                run_heuristic_episode,
-                inputs=[profile_dd2, seed_in2],
-                outputs=[meters_display, log_display, breakdown_display, score_display],
-            )
-            random_btn.click(
-                run_random_episode,
-                inputs=[profile_dd2, seed_in2],
-                outputs=[meters_display, log_display, breakdown_display, score_display],
             )

-        # ── Tab 3: Profiles & Hidden Variables ───────────────────────────
-        with gr.TabItem("Profiles & Hidden Variables"):
             gr.Markdown("### Action Effect Matrix")
             gr.Markdown(
                 "Base delta per action on each meter. "
-                "Profile modifiers are applied on top β€” invisibly."
-            )
-            gr.Textbox(
-                value=show_action_effects(),
-                lines=14, interactive=False, label="",
             )

             gr.Markdown("### Hidden Personality Profiles")
             gr.Markdown(
@@ -374,11 +423,16 @@ with gr.Blocks(title="RhythmEnv β€” Life Simulator") as demo:
                 "It must infer the active profile through reward patterns β€” "
                 "the core learning challenge of RhythmEnv."
             )
-            gr.Textbox(
-                value=show_profiles(),
-                lines=55, interactive=False, label="",
-            )


 if __name__ == "__main__":
-    demo.launch(server_port=7862, share=False, theme=gr.themes.Monochrome())
 
 """
+RhythmEnv Visual Explorer β€” Life Simulator v2
 Run: python ui/app.py
 """

 import sys
 import os
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import matplotlib.patches as mpatches
+
 import gradio as gr
 from server.rhythm_environment import (
     RhythmEnvironment, MAX_STEPS, METERS, ACTION_EFFECTS, PROFILES
 )
 from models import RhythmAction, ActionType

+SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
+SLOT_ICONS = ["πŸŒ…", "β˜€οΈ", "πŸŒ†", "πŸŒ™"]
+DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
 PROFILE_NAMES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic", "random"]
 ACTION_NAMES = [at.value.upper() for at in ActionType]

+METER_COLORS = {
+    "vitality": "#3b82f6",
+    "cognition": "#8b5cf6",
+    "progress": "#22c55e",
+    "serenity": "#14b8a6",
+    "connection": "#f97316",
+}
+
 # ---------------------------------------------------------------------------
+# Global session state
 # ---------------------------------------------------------------------------

+_env = None
+_last_obs = None
+_step_log = []
+_meter_history = []    # list of {meter: value} per step
+_completed_slots = []  # (day, slot) pairs already acted on
+
+def get_env():
+    global _env
+    if _env is None:
+        _env = RhythmEnvironment()
+    return _env
+
+# ---------------------------------------------------------------------------
+# HTML β€” colored meter bars
+# ---------------------------------------------------------------------------

+def _bar_color(v: float) -> str:
+    if v < 0.20:
+        return "#ef4444"
+    if v < 0.40:
+        return "#f59e0b"
+    return "#22c55e"

+def format_meters_html(obs) -> str:
+    day_name = DAY_NAMES[obs.day] if obs.day < 7 else f"Day {obs.day+1}"
     slot_name = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"Slot {obs.slot}"
+    event_bit = (
+        f'<span style="color:#f59e0b;margin-left:8px">⚑ {obs.active_event}</span>'
+        if obs.active_event else ""
     )

+    html = f"""
+    <div style="background:#f9fafb;border-radius:10px;padding:14px 16px;font-family:monospace">
+      <div style="font-size:13px;color:#6b7280;margin-bottom:10px">
+        πŸ“… <b>{day_name} {slot_name}</b>
+        &nbsp;Β·&nbsp; Step {obs.timestep}/{MAX_STEPS}
+        &nbsp;Β·&nbsp; {obs.remaining_steps} steps left
+        {event_bit}
+      </div>
+    """
+
+    for meter in METERS:
+        val = getattr(obs, meter)
+        pct = int(val * 100)
+        color = _bar_color(val)
+        dot = METER_COLORS[meter]
+        html += f"""
+    <div style="display:flex;align-items:center;gap:8px;margin:5px 0">
+      <span style="width:10px;height:10px;border-radius:50%;background:{dot};display:inline-block;flex-shrink:0"></span>
+      <span style="width:80px;font-size:12px;color:#374151">{meter.capitalize()}</span>
+      <div style="flex:1;background:#e5e7eb;border-radius:6px;height:16px;overflow:hidden;max-width:260px">
+        <div style="width:{pct}%;background:{color};height:16px;border-radius:6px;transition:width 0.25s"></div>
+      </div>
+      <span style="width:36px;font-size:12px;color:#374151;text-align:right">{val:.2f}</span>
+    </div>"""
+
+    html += "\n </div>"
+    return html

+# ---------------------------------------------------------------------------
+# HTML β€” week calendar grid
+# ---------------------------------------------------------------------------

+def format_week_grid(obs) -> str:
+    html = """
+    <div style="background:#f9fafb;border-radius:10px;padding:12px 16px;font-family:monospace;margin-top:8px">
+      <div style="font-size:12px;color:#6b7280;margin-bottom:8px">Week Progress</div>
+      <table style="border-collapse:separate;border-spacing:3px;width:100%">
+        <tr>
+          <td style="width:24px"></td>"""
+
+    for day in DAY_NAMES:
+        html += f'<td style="text-align:center;font-size:11px;color:#9ca3af;padding:1px 3px">{day}</td>'
+    html += "</tr>"
+
+    current_step = obs.timestep  # 0-based: next step to take
+    # timestep goes 0β†’27; obs.timestep is the step about to be taken
+    # slots completed = those < current_step
+    for slot_idx, icon in enumerate(SLOT_ICONS):
+        html += f'<tr><td style="font-size:12px;text-align:center">{icon}</td>'
+        for day_idx in range(7):
+            step_num = day_idx * 4 + slot_idx
+            if step_num < current_step:
+                cell = "βœ…"
+                bg = "#d1fae5"
+            elif step_num == current_step and not obs.done:
+                cell = "πŸ”΅"
+                bg = "#dbeafe"
+            else:
+                cell = "Β·"
+                bg = "transparent"
+            html += f'<td style="text-align:center;background:{bg};border-radius:3px;padding:1px 3px;font-size:13px">{cell}</td>'
+        html += "</tr>"
+
+    html += "</table></div>"
+    return html


 # ---------------------------------------------------------------------------
+# Matplotlib β€” meter trajectory chart
 # ---------------------------------------------------------------------------

+def make_chart(history: list) -> plt.Figure:
+    fig, ax = plt.subplots(figsize=(7, 3.5))
+    fig.patch.set_facecolor("#f9fafb")
+    ax.set_facecolor("#f9fafb")
+
+    if history:
+        steps = list(range(len(history)))
+        for meter, color in METER_COLORS.items():
+            vals = [h[meter] for h in history]
+            ax.plot(steps, vals, color=color, linewidth=2.0, label=meter.capitalize(), solid_capstyle="round")
+        ax.axhline(y=0.20, color="#ef4444", linestyle="--", linewidth=0.8, alpha=0.4)
+
+    patches = [mpatches.Patch(color=c, label=m.capitalize()) for m, c in METER_COLORS.items()]
+    ax.legend(handles=patches, loc="upper right", fontsize=8, ncol=2,
+              framealpha=0.7, edgecolor="#e5e7eb")
+    ax.set_xlim(0, MAX_STEPS)
+    ax.set_ylim(-0.02, 1.08)
+    ax.set_xlabel("Step (1 step = 1 time slot)", fontsize=9, color="#6b7280")
+    ax.set_ylabel("Meter value", fontsize=9, color="#6b7280")
+    ax.set_title("Life Meters Over the Week", fontsize=11, color="#374151", pad=8)
+    ax.tick_params(labelsize=8, colors="#9ca3af")
+    for spine in ax.spines.values():
+        spine.set_edgecolor("#e5e7eb")
+    ax.grid(True, alpha=0.3, color="#d1d5db")
+    plt.tight_layout(pad=1.2)
+    return fig


+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _snap(obs):
+    return {m: getattr(obs, m) for m in METERS}
+
+def _step_line(obs, action_name: str) -> str:
+    sign = "+" if obs.reward >= 0 else ""
+    day = DAY_NAMES[obs.day] if obs.day < 7 else f"D{obs.day}"
+    slot = SLOT_NAMES[obs.slot] if obs.slot < 4 else f"S{obs.slot}"
+    line = f"Step {obs.timestep:>2} [{day} {slot}] {action_name:<15} {sign}{obs.reward:.3f}"
+    if obs.active_event:
+        line += f" ⚑{obs.active_event}"
+    return line


 # ---------------------------------------------------------------------------
+# Tab 1 callbacks
 # ---------------------------------------------------------------------------

+OUTPUTS_COUNT = 5  # meters_html, week_grid, chart, log, score
+
 def reset_episode(profile_name: str, seed_str: str):
+    global _last_obs, _step_log, _meter_history

     try:
         seed = int(seed_str.strip()) if seed_str.strip() else 42
     except ValueError:
         seed = 42

     env = get_env()
+    _last_obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
+
+    _step_log = [f"β–Ά Profile: {env._profile['name']} | Seed: {seed} | 28 steps to go"]
+    _meter_history = [_snap(_last_obs)]

+    return (
+        format_meters_html(_last_obs),
+        format_week_grid(_last_obs),
+        make_chart(_meter_history),
+        "\n".join(_step_log),
+        "β€”",
     )


 def take_action(action_str: str):
+    global _last_obs, _step_log, _meter_history

     if _last_obs is None:
+        return "⚠️ Reset the episode first.", "", make_chart([]), "β€”", "β€”"
     if _last_obs.done:
+        return (
+            format_meters_html(_last_obs),
+            format_week_grid(_last_obs),
+            make_chart(_meter_history),
+            "\n".join(_step_log[-22:]),
+            "Episode done β€” press Reset to play again.",
+        )

+    env = get_env()
+    obs = env.step(RhythmAction(action_type=ActionType(action_str.lower())))
     _last_obs = obs
+    _meter_history.append(_snap(obs))
+    _step_log.append(_step_line(obs, action_str))

     if obs.done:
         final = obs.reward_breakdown.get("final_score", 0.0)
+        _step_log.append("─" * 52)
+        _step_log.append(f"βœ… Final score: {final:.4f}")

+    score = (
         f"Final: {obs.reward_breakdown.get('final_score', 0.0):.4f}"
         if obs.done else f"Step reward: {obs.reward:+.4f}"
     )
     return (
+        format_meters_html(obs),
+        format_week_grid(obs),
+        make_chart(_meter_history),
+        "\n".join(_step_log[-22:]),
+        score,
     )

+def _run_auto(profile_name: str, seed_str: str, strategy: str):
+    global _last_obs, _step_log, _meter_history

     import random as _random
+    from training.inference_eval import heuristic_action

     try:
         seed = int(seed_str.strip()) if seed_str.strip() else 42
     except ValueError:
         seed = 42

     rng = _random.Random(seed + 999)
     all_actions = list(ActionType)
     env = get_env()
+
     obs = env.reset(seed=seed) if profile_name == "random" else env.reset(seed=seed, profile=profile_name)
+    _last_obs = obs
+    _step_log = [f"β–Ά Auto-run ({strategy}) | Profile: {env._profile['name']} | Seed: {seed}"]
+    _meter_history = [_snap(obs)]

     while not obs.done:
+        action_type = heuristic_action(obs) if strategy == "heuristic" else rng.choice(all_actions)
         obs = env.step(RhythmAction(action_type=action_type))
         _last_obs = obs
+        _meter_history.append(_snap(obs))
+        _step_log.append(_step_line(obs, action_type.value.upper()))

     final = obs.reward_breakdown.get("final_score", 0.0)
+    _step_log += ["─" * 52, f"βœ… Final score: {final:.4f}"]

     return (
+        format_meters_html(obs),
+        format_week_grid(obs),
+        make_chart(_meter_history),
+        "\n".join(_step_log[-25:]),
+        f"Final: {final:.4f}",
     )


+def run_heuristic(p, s): return _run_auto(p, s, "heuristic")
+def run_random(p, s): return _run_auto(p, s, "random")

+# ---------------------------------------------------------------------------
+# Reference tab helpers
+# ---------------------------------------------------------------------------
+
+def show_action_effects() -> str:
+    header = f"{'Action':<15}" + "".join(f" {m[:3]:>6}" for m in METERS)
+    lines = [header, "─" * 52]
+    for action, effects in ACTION_EFFECTS.items():
+        row = f"{action:<15}"
+        for m in METERS:
+            row += f" {effects[m]:>+6.2f}"
+        lines.append(row)
+    return "\n".join(lines)
+
+
+def show_profiles() -> str:
+    lines = []
+    for p in PROFILES:
+        weights = p["reward_weights"]
+        lines += [f"\n{'═'*52}", f" {p['name'].upper()}", f"{'═'*52}"]
+        lines.append(" Reward weights (hidden from agent):")
+        for m, w in weights.items():
+            bar = "β–ˆ" * int(w * 20)
+            lines.append(f" {m:<12} {bar:<20} {w:.0%}")
+        lines.append("\n Key hidden modifiers:")
+        if p.get("morning_cognition_bonus"):
+            lines.append(f" β€’ Morning: cognition/progress Γ—{p['morning_cognition_bonus']} (peak window)")
+        if p.get("evening_night_cognition_bonus"):
+            lines.append(f" β€’ Evening/Night: cognition/progress Γ—{p['evening_night_cognition_bonus']} (peak zone)")
+        if p.get("morning_penalty"):
+            lines.append(f" β€’ Morning: cognition/progress Γ—{p['morning_penalty']} (groggy zone)")
+        sv = p.get("social_vitality_multiplier", 1.0)
+        if sv != 1.0:
+            lines.append(f" β€’ Social vitality drain Γ—{sv}")
+        if p.get("binge_shame"):
+            lines.append(f" β€’ Binge watch: shame spiral βˆ’0.15 serenity")
+        if p.get("progress_serenity_bonus"):
+            lines.append(f" β€’ Work gives serenity +{p['progress_serenity_bonus']} (meaning)")
+        if p.get("idle_serenity_decay"):
+            lines.append(f" β€’ Idle drains serenity βˆ’{p['idle_serenity_decay']} (guilt)")
+        if p.get("work_vitality_recovery"):
+            lines.append(f" β€’ Work recovers vitality +{p['work_vitality_recovery']} (energized)")
+        if p.get("solo_serenity_bonus"):
+            lines.append(f" β€’ Solo time gives serenity +{p['solo_serenity_bonus']} (recharge)")
+        scm = p.get("social_connection_multiplier", 1.0)
+        if scm != 1.0:
+            lines.append(f" β€’ Social connection Γ—{scm}")
+        lines.append(f" β€’ Connection passive decay: βˆ’{p['connection_decay_rate']}/step")
+    return "\n".join(lines)


 # ---------------------------------------------------------------------------

 with gr.Blocks(title="RhythmEnv β€” Life Simulator") as demo:

     gr.Markdown(
         "# RhythmEnv β€” Life Simulator\n"
+        "**Can a lightweight AI learn who you are β€” without being told?**\n\n"
         "Balance 5 life meters across a 7-day week. "
+        "A hidden personality profile secretly changes how every action affects you. "
+        "The agent must infer who you are from reward signals alone."
     )

     with gr.Tabs():

+        # ── Tab 1: Play ───────────────────────────────────────────────────────
+        with gr.TabItem("β–Ά Play"):

             with gr.Row():
                 profile_dd = gr.Dropdown(
+                    choices=PROFILE_NAMES, value="introvert_morning",
+                    label="Hidden Profile (visible here for demo β€” agent cannot see this)",
+                    scale=3,
                 )
+                seed_in = gr.Textbox(label="Seed", value="42", scale=1)
+                reset_btn = gr.Button("⟳ Reset", variant="primary", scale=1)

             gr.Markdown(
+                "| Profile | Core trait | What the agent must discover |\n"
                 "|---|---|---|\n"
+                "| `introvert_morning` | Recharges alone, peaks at dawn |"
+                " Social drain Γ—3 Β· Morning deep work gives Γ—2 progress |\n"
+                "| `extrovert_night_owl` | Energised by people, peaks at night |"
+                " Morning is a penalty zone Β· Social gives Γ—2 connection |\n"
+                "| `workaholic_stoic` | Finds meaning in output, resilient |"
+                " Idle time drains serenity Β· Work recovers vitality |"
             )

+            with gr.Row():
+                with gr.Column(scale=2):
+                    meters_html = gr.HTML()
+                    week_grid_html = gr.HTML()
+                    score_display = gr.Textbox(label="Score", interactive=False, lines=1)

+                with gr.Column(scale=3):
+                    chart_display = gr.Plot(label="Meter Trajectories")

             with gr.Row():
                 action_dd = gr.Dropdown(
+                    choices=ACTION_NAMES, value="DEEP_WORK",
+                    label="Choose action", scale=4,
                 )
+                step_btn = gr.Button("β–Ά Take Step", variant="primary", scale=1)

             with gr.Row():
+                heuristic_btn = gr.Button("β–Άβ–Ά Full Episode β€” Heuristic Baseline")
+                random_btn = gr.Button("β–Άβ–Ά Full Episode β€” Random Baseline")

+            log_display = gr.Textbox(
+                label="Step Log (last 22 steps)",
+                lines=10, interactive=False,
             )

+        # ── Tab 2: Environment Reference ─────────────────────────────────────
+        with gr.TabItem("πŸ“‹ Environment Reference"):
             gr.Markdown("### Action Effect Matrix")
             gr.Markdown(
                 "Base delta per action on each meter. "
+                "Profile modifiers and time-of-day multipliers are applied on top β€” invisibly."
             )
+            gr.Textbox(value=show_action_effects(), lines=14, interactive=False, label="")

             gr.Markdown("### Hidden Personality Profiles")
             gr.Markdown(
                 "It must infer the active profile through reward patterns β€” "
                 "the core learning challenge of RhythmEnv."
             )
+            gr.Textbox(value=show_profiles(), lines=55, interactive=False, label="")
+
+    # ── Wire up ──────────────────────────────────────────────────────────────
+    _out = [meters_html, week_grid_html, chart_display, log_display, score_display]
+
+    reset_btn.click(reset_episode, inputs=[profile_dd, seed_in], outputs=_out)
+    step_btn.click(take_action, inputs=[action_dd], outputs=_out)
+    heuristic_btn.click(run_heuristic, inputs=[profile_dd, seed_in], outputs=_out)
+    random_btn.click(run_random, inputs=[profile_dd, seed_in], outputs=_out)


 if __name__ == "__main__":
+    demo.launch(server_port=7862, share=False)