InosLihka, Claude Sonnet 4.6 committed on
Commit 26b1e6a · 1 Parent(s): 0bdfeaa

docs: expand blog with purpose, sim-to-real framing, lightweight model goal

Adds sections explaining why we target a small/lightweight model (always-on,
private, cheap vs frontier API cost), why simulation is valid as a curriculum
(sim-to-real is standard RL practice), and what the trained behavioral
inference skill looks like in a real product.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (1)
  1. docs/blog_post.md +55 -42
docs/blog_post.md CHANGED
@@ -4,25 +4,32 @@ Ask someone how they'd build a personal AI assistant, and they'll say: give it a
4
 
5
  Sounds reasonable. But it's the wrong approach entirely.
6
 
7
- Think about the people who actually know you well -- a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
8
 
9
  They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
10
 
11
  ## The Trait Decomposition
12
 
13
- Here's how I think about modeling humans. Start with **traits** -- atomic behavioral properties that describe how a person responds to activities:
14
 
15
- - **Social energy cost** -- how much does socializing drain you physically?
16
- - **Cognitive peak time** -- when does your brain work best? Morning? Evening?
17
- - **Guilt sensitivity** -- does leisure make you feel guilty, or recharge you?
18
- - **Work-peace coupling** -- does productivity give you calm, or just tire you out?
19
- - **Stress tolerance** -- how far can your serenity drop before everything spirals?
20
- - **Metabolic rate** -- how fast do you burn baseline energy?
21
- - **Event resilience** -- how much does unexpected chaos throw you off?
22
- - **Solo recharge** -- does alone time restore your inner peace?
23
- - **Social warmth** -- does socializing give you serenity, or drain it?
24
 
25
- No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, early cognitive peak, high guilt sensitivity, and strong solo recharge. An extrovert night owl has the opposite: low social cost, late cognitive peak, socializing actually gives them serenity.
26
 
27
  Same list of traits. Different values. Different person.
28
 
@@ -30,11 +37,11 @@ Same list of traits. Different values. Different person.
30
 
31
  Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
32
 
33
- In RhythmEnv, each person has hidden **reward weights** -- a definition of what a good day means to them:
34
 
35
- - The introvert values **serenity** above all (60% weight). A day where they maintained inner peace and made some progress is a great day. Connection barely registers (10%).
36
- - The extrovert values **connection** above all (75% weight). A day full of meaningful social interactions is a great day, even if they didn't get much work done.
37
- - The workaholic values **progress** above all (70% weight). A day of deep productive work is a great day. Everything else is secondary.
38
 
39
  The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
40
 
@@ -42,69 +49,75 @@ The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar rewar
42
 
43
  RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
44
 
45
- Five life meters track the person's state -- picture them like fuel gauges on a dashboard:
46
 
47
- - **Vitality** -- physical energy. Sleep and exercise fill it up. Work drains it.
48
- - **Cognition** -- mental sharpness. Peaks in the morning for some, in the evening for others.
49
- - **Progress** -- career momentum. Only goes up when you work.
50
- - **Serenity** -- inner calm. Meditation and downtime help. Overwork kills it.
51
- - **Connection** -- relationship health. Decays on its own every time slot. If you don't actively socialize, it quietly drops.
52
 
53
- After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed -- and the weights are different for every person type.
54
 
55
  ## Why Identical Actions Produce Different Results
56
 
57
- The trait modifiers change how actions physically affect the person, not just how rewards are computed. Here's what I mean:
58
 
59
- Tell the introvert to socialize: their vitality drops 3x faster than normal. Their body physically rejects it. Tell the extrovert the same thing: barely any drain. They could socialize all day.
60
 
61
  Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
62
 
63
- Tell the workaholic to do deep work: they get +0.06 vitality recovery -- productive work energizes them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, different physiological response.
64
 
65
- These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty -- these are things people recognize in themselves but rarely articulate.
66
 
67
  ## What the Agent Must Figure Out
68
 
69
- The agent sees meters, time of day, and reward. It doesn't see:
 
 
 
 
70
 
71
- - Which person type it's helping
72
- - The trait values
73
- - The reward weight decomposition
74
 
75
- After a few actions, the patterns start showing. "I socialized and my vitality crashed -- this person drains from socializing." "I meditated and got a huge reward -- serenity must be heavily weighted." "Deep work in the morning gave double progress -- this person peaks early."
76
 
77
- The trained agent should probe early, infer the person type by observing unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialize less. An agent that discovers it's helping a workaholic should maximize productive hours and minimize idle time.
 
 
78
 
79
  ## The Training Pipeline
80
 
81
- Training uses GRPO -- Group Relative Policy Optimization. For each game state, generate multiple candidate actions, score them all against the environment, update the model to prefer the ones that scored higher. The environment *is* the critic.
82
 
83
- The per-step reward signal is strongly differentiated by profile. At the same starting state (morning, all meters at 0.7), the best action is completely different:
84
 
85
  | Profile | Best Action | Reward | Worst Action | Reward |
86
  |---------|-------------|--------|--------------|--------|
87
  | Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
88
- | Extrovert | FAMILY_TIME | +2.63 | ME_TIME | -0.42 |
89
- | Workaholic | DEEP_WORK | +1.57 | ME_TIME | -0.27 |
90
 
91
- The model is Qwen 2.5-3B with 4-bit quantization and LoRA -- small enough to train on a free Colab T4.
92
 
93
  ## What I'm Hoping To See
94
 
95
- The heuristic baseline (fixed rules, no profile adaptation) scores around 0.76-0.82 depending on the profile. It treats everyone the same -- sleep at night, work in morning, socialize when connection drops. It works despite the hidden dynamics, not because it understands them.
 
 
96
 
97
- A trained agent that discovers the hidden personality should adapt its behavior per profile. The dream is not just higher scores -- it's qualitatively different action sequences for different people. The introvert's week should look nothing like the extrovert's week.
98
 
99
  No questionnaire. No settings page. Just attention, inference, and adjustment.
100
 
101
- That's what personal AI should feel like.
102
 
103
  ---
104
 
105
  **Links:**
106
  - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
107
- - [Training Notebook (Colab)](training/RhythmEnv_GRPO_Training.ipynb)
108
  - [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
109
 
110
  *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
 
4
 
5
  Sounds reasonable. But it's the wrong approach entirely.
6
 
7
+ Think about the people who actually know you well: a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
8
 
9
  They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
10
 
11
+ ## The problem with frontier models doing this
12
+
13
+ A capable frontier model can already do decent personalized planning if you describe yourself in the prompt. Tell GPT-4 "I'm an introvert who peaks in the morning," and it'll give you reasonable advice. The problem is that approach doesn't scale:
14
+
15
+ - You have to tell it who you are every single time
16
+ - It can't observe your actual behavioral responses to recommendations
17
+ - It runs in the cloud, costs per query, and can't be always-on or private
18
+ - Most users can't accurately describe their own patterns anyway
19
+
20
+ What we actually need is a small model, something that can run cheaply, frequently, and eventually on-device, that builds up a picture of you from *how you respond*, not from what you say about yourself. That's the gap RhythmEnv is designed to train for.
21
+
22
+ I work on AI at Microsoft. One thing I kept running into while building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalized the drain. The workaholic who checks "I value work-life balance" because they know they should.
23
+
24
+ Preference forms capture what people believe about themselves. Behavior reveals what's actually true.
25
+
26
  ## The Trait Decomposition
27
 
28
+ Here's how I think about modeling humans. Start with **traits**, the atomic behavioral properties that describe how a person responds to activities:
29
 
30
+ How much does socializing physically drain you? When does your brain work best? Does leisure make you feel guilty, or recharge you? Does productivity give you calm, or just tire you out? How far can your serenity drop before everything starts spiraling?
31
 
32
+ No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, an early cognitive peak, and strong solo recharge. An extrovert night owl has the opposite: low social cost, a late cognitive peak, and socializing that actually gives them serenity.
33
 
34
  Same list of traits. Different values. Different person.
35
 
 
37
 
38
  Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
39
 
40
+ In RhythmEnv, each person has hidden **reward weights**, a definition of what a good week means to them:
41
 
42
+ - The introvert values **serenity** above all (60% weight). A week where they maintained inner peace and made some progress is a great week. Connection barely registers.
43
+ - The extrovert values **connection** above all (75% weight). A week full of meaningful social interactions is a great week, even if they didn't get much work done.
44
+ - The workaholic values **progress** above all (70% weight). Deep productive work is the whole point. Everything else is secondary.
45
 
46
  The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
47
 
 
49
 
50
  RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
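
To make the episode layout concrete, here is a minimal sketch of how a flat step index could map to a day and a time slot. The slot names, the Monday start, and the `decode_step` helper are assumptions for illustration; the post only states 7 days, 4 slots per day, and 28 decisions.

```python
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
# Slot names are assumed; the post only says there are four slots per day.
SLOTS = ("morning", "afternoon", "evening", "night")
STEPS_PER_EPISODE = len(DAYS) * len(SLOTS)  # 7 * 4 = 28 decisions


def decode_step(step: int) -> tuple[str, str]:
    """Map a step index in [0, 28) to a (day, slot) pair."""
    if not 0 <= step < STEPS_PER_EPISODE:
        raise ValueError(f"step must be in [0, {STEPS_PER_EPISODE}), got {step}")
    day_index, slot_index = divmod(step, len(SLOTS))
    return DAYS[day_index], SLOTS[slot_index]
```

Under these assumptions, step 0 is Monday morning and step 27 is Sunday night.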
51
 
52
+ Five life meters track the person's state. Picture them like fuel gauges on a dashboard:
53
 
54
+ - **Vitality**: physical energy. Sleep and exercise fill it up. Work drains it.
55
+ - **Cognition**: mental sharpness. Peaks in the morning for some, in the evening for others.
56
+ - **Progress**: career momentum. Only goes up when you work.
57
+ - **Serenity**: inner calm. Meditation helps. Overwork kills it.
58
+ - **Connection**: relationship health. Decays passively every time slot, so it quietly drops unless you actively socialize.
59
 
60
+ After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed, and the weights are different for every person type.
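
As a rough sketch of that weighted sum: the dominant weight for each profile below (60% serenity, 75% connection, 70% progress) comes from the post, while every other weight, the `step_reward` helper, and the absence of any scaling factor are assumptions. It is not meant to reproduce the reward magnitudes RhythmEnv actually emits.

```python
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

# Only the largest weight per profile is taken from the post.
# The remaining weights are hypothetical fill-ins chosen to sum to 1.
HYPOTHETICAL_WEIGHTS = {
    "introvert": {"vitality": 0.10, "cognition": 0.05, "progress": 0.15,
                  "serenity": 0.60, "connection": 0.10},
    "extrovert": {"vitality": 0.05, "cognition": 0.05, "progress": 0.10,
                  "serenity": 0.05, "connection": 0.75},
    "workaholic": {"vitality": 0.10, "cognition": 0.10, "progress": 0.70,
                   "serenity": 0.05, "connection": 0.05},
}


def step_reward(before, after, weights):
    """Weighted sum of per-meter changes between two observations."""
    return sum(weights[m] * (after[m] - before[m]) for m in METERS)


# Same meter change for everyone: serenity up, connection down.
before = {m: 0.7 for m in METERS}
after = dict(before, serenity=0.8, connection=0.65)

rewards = {
    profile: step_reward(before, after, weights)
    for profile, weights in HYPOTHETICAL_WEIGHTS.items()
}
```

With these placeholder weights, the identical meter change is a clear positive for the introvert and a net negative for the extrovert, which is the effect the agent has to detect from the scalar alone.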
61
 
62
  ## Why Identical Actions Produce Different Results
63
 
64
+ The trait modifiers change how actions physically affect the person, not just how rewards are computed.
65
 
66
+ Tell the introvert to socialize: their vitality drops 3× faster than normal. Their body physically rejects it. Tell the extrovert the same thing: barely any drain. They could socialize all day.
67
 
68
  Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
69
 
70
+ Tell the workaholic to do deep work: they recover +0.06 vitality, because productive work energizes them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, completely different physiological response.
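
The modifiers described above can be sketched as a small lookup plus per-profile adjustments. The modifier values (3× drain, ±0.10 serenity, +0.06 vitality, doubled gains) are from the post; the base effects, the `apply_traits` helper, and the reading of the workaholic's serenity drop as a −0.10 adjustment on top of the base effect are assumptions.

```python
# Base per-action meter deltas. Placeholder numbers, not RhythmEnv's real values.
BASE_EFFECTS = {
    "SOCIALIZE": {"vitality": -0.05, "connection": 0.15},
    "MEDITATE": {"serenity": 0.05},
    "DEEP_WORK": {"vitality": -0.10, "cognition": 0.02, "progress": 0.10},
}


def apply_traits(profile, action, slot):
    """Return the meter deltas for an action after profile-specific modifiers."""
    effect = dict(BASE_EFFECTS[action])  # copy so the base table stays untouched
    if action == "SOCIALIZE" and profile == "introvert":
        effect["vitality"] *= 3  # vitality drains 3x faster
    elif action == "MEDITATE":
        if profile == "introvert":
            effect["serenity"] += 0.10  # bonus on top of the base effect
        elif profile == "workaholic":
            effect["serenity"] -= 0.10  # idle time makes them anxious
    elif action == "DEEP_WORK":
        if profile == "workaholic":
            effect["vitality"] += 0.06  # productive work energizes them
        elif profile == "introvert" and slot == "morning":
            effect["progress"] *= 2  # morning peak doubles the gains
            effect["cognition"] *= 2
    return effect
```

For example, `apply_traits("introvert", "SOCIALIZE", "evening")` triples the placeholder vitality drain, while the extrovert gets the base effect unchanged.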
71
 
72
+ These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty: these are things people recognize in themselves but rarely articulate.
73
 
74
  ## What the Agent Must Figure Out
75
 
76
+ The agent sees meters, time of day, and reward. It doesn't see which profile is active, the trait values, or how the reward is being computed.
77
+
78
+ After a few actions, the patterns start showing. "I socialized and my vitality crashed, so this person drains from socializing." "I meditated and got a huge reward, so serenity must be heavily weighted." "Deep work in the morning gave double progress, so this person peaks early."
79
+
80
+ The trained agent should probe early, infer the person type by observing unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialize less. One that discovers a workaholic should maximize productive hours and cut idle time.
81
 
82
+ This is the skill we're training: *behavioral inference under partial observability*. Detect the hidden pattern from how the environment responds to your actions, then plan accordingly.
 
 
83
 
84
+ ## Why Simulation Is the Right Starting Point
85
 
86
+ Everything in RhythmEnv is simulated: the person doesn't exist, the meters aren't biometric readings, the profiles are synthetic. That's intentional, and it's not a limitation.
87
+
88
+ Robotics RL commonly trains in simulation first. The simulator is the curriculum, not the deployment target. The skill the model learns (detecting behavioral signatures from differential responses to the same action) is real and transferable. In a production version, the "meters" become observable proxies: calendar acceptance patterns, response latency after social-heavy days, a simple end-of-day wellness rating. The agent that learns to infer hidden reward weights from simulation learns the *structure* of the problem. The specific medium can change.
89
 
90
  ## The Training Pipeline
91
 
92
+ Training uses GRPO (Group Relative Policy Optimization). For each game state, generate multiple candidate actions, score them all against the environment itself, and update the model to prefer the ones that scored higher. The environment *is* the critic.
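
The "prefer the ones that scored higher" step is usually expressed as a group-relative advantage: each candidate's reward is normalized against the mean and spread of its own group. The sketch below shows only that normalization, under stated assumptions. The two middle rewards are made up for illustration (1.76 and 0.03 are the introvert's best and worst from this post), and the training notebook may compute advantages differently.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps keeps the division defined when every candidate scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# One group of candidate actions sampled for the same game state.
candidates = ["MEDITATE", "DEEP_WORK", "EXERCISE", "SOCIALIZE"]
rewards = [1.76, 0.90, 0.50, 0.03]  # middle two values are hypothetical
advantages = group_relative_advantages(rewards)
```

Candidates above the group mean get a positive advantage and are reinforced; those below it are pushed down. No separate value network is needed, which is why the environment can act as the critic.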
93
 
94
+ The per-step reward signal is strongly differentiated by profile. At the same starting state (Monday morning, all meters at 0.7), the best action is completely different:
95
 
96
  | Profile | Best Action | Reward | Worst Action | Reward |
97
  |---------|-------------|--------|--------------|--------|
98
  | Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
99
+ | Extrovert | FAMILY_TIME | +2.63 | ME_TIME | −0.42 |
100
+ | Workaholic | DEEP_WORK | +1.57 | ME_TIME | −0.27 |
101
 
102
+ The model is Qwen 2.5-3B with 4-bit quantization and LoRA: small enough to train on a free Colab T4, small enough to eventually run at the edge. The goal isn't matching GPT-4's general reasoning. It's teaching a lightweight model a specific skill it doesn't have out of the box: infer who you're helping from how they respond, not from what they tell you.
103
 
104
  ## What I'm Hoping To See
105
 
106
+ The heuristic baseline (fixed rules, no profile adaptation, everyone treated the same) scores around 0.76–0.82. Sleep at night, work in the morning, socialize when connection drops. Reasonable advice for anyone.
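
A baseline of that shape might look like the sketch below. The three rules are the ones named in the post; the rule order, the `connection_floor` threshold, the fallback action, and the slot names are assumptions, so this is not the project's actual baseline.

```python
def heuristic_action(slot, meters, connection_floor=0.4):
    """Pick an action from fixed rules, ignoring who the person is."""
    if slot == "night":
        return "SLEEP"  # sleep at night
    if meters["connection"] < connection_floor:
        return "SOCIALIZE"  # socialize when connection drops
    if slot == "morning":
        return "DEEP_WORK"  # work in the morning
    return "EXERCISE"  # assumed fallback for the remaining slots


meters = {"vitality": 0.7, "cognition": 0.7, "progress": 0.7,
          "serenity": 0.7, "connection": 0.7}
morning_choice = heuristic_action("morning", meters)
```

Nothing in the function reads the reward or reacts to unexpected meter changes, which is exactly why it cannot tell an introvert from an extrovert.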
107
+
108
+ The trained agent should do better by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. Not just higher scores, but genuinely differentiated action sequences per profile. That's the signal that inference is happening, not just pattern matching.
109
 
110
+ The bigger goal is a learning curve that carries over to a real product too. There, the first few interactions are the model probing: making recommendations and observing how the user responds. After a handful of exchanges, it should have enough signal to know whether it's dealing with someone who needs serenity protected, or someone who needs to be pushed into more productive hours.
111
 
112
  No questionnaire. No settings page. Just attention, inference, and adjustment.
113
 
114
+ That's what personal AI should actually feel like.
115
 
116
  ---
117
 
118
  **Links:**
119
  - [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
120
+ - [Training Notebook (Colab)](../training/RhythmEnv_GRPO_Training.ipynb)
121
  - [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
122
 
123
  *Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*