docs: expand blog with purpose, sim-to-real framing, lightweight model goal
Adds sections explaining why we target a small/lightweight model (always-on,
private, cheap vs frontier API cost), why simulation is valid as a curriculum
(sim-to-real is standard RL practice), and what the trained behavioral
inference skill looks like in a real product.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/blog_post.md +55 -42
docs/blog_post.md
CHANGED
|
@@ -4,25 +4,32 @@ Ask someone how they'd build a personal AI assistant, and they'll say: give it a
|
|
| 4 |
|
| 5 |
Sounds reasonable. But it's the wrong approach entirely.
|
| 6 |
|
| 7 |
-
Think about the people who actually know you well
|
| 8 |
|
| 9 |
They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
|
| 10 |
| 11 |
## The Trait Decomposition
|
| 12 |
|
| 13 |
-
Here's how I think about modeling humans. Start with **traits**
|
| 14 |
|
| 15 |
-
|
| 16 |
-
- **Cognitive peak time** -- when does your brain work best? Morning? Evening?
|
| 17 |
-
- **Guilt sensitivity** -- does leisure make you feel guilty, or recharge you?
|
| 18 |
-
- **Work-peace coupling** -- does productivity give you calm, or just tire you out?
|
| 19 |
-
- **Stress tolerance** -- how far can your serenity drop before everything spirals?
|
| 20 |
-
- **Metabolic rate** -- how fast do you burn baseline energy?
|
| 21 |
-
- **Event resilience** -- how much does unexpected chaos throw you off?
|
| 22 |
-
- **Solo recharge** -- does alone time restore your inner peace?
|
| 23 |
-
- **Social warmth** -- does socializing give you serenity, or drain it?
|
| 24 |
|
| 25 |
-
No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, early cognitive peak,
|
| 26 |
|
| 27 |
Same list of traits. Different values. Different person.
|
| 28 |
|
|
@@ -30,11 +37,11 @@ Same list of traits. Different values. Different person.
|
|
| 30 |
|
| 31 |
Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
|
| 32 |
|
| 33 |
-
In RhythmEnv, each person has hidden **reward weights**
|
| 34 |
|
| 35 |
-
- The introvert values **serenity** above all (60% weight). A
|
| 36 |
-
- The extrovert values **connection** above all (75% weight). A
|
| 37 |
-
- The workaholic values **progress** above all (70% weight).
|
| 38 |
|
| 39 |
The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
|
| 40 |
|
|
@@ -42,69 +49,75 @@ The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar rewar
|
|
| 42 |
|
| 43 |
RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
|
| 44 |
|
| 45 |
-
Five life meters track the person's state
|
| 46 |
|
| 47 |
-
- **Vitality**
|
| 48 |
-
- **Cognition**
|
| 49 |
-
- **Progress**
|
| 50 |
-
- **Serenity**
|
| 51 |
-
- **Connection**
|
| 52 |
|
| 53 |
-
After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed
|
| 54 |
|
| 55 |
## Why Identical Actions Produce Different Results
|
| 56 |
|
| 57 |
-
The trait modifiers change how actions physically affect the person, not just how rewards are computed.
|
| 58 |
|
| 59 |
-
Tell the introvert to socialize: their vitality drops
|
| 60 |
|
| 61 |
Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
|
| 62 |
|
| 63 |
-
Tell the workaholic to do deep work: they
|
| 64 |
|
| 65 |
-
These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty
|
| 66 |
|
| 67 |
## What the Agent Must Figure Out
|
| 68 |
|
| 69 |
-
The agent sees meters, time of day, and reward. It doesn't see
|
| 70 |
|
| 71 |
-
|
| 72 |
-
- The trait values
|
| 73 |
-
- The reward weight decomposition
|
| 74 |
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
## The Training Pipeline
|
| 80 |
|
| 81 |
-
Training uses GRPO
|
| 82 |
|
| 83 |
-
The per-step reward signal is strongly differentiated by profile. At the same starting state
|
| 84 |
|
| 85 |
| Profile | Best Action | Reward | Worst Action | Reward |
|
| 86 |
|---------|-------------|--------|--------------|--------|
|
| 87 |
| Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
|
| 88 |
-
| Extrovert | FAMILY_TIME | +2.63 | ME_TIME |
|
| 89 |
-
| Workaholic | DEEP_WORK | +1.57 | ME_TIME |
|
| 90 |
|
| 91 |
-
The model is Qwen 2.5-3B with 4-bit quantization and LoRA
|
| 92 |
|
| 93 |
## What I'm Hoping To See
|
| 94 |
|
| 95 |
-
The heuristic baseline
|
| 96 |
|
| 97 |
-
|
| 98 |
|
| 99 |
No questionnaire. No settings page. Just attention, inference, and adjustment.
|
| 100 |
|
| 101 |
-
That's what personal AI should feel like.
|
| 102 |
|
| 103 |
---
|
| 104 |
|
| 105 |
**Links:**
|
| 106 |
- [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
|
| 107 |
-
- [Training Notebook (Colab)](training/RhythmEnv_GRPO_Training.ipynb)
|
| 108 |
- [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
|
| 109 |
|
| 110 |
*Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
|
|
|
|
| 4 |
|
| 5 |
Sounds reasonable. But it's the wrong approach entirely.
|
| 6 |
|
| 7 |
+
Think about the people who actually know you well: a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
|
| 8 |
|
| 9 |
They learned your hidden patterns through trial, error, and feedback. What if we could train an AI the same way?
|
| 10 |
|
| 11 |
+
## The problem with frontier models doing this
|
| 12 |
+
|
| 13 |
+
A capable frontier model can already do decent personalized planning if you describe yourself in the prompt. Tell GPT-4 "I'm an introvert who peaks in the morning," and it'll give you reasonable advice. The problem is that this approach doesn't scale:
|
| 14 |
+
|
| 15 |
+
- You have to tell it who you are every single time
|
| 16 |
+
- It can't observe your actual behavioral responses to recommendations
|
| 17 |
+
- It runs in the cloud, costs money on every query, and can't be always-on or private
|
| 18 |
+
- Most users can't accurately describe their own patterns anyway
|
| 19 |
+
|
| 20 |
+
What we actually need is a small model, something that can run cheaply, frequently, and eventually on-device, that builds up a model of you from *how you respond*, not from what you say about yourself. That's the gap RhythmEnv is designed to train for.
|
| 21 |
+
|
| 22 |
+
I work on AI at Microsoft. One thing I kept running into while building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting on their own patterns: the introvert who says "I don't mind meetings" because they've normalized the drain, the workaholic who checks "I value work-life balance" because they know they should.
|
| 23 |
+
|
| 24 |
+
Preference forms capture what people believe about themselves. Behavior reveals what's actually true.
|
| 25 |
+
|
| 26 |
## The Trait Decomposition
|
| 27 |
|
| 28 |
+
Here's how I think about modeling humans. Start with **traits**: atomic behavioral properties that describe how a person responds to activities.
|
| 29 |
|
| 30 |
+
How much does socializing physically drain you? When does your brain work best? Does leisure make you feel guilty, or recharge you? Does productivity give you calm, or just tire you out? How far can your serenity drop before everything starts spiraling?
|
| 31 |
|
| 32 |
+
No single trait defines a person. It's the *combination* that creates a personality. An introvert morning person has high social energy cost, early cognitive peak, and strong solo recharge. An extrovert night owl has the opposite: low social cost, late cognitive peak, and socializing that actually gives them serenity.
|
| 33 |
|
| 34 |
Same list of traits. Different values. Different person.
|
| 35 |
|
|
|
|
| 37 |
|
| 38 |
Here's the part that took me a while to articulate: two people can do the exact same activities and have completely different days. Not because the activities are different, but because they *value different outcomes*.
|
| 39 |
|
| 40 |
+
In RhythmEnv, each person has hidden **reward weights**, a definition of what a good week means to them:
|
| 41 |
|
| 42 |
+
- The introvert values **serenity** above all (60% weight). A week where they maintained inner peace and made some progress is a great week. Connection barely registers.
|
| 43 |
+
- The extrovert values **connection** above all (75% weight). A week full of meaningful social interactions is a great week, even if they didn't get much work done.
|
| 44 |
+
- The workaholic values **progress** above all (70% weight). Deep productive work is the whole point. Everything else is secondary.
|
| 45 |
|
| 46 |
The agent sees the same 5 meters. Takes the same 10 actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
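To make the "same meter changes, different reward" point concrete, here is a minimal sketch of a hidden weighted sum. It is not RhythmEnv's actual reward code: only the dominant weight per profile (0.60, 0.75, 0.70) comes from the text above, and every other weight, the `hidden_reward` helper, and the example deltas are assumptions chosen so each profile's weights sum to 1.

```python
# Meter names come from the post. Only the dominant weight per profile
# (serenity 0.60, connection 0.75, progress 0.70) is from the post;
# all remaining weights are illustrative assumptions.
METERS = ("vitality", "cognition", "progress", "serenity", "connection")

REWARD_WEIGHTS = {
    "introvert": {"vitality": 0.10, "cognition": 0.10, "progress": 0.15,
                  "serenity": 0.60, "connection": 0.05},
    "extrovert": {"vitality": 0.05, "cognition": 0.05, "progress": 0.05,
                  "serenity": 0.10, "connection": 0.75},
    "workaholic": {"vitality": 0.05, "cognition": 0.10, "progress": 0.70,
                   "serenity": 0.10, "connection": 0.05},
}


def hidden_reward(profile, deltas):
    """Weighted sum of meter changes; meters missing from deltas count as 0."""
    weights = REWARD_WEIGHTS[profile]
    return sum(weights[m] * deltas.get(m, 0.0) for m in METERS)


# One identical set of meter changes (hypothetical numbers for a social evening).
deltas = {"vitality": -0.05, "connection": 0.20}
rewards = {profile: hidden_reward(profile, deltas) for profile in REWARD_WEIGHTS}

for profile, reward in rewards.items():
    print(f"{profile:<11} {reward:+.4f}")
```

With these assumed weights the extrovert's reward for the same evening is many times larger than the other two, which is the kind of gap the agent has to notice.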
|
| 47 |
|
|
|
|
| 49 |
|
| 50 |
RhythmEnv simulates one week in a person's life. 7 days, 4 time slots each, 28 decisions total. Each decision picks an activity: deep work, exercise, sleep, meditation, family time, socializing, and so on. Ten options.
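The episode layout can be sketched in a few lines. The slot names and the `EXERCISE` and `SLEEP` identifiers are assumptions (the post only says "4 time slots" and names the activities in prose), and the action tuple lists only the seven activities the post mentions, not all ten.

```python
from itertools import product

DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
SLOTS = ("morning", "afternoon", "evening", "night")  # assumed slot names

# Activities named in the post; the environment has ten, the rest are not listed.
KNOWN_ACTIONS = (
    "DEEP_WORK", "EXERCISE", "SLEEP", "MEDITATE",
    "FAMILY_TIME", "SOCIALIZE", "ME_TIME",
)

# One decision per (day, slot) pair, in chronological order.
schedule = list(product(DAYS, SLOTS))

print(len(schedule), "decisions, from", schedule[0], "to", schedule[-1])
```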
|
| 51 |
|
| 52 |
+
Five life meters track the person's state. Picture them as fuel gauges on a dashboard:
|
| 53 |
|
| 54 |
+
- **Vitality**: physical energy. Sleep and exercise fill it up. Work drains it.
- **Cognition**: mental sharpness. Peaks in the morning for some, evening for others.
- **Progress**: career momentum. Only goes up when you work.
- **Serenity**: inner calm. Meditation helps. Overwork kills it.
- **Connection**: relationship health. Decays passively every time slot. If you don't actively socialize, it quietly drops on its own.
|
| 59 |
|
| 60 |
+
After every action, meters shift. The agent sees the new meter values and gets a reward. That reward is the hidden weighted sum of what changed, and the weights are different for every person type.
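A rough sketch of one meter update, including the passive connection decay. The decay rate, the `[0, 1]` clamp, and the `apply_step` helper are all assumptions for illustration; the post only says that connection decays every slot and that meters can start at 0.7.

```python
METER_NAMES = ("vitality", "cognition", "progress", "serenity", "connection")
CONNECTION_DECAY = 0.02  # assumed per-slot decay; the real rate is not given


def apply_step(meters, deltas):
    """Return new meters after one time slot, leaving the input unchanged."""
    updated = dict(meters)
    for name, change in deltas.items():
        updated[name] += change
    # Connection drops every slot, whatever the action was.
    updated["connection"] -= CONNECTION_DECAY
    # Keep every meter inside the assumed [0, 1] range.
    return {name: min(1.0, max(0.0, value)) for name, value in updated.items()}


start = {name: 0.7 for name in METER_NAMES}
after = apply_step(start, {"progress": 0.10, "vitality": -0.05})
print(after)
```

Note that connection falls even though the action never touched it, which is why an agent that ignores socializing watches that gauge drain on its own.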
|
| 61 |
|
| 62 |
## Why Identical Actions Produce Different Results
|
| 63 |
|
| 64 |
+
The trait modifiers change how actions physically affect the person, not just how rewards are computed.
|
| 65 |
|
| 66 |
+
Tell the introvert to socialize: their vitality drops 3× faster than normal. Their body physically rejects it. Tell the extrovert the same thing: barely any drain. They could socialize all day.
|
| 67 |
|
| 68 |
Tell the introvert to meditate: they get a bonus +0.10 serenity on top of the base effect. Alone time is their recharge mechanism. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
|
| 69 |
|
| 70 |
+
Tell the workaholic to do deep work: they recover +0.06 vitality, because productive work energizes them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, completely different physiological response.
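The modifiers described above can be sketched as adjustments layered on a shared base effect. The modifier values (3×, +0.10, a 0.10 penalty, +0.06, doubling) come from the text; the base effects, the `action_effects` helper, and the choice to treat the workaholic's serenity drop as a 0.10 penalty on top of the base effect are assumptions.

```python
# Assumed base effects per action; the post does not give these numbers.
BASE_EFFECTS = {
    "SOCIALIZE": {"vitality": -0.04, "connection": 0.15},
    "MEDITATE": {"serenity": 0.10},
    "DEEP_WORK": {"vitality": -0.08, "progress": 0.12, "cognition": 0.04},
}


def action_effects(profile, action, slot):
    """Base effect of an action, adjusted by the profile's trait modifiers."""
    effects = dict(BASE_EFFECTS[action])
    if profile == "introvert":
        if action == "SOCIALIZE":
            effects["vitality"] *= 3           # drains 3x faster
        elif action == "MEDITATE":
            effects["serenity"] += 0.10        # solo recharge bonus
        elif action == "DEEP_WORK" and slot == "morning":
            effects["progress"] *= 2           # morning cognitive peak
            effects["cognition"] *= 2
    elif profile == "workaholic":
        if action == "MEDITATE":
            effects["serenity"] -= 0.10        # idle time causes anxiety
        elif action == "DEEP_WORK":
            effects["vitality"] += 0.06        # productive work energizes
    return effects


for profile in ("introvert", "extrovert", "workaholic"):
    print(profile, action_effects(profile, "SOCIALIZE", "evening"))
```

Because the modifiers change the meters themselves, the agent can spot a profile from the observation alone, before the reward even enters the picture.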
|
| 71 |
|
| 72 |
+
These aren't arbitrary. They're modeled after real behavioral patterns. The introvert's social drain, the workaholic's anxiety from idleness, the night owl's morning penalty: these are things people recognize in themselves but rarely articulate.
|
| 73 |
|
| 74 |
## What the Agent Must Figure Out
|
| 75 |
|
| 76 |
+
The agent sees meters, time of day, and reward. It doesn't see which profile is active, the trait values, or how the reward is being computed.
|
| 77 |
+
|
| 78 |
+
After a few actions, the patterns start showing. "I socialized and my vitality crashed, so this person drains from socializing." "I meditated and got a huge reward, so serenity must be heavily weighted." "Deep work in the morning gave double progress, so this person peaks early."
|
| 79 |
+
|
| 80 |
+
The trained agent should probe early, infer the person type by observing unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialize less. One that discovers a workaholic should maximize productive hours and cut idle time.
|
| 81 |
|
| 82 |
+
This is the skill we're training: *behavioral inference under partial observability*. Detect the hidden pattern from how the environment responds to your actions, then plan accordingly.
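As a hand-written analogue of that inference step, here is a sketch that picks the profile whose predicted meter changes best match what was observed. The trained model is meant to do this implicitly inside its policy, not with a lookup table; the `PREDICTED` numbers and the `infer_profile` helper are assumptions that only mirror the qualitative differences described earlier.

```python
# Assumed predictions of how each profile's meters respond to two probe actions.
PREDICTED = {
    "introvert": {"SOCIALIZE": {"vitality": -0.12}, "MEDITATE": {"serenity": 0.20}},
    "extrovert": {"SOCIALIZE": {"vitality": -0.04}, "MEDITATE": {"serenity": 0.10}},
    "workaholic": {"SOCIALIZE": {"vitality": -0.04}, "MEDITATE": {"serenity": 0.00}},
}


def infer_profile(observations):
    """Pick the profile with the smallest squared prediction error.

    observations: list of (action, meter, observed_change) tuples.
    """
    def error(profile):
        return sum(
            (PREDICTED[profile][action][meter] - change) ** 2
            for action, meter, change in observations
        )
    return min(PREDICTED, key=error)


# A single social probe already separates the introvert from the other two.
print(infer_profile([("SOCIALIZE", "vitality", -0.11)]))

# Extrovert and workaholic look the same until a meditation probe is added.
print(infer_profile([("SOCIALIZE", "vitality", -0.05), ("MEDITATE", "serenity", 0.01)]))
```

The second call shows why probing matters: one observation can be ambiguous, and the agent has to choose follow-up actions that split the remaining candidates.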
|
| 83 |
|
| 84 |
+
## Why Simulation Is the Right Starting Point
|
| 85 |
|
| 86 |
+
Everything in RhythmEnv is simulated: the person doesn't exist, the meters aren't biometric readings, and the profiles are synthetic. That's intentional, and it's not a limitation.
|
| 87 |
+
|
| 88 |
+
Robotics RL routinely trains in simulation first and transfers to hardware afterwards. The simulator is the curriculum, not the deployment target. The skill the model learns (detecting behavioral signatures from differential responses to the same action) is real and transferable. In a production version, the "meters" become observable proxies: calendar acceptance patterns, response latency after social-heavy days, a simple end-of-day wellness rating. The agent that learns to infer hidden reward weights from simulation learns the *structure* of the problem. The specific medium can change.
|
| 89 |
|
| 90 |
## The Training Pipeline
|
| 91 |
|
| 92 |
+
Training uses GRPO (Group Relative Policy Optimization). For each game state, generate multiple candidate actions, score them all against the real environment, update the model to prefer the ones that scored higher. The environment *is* the critic.
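The "group relative" part can be sketched as follows: each candidate's reward is compared against the mean of its own group and scaled by the group's spread, so no separate value network is needed. This is a minimal illustration of the advantage computation only, not the training notebook's code. The `group_relative_advantages` helper, the use of the population standard deviation, and the two middle rewards in the example are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    # eps keeps the division safe when every candidate scored the same.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four candidate actions sampled for one state. The first and second values
# echo the introvert's best and worst rewards; the other two are made up.
group_rewards = [1.76, 0.03, 0.90, 0.50]
advantages = group_relative_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward {reward:+.2f} -> advantage {advantage:+.2f}")
```

Candidates above the group mean get a positive advantage and are reinforced, and candidates below it are pushed down, which is how the environment's scores stand in for a learned critic.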
|
| 93 |
|
| 94 |
+
The per-step reward signal is strongly differentiated by profile. At the same starting state (Monday morning, all meters at 0.7) the best action is completely different:
|
| 95 |
|
| 96 |
| Profile | Best Action | Reward | Worst Action | Reward |
|---------|-------------|--------|--------------|--------|
| Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
| Extrovert | FAMILY_TIME | +2.63 | ME_TIME | -0.42 |
| Workaholic | DEEP_WORK | +1.57 | ME_TIME | -0.27 |
|
| 101 |
|
| 102 |
+
The model is Qwen 2.5-3B with 4-bit quantization and LoRA, small enough to train on a free Colab T4 and small enough to eventually run at the edge. The goal isn't matching GPT-4's general reasoning. It's teaching a lightweight model a specific skill it doesn't have out of the box: infer who you're helping from how they respond, not from what they tell you.
|
| 103 |
|
| 104 |
## What I'm Hoping To See
|
| 105 |
|
| 106 |
+
The heuristic baseline (fixed rules, no profile adaptation, treats everyone the same) scores around 0.76 to 0.82. Sleep at night, work in the morning, socialize when connection drops. Reasonable advice for anyone.
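A fixed-rule policy of that kind might look like the sketch below. The three rules come from the sentence above; the rule priority, the connection threshold, the `EXERCISE` fallback, and the `heuristic_action` name are assumptions, and this is not the baseline that produced the quoted scores.

```python
def heuristic_action(slot, meters, connection_floor=0.4):
    """Profile-blind rules: the same advice for every person type."""
    if slot == "night":
        return "SLEEP"
    if meters["connection"] < connection_floor:  # assumed threshold
        return "SOCIALIZE"
    if slot == "morning":
        return "DEEP_WORK"
    return "EXERCISE"  # assumed fallback for the remaining slots


meters = {"vitality": 0.7, "cognition": 0.7, "progress": 0.7,
          "serenity": 0.7, "connection": 0.7}
print(heuristic_action("morning", meters))
print(heuristic_action("afternoon", {**meters, "connection": 0.3}))
```

Nothing in the function takes a profile or a reward history as input, so it cannot produce a different week for the introvert than for the extrovert.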
|
| 107 |
+
|
| 108 |
+
The trained agent should do better by doing something qualitatively different: the introvert's week should look nothing like the extrovert's week. Not just higher scores, but genuinely differentiated action sequences per profile. That's the signal that inference is happening, not just pattern matching.
|
| 109 |
|
| 110 |
+
The bigger goal is a learning curve that works in the other direction too. In a real product, the first few interactions are the model probing: making recommendations and observing how the user responds. After a handful of exchanges, it should have enough signal to know whether it's dealing with someone who needs serenity protected, or someone who needs to be pushed into more productive hours.
|
| 111 |
|
| 112 |
No questionnaire. No settings page. Just attention, inference, and adjustment.
|
| 113 |
|
| 114 |
+
That's what personal AI should actually feel like.
|
| 115 |
|
| 116 |
---
|
| 117 |
|
| 118 |
**Links:**
|
| 119 |
- [Live Environment on HF Spaces](https://huggingface.co/spaces/InosLihka/rhythm_env)
|
| 120 |
+
- [Training Notebook (Colab)](../training/RhythmEnv_GRPO_Training.ipynb)
|
| 121 |
- [Source Code & README](https://huggingface.co/spaces/InosLihka/rhythm_env)
|
| 122 |
|
| 123 |
*Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.*
|