From Simulation to Real Product: Deployment Architecture
The Question
The training environment teaches a model to discover hidden personality profiles through reward signals alone. But once training is done, how does that actually connect to a real person's calendar, wearables, and daily life? And when the model meets a new real user, does it retrain from scratch, or does the RL pretraining give it a head start?
The Real-World Pipeline
The trained model sits at the center of a three-part bridge:
Real world               Bridge layer               Trained model
-------------------      -----------------------    ----------------------
Apple Watch / Whoop  →   meter proxy mapping    →   State observation
  HRV, resting HR          (Vitality, Serenity)       (same 5 meters the
  sleep score                                         model trained on)
  steps, activity
  stress score

Google Calendar      →   task → action type     →   Model infers profile
  "Q3 planning doc"        DEEP_WORK                  from Accept/Ignore
  "Gym - 7am"              EXERCISE                   history, adapts
  "1:1 with Sarah"         SOCIALIZE                  recommendations
  "Date night"             FAMILY_TIME
  "No meeting block"       ME_TIME / SLEEP

User taps            →   +1 / -1 reward         →   Model refines its
  Accept                                              model of this person
  Ignore / Reschedule
Meter Proxies from Devices
Each of the 5 simulated meters has a real-world proxy that devices already measure:
| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), inverse of meeting density |
| Connection | Social calendar events in past 7 days, message activity |
The agent never asks how you feel. The devices already know.
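As a concrete sketch of one proxy, here is what the Vitality mapping could look like. The field names and weights are purely illustrative assumptions, not the actual formula; a real bridge layer would calibrate these per device vendor.

```python
# Illustrative Vitality proxy: combine wearable signals into a 0-1 meter.
# Weights and normalization constants are invented for this sketch.

def vitality(sleep_score: float, hrv_ms: float, resting_hr: int, steps: int) -> float:
    sleep = sleep_score / 100.0                    # device reports 0-100
    hrv = min(hrv_ms / 100.0, 1.0)                 # ~100 ms treated as ceiling
    hr = max(0.0, 1.0 - (resting_hr - 40) / 60)    # 40 bpm -> 1.0, 100 bpm -> 0.0
    activity = min(steps / 10_000, 1.0)            # 10k steps = full credit
    return round(0.4 * sleep + 0.25 * hrv + 0.2 * hr + 0.15 * activity, 2)

print(vitality(sleep_score=82, hrv_ms=65, resting_hr=55, steps=7_500))  # 0.75
```

The other four meters would follow the same pattern: a weighted blend of whatever signals the table above lists, normalized into the 0-1 range the model saw in simulation.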
Calendar → Action Types
A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps calendar events to the 10 action types the model understands:
"Q3 planning doc" β DEEP_WORK
"Team standup" β ADMIN_WORK
"Learn Python course" β LEARN
"Gym block" β EXERCISE
"Journaling" β MEDITATE / ME_TIME
"Dinner with family" β FAMILY_TIME
"Team social event" β SOCIALIZE
"No meeting block" β ME_TIME
"Sleep / recovery day" β SLEEP
Accept / Ignore as the Reward Signal
Every time the assistant makes a recommendation and the user acts on it:
- Accept → "you read me right"
- Ignore / Reschedule → "you got something wrong about me"

Over hundreds of these micro-interactions, the model builds a precise picture of who this person is: not the person they describe themselves to be, but the person revealed by their actual responses.
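A minimal sketch of how those taps accumulate, assuming the +1/-1 mapping from the diagram above (treating Reschedule as a softer negative is my own assumption, not stated in the source):

```python
from dataclasses import dataclass, field

REWARD = {"accept": 1.0, "ignore": -1.0, "reschedule": -0.5}  # reschedule weight is an assumption

@dataclass
class InteractionLog:
    # (context, action) -> (count, total reward)
    stats: dict = field(default_factory=dict)

    def record(self, context: str, action: str, response: str) -> None:
        n, total = self.stats.get((context, action), (0, 0.0))
        self.stats[(context, action)] = (n + 1, total + REWARD[response])

    def mean_reward(self, context: str, action: str) -> float:
        n, total = self.stats.get((context, action), (0, 0.0))
        return total / n if n else 0.0

log = InteractionLog()
log.record("morning", "DEEP_WORK", "ignore")
log.record("morning", "DEEP_WORK", "ignore")
log.record("evening", "MEDITATE", "accept")
print(log.mean_reward("morning", "DEEP_WORK"))  # -1.0
```

These per-(context, action) statistics are exactly the signal the frozen model reads from its context window, and later the training data for the fine-tuning pass described below.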
Are Weights Updated in Real Time?
No. In training, GRPO runs gradient updates after every batch; the model's weights change continuously. In production, the deployed model is frozen. Personalization happens in the context window, not in the weights.
Phase 1: In-Context Adaptation (no gradient updates)
The frozen model reads the user's history as part of each prompt:
System: You are a life management agent...
Recent interactions (last 30 days):
Mon Morning, Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
Mon Afternoon, Vitality=0.70 → recommended EXERCISE → user: ACCEPTED
Thu Morning, Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
Sat Evening, Serenity=0.45 → recommended MEDITATE → user: ACCEPTED
...
Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
The model reads the pattern (morning deep work keeps getting rejected, evening recovery keeps getting accepted), infers a "morning penalty" profile, and shifts its recommendations. No gradient update is required. This is in-context learning over the Accept/Ignore history.
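Assembling that prompt from the interaction log is plain string formatting. This sketch assumes a simple `history` record shape; the system text is abbreviated as in the example above, and the call to the frozen model itself is omitted.

```python
# Build the in-context prompt from logged Accept/Ignore interactions.
# Record fields ('when', 'meters', 'action', 'response') are illustrative.

def build_prompt(history: list[dict], current_state: str) -> str:
    lines = ["System: You are a life management agent...",
             "", "Recent interactions (last 30 days):"]
    for h in history[-30:]:  # cap the window at the most recent entries
        lines.append(f"{h['when']}, {h['meters']} -> recommended "
                     f"{h['action']} -> user: {h['response']}")
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)

history = [
    {"when": "Mon Morning", "meters": "Vitality=0.70",
     "action": "DEEP_WORK", "response": "IGNORED"},
    {"when": "Sat Evening", "meters": "Serenity=0.45",
     "action": "MEDITATE", "response": "ACCEPTED"},
]
print(build_prompt(history, "Tuesday Morning, Vitality=0.68"))
```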
Phase 2: Periodic Offline Fine-Tuning (background, weekly/monthly)
After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background on that specific user's interactions. The updated adapter encodes their preferences into the weights. This is not real-time; it is a scheduled job, much like a weekly model refresh. The resulting model is then more accurate from inference step 1, without needing to re-read 30 days of context every time.
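The data-prep step for that job could look like the sketch below: accepted recommendations become supervised examples for the user-specific adapter. The example format is an assumption; the LoRA pass itself (e.g. via the `peft` library) is assumed to run as the scheduled background task and is not shown.

```python
# Turn the Accept/Ignore log into SFT examples for the weekly LoRA job.
# Record fields and the prompt/completion format are illustrative.

def to_sft_examples(history: list[dict]) -> list[dict]:
    examples = []
    for h in history:
        if h["response"] != "ACCEPTED":
            continue  # train only on choices this user confirmed
        examples.append({
            "prompt": f"State: {h['when']}, {h['meters']}. Recommend an action:",
            "completion": h["action"],
        })
    return examples

history = [
    {"when": "Mon Morning", "meters": "Vitality=0.70",
     "action": "DEEP_WORK", "response": "IGNORED"},
    {"when": "Sat Evening", "meters": "Serenity=0.45",
     "action": "MEDITATE", "response": "ACCEPTED"},
]
print(to_sft_examples(history))
```

A fuller version might also include ignored recommendations as negative examples (e.g. for a preference-tuning objective), but the accepted-only form is the simplest baseline.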
Does RL Pretraining Help Personalize Faster?
Yes. This is the most important architectural property of the system.
The Core Insight
The RL training in simulation does not teach the model facts about specific people. It teaches a skill: how to detect a person's hidden preferences from differential responses to actions.
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
The simulation trained the model on thousands of episodes across three wildly different profiles. It learned the structure of the inference problem β what patterns look like when someone has a morning penalty, when someone's social tolerance is low, when someone's vitality recovers from work rather than draining. When it meets a real user, it is not starting from zero. It has strong priors about what the data means.
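The "strong priors" claim can be made concrete with a toy Bayesian view: treat the training archetypes as hypotheses and update a posterior over them from each Ignore. The archetype names and likelihood numbers below are invented for illustration; the real model does this implicitly in its weights rather than with an explicit posterior.

```python
# Toy posterior over archetypes given repeated morning DEEP_WORK ignores.
# P(ignore morning DEEP_WORK | archetype) -- values are illustrative.
IGNORE_LIKELIHOOD = {"early_bird": 0.1, "late_chronotype": 0.8, "extrovert": 0.7}

def posterior_after(n_ignores: int) -> dict:
    belief = {a: 1 / 3 for a in IGNORE_LIKELIHOOD}  # uniform prior
    for _ in range(n_ignores):
        unnorm = {a: belief[a] * IGNORE_LIKELIHOOD[a] for a in belief}
        z = sum(unnorm.values())
        belief = {a: p / z for a, p in unnorm.items()}
    return belief

post = posterior_after(3)
print({a: round(p, 2) for a, p in post.items()})
```

After just three ignores the "early bird" hypothesis is effectively ruled out, which is why the pretrained model can move straight to a discriminating probe (evening DEEP_WORK) instead of collecting dozens more samples.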
The Formal Analogy
This is essentially MAML (Model-Agnostic Meta-Learning) without explicitly calling it that. MAML trains a model specifically so it can adapt to new tasks with very few gradient steps. Our RL pretraining does the same thing: the simulation is the meta-training distribution, each real user is a new task, and the few-shot personalization (in-context or fine-tune) is the inner-loop adaptation.
The sim-to-real transfer works because the skill is structural, not numerical. Whether "vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect that "positive actions are underperforming expectations here, and here is how to probe for the hidden reason" remains valid.
Full Architecture: Three Phases
Phase 1: Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2: Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3: Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
The RL pretraining writes the chapter on "how to read a person" into the model's weights. When it meets a real user, it is not starting from zero: it already knows what patterns to look for and what they mean.