# From Simulation to Real Product: Deployment Architecture
## The Question
The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch, or does the RL pretraining give it a head start?
---
## The Real-World Pipeline
The trained model sits at the center of a three-part bridge:
```
Real world               Bridge layer               Trained model
──────────────────       ─────────────────          ──────────────────
Apple Watch / Whoop  →   meter proxy mapping   →    State observation
  HRV, resting HR                                    (same 5 meters the
  sleep score                                        model trained on)
  steps, activity    →   Vitality, Serenity
  stress score

Google Calendar      →   task → action type    →    Model infers profile
  "Q3 planning doc"        DEEP_WORK                 from Accept/Ignore
  "Gym - 7am"              EXERCISE                  history, adapts
  "1:1 with Sarah"         SOCIALIZE                 recommendations
  "Date night"             FAMILY_TIME
  "No meeting block"       ME_TIME / SLEEP

User taps            →   +1 / -1 reward        →    Model refines its
  Accept                                             model of this person
  Ignore / Reschedule
```
### Meter Proxies from Devices
Each of the 5 simulated meters has a real-world proxy that devices already measure:
| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), meeting density inverse |
| Connection | Social calendar events in past 7 days, message activity |
The agent never asks how you feel. The devices already know.
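As a sketch of what this mapping could look like in code (the blend weights, field names, and
function names below are illustrative assumptions, not values taken from the training
environment):

```python
# Illustrative meter-proxy mapping, assuming device readings already
# normalized to [0, 1]. Weights and field names are placeholders.
def vitality_proxy(sleep_score: float, hrv: float, resting_hr: float, steps: float) -> float:
    # A lower resting heart rate is better, so that signal is inverted.
    return 0.4 * sleep_score + 0.25 * hrv + 0.2 * (1.0 - resting_hr) + 0.15 * steps

def serenity_proxy(hrv_trend: float, stress: float, meeting_density: float) -> float:
    # Stress and a packed calendar pull Serenity down.
    return 0.4 * hrv_trend + 0.3 * (1.0 - stress) + 0.3 * (1.0 - meeting_density)

def observation(device: dict, calendar: dict) -> dict:
    """Build the state observation the frozen model expects: the 5 meters from simulation."""
    return {
        "vitality": vitality_proxy(device["sleep_score"], device["hrv"],
                                   device["resting_hr"], device["steps"]),
        "serenity": serenity_proxy(device["hrv_trend"], device["stress"],
                                   calendar["meeting_density"]),
        # Cognition, Progress, and Connection follow the proxy table above
        # in the same way.
    }
```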
### Calendar → Action Types
A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:
```
"Q3 planning doc" β†’ DEEP_WORK
"Team standup" β†’ ADMIN_WORK
"Learn Python course" β†’ LEARN
"Gym block" β†’ EXERCISE
"Journaling" β†’ MEDITATE / ME_TIME
"Dinner with family" β†’ FAMILY_TIME
"Team social event" β†’ SOCIALIZE
"No meeting block" β†’ ME_TIME
"Sleep / recovery day" β†’ SLEEP
```
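A minimal sketch of such a keyword-rule classifier (the rule table, fallback, and function name
are illustrative, not the deployed logic):

```python
# Keyword rules mapping calendar event titles to action types; ambiguous
# titles could fall back to a cheap LLM call instead of the default below.
KEYWORD_RULES = [
    (("gym", "run", "workout"), "EXERCISE"),
    (("standup", "sync", "email"), "ADMIN_WORK"),
    (("course", "learn", "tutorial"), "LEARN"),
    (("journal", "meditat"), "MEDITATE"),
    (("family", "date night", "dinner with"), "FAMILY_TIME"),
    (("team social", "drinks", "party"), "SOCIALIZE"),
    (("no meeting",), "ME_TIME"),
    (("sleep", "recovery day"), "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title to one of the action types the model understands."""
    lowered = title.lower()
    for keywords, action in KEYWORD_RULES:
        if any(k in lowered for k in keywords):
            return action
    return "DEEP_WORK"  # untitled or unmatched work blocks default to deep work
```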
### Accept / Ignore as the Reward Signal
Every time the assistant makes a recommendation and the user acts on it:
- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"
Over hundreds of these micro-interactions, the model builds a precise picture of who this
person is: not the person they describe themselves to be, but the person revealed by
their actual responses.
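In code, this is just a scalar mapping back onto the reward the model saw in training (the
names here are placeholders, not the app's actual interface):

```python
# Reduce each user response to the same +1 / -1 reward used during RL training.
RESPONSE_REWARD = {
    "accept": +1.0,      # "you read me right"
    "ignore": -1.0,      # "you got something wrong about me"
    "reschedule": -1.0,  # treated like ignore: the type or timing was off
}

def reward_from_response(response: str) -> float:
    """Map an Accept / Ignore / Reschedule tap to the scalar reward signal."""
    return RESPONSE_REWARD[response.lower()]
```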
---
## Are Weights Updated in Real Time?
No. In training, GRPO runs gradient updates after every batch, so the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.
### Phase 1: In-Context Adaptation (no gradient updates)
The frozen model reads the user's history as part of each prompt:
```
System: You are a life management agent...
Recent interactions (last 30 days):
  Mon Morning,   Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE  → user: ACCEPTED
  Thu Morning,   Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening,   Serenity=0.45 → recommended MEDITATE  → user: ACCEPTED
  ...
Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```
The model reads the pattern (morning deep work keeps getting rejected, evening recovery
keeps getting accepted), infers a "morning penalty profile", and shifts its recommendations.
No gradient update required. This is in-context learning over the Accept/Ignore history.
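A minimal sketch of how that prompt can be assembled from the logged history (the dataclass
and function names are assumptions for illustration, not the project's actual interfaces):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    when: str         # e.g. "Mon Morning"
    meters: dict      # e.g. {"Vitality": 0.70}
    recommended: str  # e.g. "DEEP_WORK"
    response: str     # "ACCEPTED" or "IGNORED"

def build_prompt(history: list[Interaction], current_state: str) -> str:
    """Render the Accept/Ignore history into the frozen model's context window."""
    lines = [
        "System: You are a life management agent...",
        "Recent interactions (last 30 days):",
    ]
    for it in history:
        meters = ", ".join(f"{k}={v:.2f}" for k, v in it.meters.items())
        lines.append(f"  {it.when}, {meters} -> recommended {it.recommended} -> user: {it.response}")
    lines.append(f"Current state: {current_state}")
    lines.append("Recommend an action:")
    return "\n".join(lines)
```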
### Phase 2: Periodic Offline Fine-Tuning (background, weekly/monthly)
After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time; it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.
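A rough sketch of what that background job could look like, assuming a Hugging Face causal LM
and the `peft` library; the model name, rank, and output path are placeholders, not the
project's actual training recipe:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen RL-pretrained base model (placeholder name).
base = AutoModelForCausalLM.from_pretrained("rl-pretrained-rhythm-model")

lora_cfg = LoraConfig(
    r=8,                 # small rank: a few weeks of one user's data is tiny
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# ... train `model` on this user's Accept/Ignore interaction log here,
# then save only the per-user adapter, leaving the base weights untouched.
model.save_pretrained("adapters/user_1234")
```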
---
## Does RL Pretraining Help Personalize Faster?
Yes. This is the most important architectural property of the system.
### The Core Insight
The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.
```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```
The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem: what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.
### The Formal Analogy
This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or fine-tune) is the inner-loop adaptation.
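For reference, MAML's meta-objective (Finn et al., 2017) makes the analogy concrete, with θ as
the RL-pretrained weights, each task T_i a real user, and the inner gradient step standing in
for the few-shot personalization:

```latex
% Standard MAML meta-objective; alpha is the inner-loop learning rate.
\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
  \mathcal{L}_{\mathcal{T}_i}\bigl(\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta)\bigr)
```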
The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.
---
## Full Architecture: Three Phases
```
Phase 1 – Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 – Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 – Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```
The RL pretraining writes the chapter on "how to read a person" into the model's weights.
When it meets a real user, it is not starting from zero: it already knows what patterns to
look for and what they mean.