# From Simulation to Real Product – Deployment Architecture

## The Question

The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch, or does the RL pretraining give it a head start?

---

## The Real-World Pipeline

The trained model sits at the center of a three-part bridge:

```
Real world                 Bridge layer              Trained model
───────────────────        ──────────────────        ───────────────────
Apple Watch / Whoop   →    meter proxy mapping   →
  HRV, resting HR                                    State observation
  sleep score                                        (same 5 meters the
  steps, activity     →    Vitality, Serenity        model trained on)
  stress score

Google Calendar       →    task → action type    →
  "Q3 planning doc"          DEEP_WORK                Model infers profile
  "Gym - 7am"                EXERCISE                 from Accept/Ignore
  "1:1 with Sarah"           SOCIALIZE                history, adapts
  "Date night"               FAMILY_TIME              recommendations
  "No meeting block"         ME_TIME / SLEEP

User taps             →    +1 / -1 reward        →   Model refines its
  Accept                                             model of this person
  Ignore / Reschedule
```

### Meter Proxies from Devices

Each of the 5 simulated meters has a real-world proxy that devices already measure:

| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), inverse of meeting density |
| Connection | Social calendar events in past 7 days, message activity |

The agent never asks how you feel. The devices already know.
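
As a concrete illustration of the proxy mapping, here is a minimal sketch of how two of the
meters might be computed from a wearable snapshot. Everything in it is an assumption for
illustration: the field names, normalization constants, and blend weights are placeholders,
not the product's actual schema.

```python
from dataclasses import dataclass

@dataclass
class WearableSnapshot:
    sleep_score: float   # 0-100 sleep score from the device
    hrv_ms: float        # heart-rate variability, milliseconds
    resting_hr: float    # resting heart rate, beats per minute
    steps: int           # step count for the day
    stress_score: float  # 0-100 stress score (Whoop/Garmin style)

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def vitality_proxy(w: WearableSnapshot) -> float:
    """Blend sleep, HRV, resting HR, and steps into one 0-1 Vitality meter."""
    sleep = w.sleep_score / 100.0
    hrv = clamp01(w.hrv_ms / 100.0)              # treat ~100 ms as the ceiling
    rhr = clamp01((80.0 - w.resting_hr) / 30.0)  # lower resting HR scores higher
    steps = clamp01(w.steps / 10_000)
    return clamp01(0.40 * sleep + 0.25 * hrv + 0.20 * rhr + 0.15 * steps)

def serenity_proxy(w: WearableSnapshot, meetings_today: int) -> float:
    """Invert the stress score, discounted by calendar meeting density."""
    calm = 1.0 - w.stress_score / 100.0
    meeting_load = clamp01(meetings_today / 8.0)  # 8+ meetings = fully loaded day
    return clamp01(0.7 * calm + 0.3 * (1.0 - meeting_load))
```

Whatever the exact weights, the important constraint is that the proxies land on the same
0-1 scale and move with roughly the same dynamics as the simulator's meters; otherwise the
frozen model's priors would not transfer.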

### Calendar → Action Types

A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:

```
"Q3 planning doc"      → DEEP_WORK
"Team standup"         → ADMIN_WORK
"Learn Python course"  → LEARN
"Gym block"            → EXERCISE
"Journaling"           → MEDITATE / ME_TIME
"Dinner with family"   → FAMILY_TIME
"Team social event"    → SOCIALIZE
"No meeting block"     → ME_TIME
"Sleep / recovery day" → SLEEP
```
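
That classifier can be as small as a keyword table. A minimal sketch, with hypothetical
keyword lists; real calendar titles would need fuzzier matching or the cheap LLM call
mentioned above:

```python
# Ordered keyword rules; first match wins. Keywords are illustrative.
KEYWORD_RULES = [
    (("gym", "run", "workout"), "EXERCISE"),
    (("standup", "status", "email"), "ADMIN_WORK"),
    (("learn", "course", "tutorial"), "LEARN"),
    (("journal", "meditat"), "MEDITATE"),
    (("family", "date night", "dinner with"), "FAMILY_TIME"),
    (("social", "1:1", "coffee with"), "SOCIALIZE"),
    (("no meeting",), "ME_TIME"),
    (("sleep", "recovery"), "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title to one of the model's action types."""
    t = title.lower()
    for keywords, action_type in KEYWORD_RULES:
        if any(k in t for k in keywords):
            return action_type
    return "DEEP_WORK"  # default: an untagged solo block is treated as focus work
```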

### Accept / Ignore as the Reward Signal

Every time the assistant makes a recommendation and the user acts on it:

- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"

Over hundreds of these micro-interactions, the model builds a precise picture of who this
person is: not the person they describe themselves to be, but the person revealed by
their actual responses.
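
A sketch of the feedback record this produces, one per recommendation; the names are
assumptions, but the +1 / -1 mapping is the reward signal described above:

```python
from dataclasses import dataclass
from enum import Enum

class UserResponse(Enum):
    ACCEPT = "accept"
    IGNORE = "ignore"
    RESCHEDULE = "reschedule"

@dataclass
class FeedbackEvent:
    timestamp: str            # e.g. "Mon Morning"
    meters: dict[str, float]  # meter proxies at recommendation time
    recommended: str          # action type, e.g. "DEEP_WORK"
    response: UserResponse

    def reward(self) -> float:
        # Accept confirms the model's read; Ignore/Reschedule contradicts it.
        return 1.0 if self.response is UserResponse.ACCEPT else -1.0
```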

---

## Are Weights Updated in Real Time?

No. In training, GRPO runs gradient updates after every batch, so the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.

### Phase 1 – In-Context Adaptation (no gradient updates)

The frozen model reads the user's history as part of each prompt:

```
System: You are a life management agent...

Recent interactions (last 30 days):
  Mon Morning, Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE → user: ACCEPTED
  Thu Morning, Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening, Serenity=0.45 → recommended MEDITATE → user: ACCEPTED
  ...

Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```

The model reads the pattern (morning deep work keeps getting rejected, evening recovery
keeps getting accepted), infers a "morning penalty" profile, and shifts its recommendations.
No gradient update required. This is in-context learning over the Accept/Ignore history.
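
A hypothetical helper showing how that prompt could be assembled from stored history; the
entry format mirrors the example above, and the 30-entry cap is an illustrative choice:

```python
def build_prompt(history: list[tuple[str, str, str, bool]],
                 current_state: str) -> str:
    """history entries: (timestamp, meters_string, action_type, accepted)."""
    lines = [
        "System: You are a life management agent...",
        "",
        "Recent interactions (last 30 days):",
    ]
    for timestamp, meters, action, accepted in history[-30:]:
        outcome = "ACCEPTED" if accepted else "IGNORED"
        lines.append(f"  {timestamp}, {meters} → recommended {action} → user: {outcome}")
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)
```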

### Phase 2 – Periodic Offline Fine-Tuning (background, weekly/monthly)

After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time; it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.
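
A minimal sketch of that background job, assuming the Hugging Face PEFT library; the model
path, adapter rank, and target modules are placeholders:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

user_id = "user_0042"  # placeholder
base = AutoModelForCausalLM.from_pretrained("path/to/rl-pretrained-model")

config = LoraConfig(
    r=8,                                  # small rank: encoding one user's preferences
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # base weights stay frozen; only adapters train

# ... train on this user's Accept/Ignore transcripts, then:
model.save_pretrained(f"adapters/{user_id}")  # the adapter is a few MB, not a full model
```

Keeping the base frozen and shipping per-user adapters is what makes the weekly job cheap:
one shared model in memory, one small adapter per user.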

---

## Does RL Pretraining Help Personalize Faster?

Yes. This is the most important architectural property of the system.

### The Core Insight

The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.

```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```

The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem: what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.

### The Formal Analogy

This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or fine-tune) is the inner-loop adaptation.
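
For reference, this is the standard MAML objective (notation added here, not from the
original): the outer expectation plays the role of simulation training over the profile
distribution, and the inner gradient step stands in for the few-shot personalization,
which in our system is in-context adaptation or a LoRA pass rather than a literal
gradient step.

```latex
\min_{\theta}\;
  \mathbb{E}_{u \sim p(\mathrm{users})}
  \Big[\, \mathcal{L}_u\big(\theta - \alpha \nabla_{\theta} \mathcal{L}_u(\theta)\big) \,\Big]
% theta : RL-pretrained weights      alpha : inner-loop step size
% u     : one user (a "task")        L_u   : loss on that user's Accept/Ignore data
```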

The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.

---

## Full Architecture: Three Phases

```
Phase 1 – Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 – Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 – Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```

The RL pretraining is writing the chapter on "how to read a person" into the model's
weights. When it meets a real user, it is not starting from zero: it already knows what
patterns to look for and what they mean.