InosLihka Claude Sonnet 4.6 committed on
Commit
24adee5
·
1 Parent(s): fb112e4

docs: add sim-to-real deployment architecture reference


Captures the real-world product pipeline: wearable proxies for meters,
calendar-to-action-type mapping, Accept/Ignore as reward signal. Also
covers why the frozen model adapts in-context and why RL pretraining
enables fast personalization (MAML analogy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs/references/sim_to_real_deployment.md ADDED
@@ -0,0 +1,182 @@
# From Simulation to Real Product — Deployment Architecture

## The Question

The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch — or does the RL pretraining give it a head start?

---

## The Real-World Pipeline

The trained model sits at the center of a three-part bridge:

```
Real world                Bridge layer              Trained model
──────────────────        ─────────────────         ──────────────────
Apple Watch / Whoop   →   meter proxy mapping   →   State observation
  HRV, resting HR           Vitality, Serenity        (same 5 meters the
  sleep score               ...                       model trained on)
  steps, activity
  stress score

Google Calendar       →   task → action type    →   Model infers profile
  "Q3 planning doc"         DEEP_WORK                 from Accept/Ignore
  "Gym - 7am"               EXERCISE                  history, adapts
  "1:1 with Sarah"          SOCIALIZE                 recommendations
  "Date night"              FAMILY_TIME
  "No meeting block"        ME_TIME / SLEEP

User taps             →   +1 / -1 reward        →   Model refines its
  Accept                                              model of this person
  Ignore / Reschedule
```

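In code, the right-hand column could be as small as an observation record. A minimal sketch, assuming hypothetical field names (the real format is whatever the training environment emitted):

```python
# Hypothetical shape of the bridge-layer output; field names are assumptions.
from dataclasses import dataclass

@dataclass
class StateObservation:
    day: str           # e.g. "Tuesday"
    slot: str          # e.g. "Morning", "Afternoon", "Evening"
    vitality: float    # each meter normalized to 0.0-1.0, matching
    cognition: float   # the five simulated meters the model saw
    progress: float    # during RL training
    serenity: float
    connection: float
```
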
### Meter Proxies from Devices

Each of the 5 simulated meters has a real-world proxy that devices already measure:

| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), inverse of meeting density |
| Connection | Social calendar events in past 7 days, message activity |

The agent never asks how you feel. The devices already know.

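As one concrete illustration, a Vitality proxy might blend those sources as below. The blend weights and normalization constants are invented for this sketch, not the product's actual calibration:

```python
# Illustrative only: weights and normalization constants are assumptions.
def vitality_proxy(sleep_score: float, hrv_ms: float,
                   resting_hr: float, steps: int) -> float:
    """Blend raw wearable metrics into the 0-1 Vitality meter."""
    sleep = sleep_score / 100.0                  # vendor sleep scores are 0-100
    hrv = min(hrv_ms / 100.0, 1.0)               # cap the HRV contribution at 100 ms
    hr = max(0.0, 1.0 - (resting_hr - 40) / 60)  # 40 bpm -> 1.0, 100 bpm -> 0.0
    activity = min(steps / 10_000, 1.0)          # 10k steps saturates the term
    return 0.4 * sleep + 0.25 * hrv + 0.2 * hr + 0.15 * activity
```
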
### Calendar → Action Types

A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:

```
"Q3 planning doc"      → DEEP_WORK
"Team standup"         → ADMIN_WORK
"Learn Python course"  → LEARN
"Gym block"            → EXERCISE
"Journaling"           → MEDITATE / ME_TIME
"Dinner with family"   → FAMILY_TIME
"Team social event"    → SOCIALIZE
"No meeting block"     → ME_TIME
"Sleep / recovery day" → SLEEP
```

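The keyword-rule flavor can be tiny. A sketch, where the rule table and the fallback choice are assumptions:

```python
# Keyword rules and the fallback are illustrative assumptions, not product rules.
KEYWORD_RULES: list[tuple[tuple[str, ...], str]] = [
    (("standup", "status", "email"), "ADMIN_WORK"),
    (("gym", "run", "workout"), "EXERCISE"),
    (("course", "learn", "study"), "LEARN"),
    (("journal", "meditat"), "MEDITATE"),
    (("family", "date night"), "FAMILY_TIME"),
    (("1:1", "social", "party", "drinks"), "SOCIALIZE"),
    (("no meeting", "me time"), "ME_TIME"),
    (("sleep", "recovery"), "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title onto one of the model's action types."""
    lowered = title.lower()
    for keywords, action in KEYWORD_RULES:
        if any(k in lowered for k in keywords):
            return action
    return "DEEP_WORK"  # assumed default for untitled focus blocks
```
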
67
+
68
+ ### Accept / Ignore as the Reward Signal
69
+
70
+ Every time the assistant makes a recommendation and the user acts on it:
71
+ - **Accept** β†’ "you read me right"
72
+ - **Ignore / Reschedule** β†’ "you got something wrong about me"
73
+
74
+ Over hundreds of these micro-interactions, the model builds a precise model of who this
75
+ person is β€” not the person they describe themselves to be, but the person revealed by
76
+ their actual responses.
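Translated into the scalar reward the model was trained against, that feedback could be as simple as the sketch below; the enum and the softer penalty for reschedules are assumptions:

```python
# Reward values are assumptions; only the +1/-1 endpoints come from the design.
from enum import Enum

class UserResponse(Enum):
    ACCEPT = "accept"
    RESCHEDULE = "reschedule"
    IGNORE = "ignore"

def response_to_reward(response: UserResponse) -> float:
    """Accept confirms the current profile hypothesis; Ignore contradicts it."""
    if response is UserResponse.ACCEPT:
        return 1.0
    if response is UserResponse.RESCHEDULE:
        return -0.5  # assumed: softer signal than an outright ignore
    return -1.0
```
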
---

## Are Weights Updated in Real Time?

No. In training, GRPO runs gradient updates after every batch — the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.

### Phase 1 — In-Context Adaptation (no gradient updates)

The frozen model reads the user's history as part of each prompt:

```
System: You are a life management agent...

Recent interactions (last 30 days):
Mon Morning,   Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
Mon Afternoon, Vitality=0.70 → recommended EXERCISE  → user: ACCEPTED
Thu Morning,   Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
Sat Evening,   Serenity=0.45 → recommended MEDITATE  → user: ACCEPTED
...

Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```

The model reads the pattern — morning deep work keeps getting rejected, evening recovery
keeps getting accepted — infers a "morning penalty" profile, and shifts its recommendations
accordingly. No gradient update required. This is in-context learning over the
Accept/Ignore history.

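Assembling that prompt is plain string work. A sketch, assuming a hypothetical log schema (`slot`, `vitality`, `action`, `accepted`):

```python
# The history record schema here is an assumption about the interaction log.
def build_prompt(history: list[dict], current_state: str) -> str:
    """Render the Accept/Ignore history into the frozen model's prompt."""
    lines = [
        "System: You are a life management agent...",
        "",
        "Recent interactions (last 30 days):",
    ]
    for h in history[-30:]:
        outcome = "ACCEPTED" if h["accepted"] else "IGNORED"
        lines.append(
            f"{h['slot']}, Vitality={h['vitality']:.2f} "
            f"-> recommended {h['action']} -> user: {outcome}"
        )
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)
```
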
### Phase 2 — Periodic Offline Fine-Tuning (background, weekly/monthly)

After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time — it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.

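One plausible implementation of that job uses Hugging Face `peft`; the hyperparameters, target modules, training loop, and adapter path below are all illustrative assumptions:

```python
# Sketch of the scheduled per-user job; every knob here is an assumed value.
from peft import LoraConfig, get_peft_model

def weekly_personalization_job(base_model, user_id: str, train_fn) -> None:
    """Fit a small user-specific adapter while the base weights stay frozen."""
    config = LoraConfig(
        r=8,                                  # low-rank dimension (assumed)
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, config)
    train_fn(model)  # standard SFT pass over (prompt, accepted action) pairs
    model.save_pretrained(f"adapters/{user_id}")  # saves only the adapter
```
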
---

## Does RL Pretraining Help Personalize Faster?

Yes. This is the most important architectural property of the system.

### The Core Insight

The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.

```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile — likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```

The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem — what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.

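One way to picture what those priors buy: if the adaptation were written out as explicit Bayesian updating over the trained archetypes, three observations would already concentrate the belief. The numbers and archetype names below are invented for illustration; the network does this implicitly, not as literal Bayes:

```python
# Toy illustration only; nothing here is literally computed by the network.
def update_belief(belief: dict[str, float],
                  likelihood: dict[str, float]) -> dict[str, float]:
    """One Bayes step over archetype hypotheses: posterior ~ likelihood * prior."""
    posterior = {p: belief[p] * likelihood[p] for p in belief}
    total = sum(posterior.values())
    return {p: v / total for p, v in posterior.items()}

# Invented: how likely each archetype is to ignore a morning DEEP_WORK.
ignores_morning_deep_work = {"morning_penalty": 0.9,
                             "low_social_tolerance": 0.3,
                             "workaholic": 0.1}

belief = {p: 1 / 3 for p in ignores_morning_deep_work}  # uniform prior
for _ in range(3):                                      # three ignored mornings
    belief = update_belief(belief, ignores_morning_deep_work)
# belief["morning_penalty"] is now ~0.96 after just three observations
```
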
### The Formal Analogy

This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or via fine-tuning) is the inner-loop adaptation.

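In pseudo-Python the correspondence looks like this. Every callable is a hypothetical stand-in, passed in as an argument to keep the sketch self-contained; the real outer loop is GRPO over simulated episodes, not literal MAML:

```python
import random

def meta_train(model, archetypes, simulate_episode, outer_update, steps=10_000):
    """Outer loop: each sampled profile plays the role of a MAML 'task'."""
    for _ in range(steps):
        profile = random.choice(archetypes)         # draw a task from the
        episode = simulate_episode(model, profile)  # meta-training distribution
        outer_update(model, episode)                # GRPO batch update here
    return model                                    # frozen weights ship to prod

def personalize(model, user_history, inner_adapt):
    """Inner loop: few-shot adaptation to one real user (in-context or LoRA)."""
    return inner_adapt(model, user_history)
```
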
The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.

---

## Full Architecture: Three Phases

```
Phase 1 — Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 — Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 — Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```

The RL pretraining writes the chapter on "how to read a person" into the model's
weights. When it meets a real user, it is not starting from zero — it already knows what
patterns to look for and what they mean.