# From Simulation to Real Product — Deployment Architecture

## The Question

The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch — or does the RL pretraining give it a head start?

---

## The Real-World Pipeline

The trained model sits at the center of a three-part bridge:

```
Real world                  Bridge layer               Trained model
──────────────────          ─────────────────          ──────────────────
Apple Watch / Whoop    →    meter proxy mapping   →    State observation
  HRV, resting HR                                      (same 5 meters the
  sleep score                                          model trained on)
  steps, activity      →    Vitality, Serenity
  stress score

Google Calendar        →    task → action type    →    Model infers profile
  "Q3 planning doc"         DEEP_WORK                  from Accept/Ignore
  "Gym - 7am"               EXERCISE                   history, adapts
  "1:1 with Sarah"          SOCIALIZE                  recommendations
  "Date night"              FAMILY_TIME
  "No meeting block"        ME_TIME / SLEEP

User taps              →    +1 / -1 reward        →    Model refines its
  Accept                                               model of this person
  Ignore / Reschedule
```

### Meter Proxies from Devices

Each of the 5 simulated meters has a real-world proxy that devices already measure:

| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), meeting density inverse |
| Connection | Social calendar events in past 7 days, message activity |

The agent never asks how you feel. The devices already know.
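As a concrete sketch, the proxy table can be collapsed into a single mapping function. All field names, weights, and normalization caps below are illustrative assumptions, not any real vendor API:

```python
def meters_from_wearable(sample: dict) -> dict:
    """Map raw device metrics onto the 5 simulated meters, each in [0, 1].

    Field names and weightings are illustrative placeholders; caps assume
    typical consumer-device ranges. Some table columns (e.g. calendar
    density for Cognition) are omitted here to keep the sketch short.
    """
    def clamp(x):
        return max(0.0, min(1.0, x))

    return {
        # Vitality: recovery-oriented signals
        "vitality": clamp(0.4 * sample["sleep_score"] / 100
                          + 0.3 * min(sample["hrv_ms"], 100) / 100
                          + 0.3 * min(sample["steps"], 10_000) / 10_000),
        # Cognition: deep sleep plus time since last break
        "cognition": clamp(0.6 * sample["deep_sleep_pct"] / 100
                           + 0.4 * (1 - min(sample["mins_since_break"], 180) / 180)),
        # Progress: task completion and logged focus time
        "progress": clamp(0.7 * sample["task_completion_rate"]
                          + 0.3 * min(sample["focus_mins"], 240) / 240),
        # Serenity: inverse of stress and meeting density
        "serenity": clamp(1.0
                          - 0.5 * sample["stress_score"] / 100
                          - 0.5 * sample["meeting_density"]),
        # Connection: recent social calendar activity
        "connection": clamp(min(sample["social_events_7d"], 5) / 5),
    }
```

The clamping keeps every meter in the same [0, 1] range the model saw during simulation, so the frozen policy never sees out-of-distribution observations.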

### Calendar β†’ Action Types

A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:

```
"Q3 planning doc"        →  DEEP_WORK
"Team standup"           →  ADMIN_WORK
"Learn Python course"    →  LEARN
"Gym block"              →  EXERCISE
"Journaling"             →  MEDITATE / ME_TIME
"Dinner with family"     →  FAMILY_TIME
"Team social event"      →  SOCIALIZE
"No meeting block"       →  ME_TIME
"Sleep / recovery day"   →  SLEEP
```
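A minimal keyword-rule version of that classifier might look like the following. The rule table and the UNKNOWN fallback (where the cheap LLM call would take over) are assumptions for illustration:

```python
# Ordered keyword rules; first match wins. Unmatched titles fall through
# to UNKNOWN, which is where a cheap LLM call would classify the long tail.
RULES = [
    (("planning", "doc", "write", "deep work"), "DEEP_WORK"),
    (("standup", "sync", "email", "admin"),     "ADMIN_WORK"),
    (("learn", "course", "tutorial"),           "LEARN"),
    (("gym", "workout", "run"),                 "EXERCISE"),
    (("meditat", "journal"),                    "MEDITATE"),
    (("family", "date night"),                  "FAMILY_TIME"),
    (("social", "party", "1:1"),                "SOCIALIZE"),
    (("no meeting", "me time"),                 "ME_TIME"),
    (("sleep", "recovery"),                     "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title to one of the model's action types."""
    t = title.lower()
    for keywords, action in RULES:
        if any(k in t for k in keywords):
            return action
    return "UNKNOWN"  # escalate to the LLM classifier
```

Because this runs outside the trained model, the rules can be revised freely without touching the policy weights.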

### Accept / Ignore as the Reward Signal

Every time the assistant makes a recommendation and the user acts on it:
- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"

Over hundreds of these micro-interactions, the assistant builds a precise picture of who
this person is — not the person they describe themselves to be, but the person revealed
by their actual responses.
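One way to log these micro-interactions is a small record type whose reward field mirrors the training signal; the field names are a sketch, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One recommendation and the user's response to it."""
    slot: str          # e.g. "Mon Morning"
    meters: dict       # meter snapshot when the recommendation was made
    recommended: str   # action type, e.g. "DEEP_WORK"
    response: str      # "ACCEPTED", "IGNORED", or "RESCHEDULED"

    @property
    def reward(self) -> float:
        # Accept means "you read me right"; anything else is a miss.
        return 1.0 if self.response == "ACCEPTED" else -1.0
```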

---

## Are Weights Updated in Real Time?

No. In training, GRPO runs gradient updates after every batch — the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.

### Phase 1 — In-Context Adaptation (no gradient updates)

The frozen model reads the user's history as part of each prompt:

```
System: You are a life management agent...

Recent interactions (last 30 days):
  Mon Morning, Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE → user: ACCEPTED
  Thu Morning, Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening, Serenity=0.45 → recommended MEDITATE → user: ACCEPTED
  ...

Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```

The model reads the pattern — morning deep work keeps getting rejected, evening recovery
keeps getting accepted — infers "morning penalty profile", and shifts its recommendations.
No gradient update required. This is in-context learning over the Accept/Ignore history.
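Mechanically, this path is just prompt assembly over the interaction log. A sketch, with a made-up template that mirrors the example above:

```python
def build_prompt(history, current_state, window=30):
    """Assemble the frozen model's prompt from logged interactions.

    `history` holds (slot, meters_str, action, response) tuples; only the
    most recent `window` entries are included so the prompt stays bounded.
    The template itself is illustrative, not a fixed production format.
    """
    lines = [
        "System: You are a life management agent...",
        "",
        f"Recent interactions (last {window}):",
    ]
    for slot, meters, action, response in history[-window:]:
        lines.append(f"  {slot}, {meters} -> recommended {action} -> user: {response}")
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)
```

Bounding the window is what Phase 2 below relaxes: once preferences live in an adapter, the history section can shrink or disappear.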

### Phase 2 — Periodic Offline Fine-Tuning (background, weekly/monthly)

After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time — it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.
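The weekly job's input is the same interaction log reshaped into supervised examples. A hedged sketch: the labeling scheme below (accepted recommendations as positives, ignored ones as negatives for a contrastive or preference-style objective) is one plausible design rather than the source's specification:

```python
def to_finetune_examples(history):
    """Reshape Accept/Ignore logs into labeled pairs for a LoRA pass.

    Accepted recommendations become positive targets; ignored or
    rescheduled ones are kept as negatives. `history` entries are
    (slot, meters_str, action, response) tuples.
    """
    examples = []
    for slot, meters, action, response in history:
        examples.append({
            "prompt": f"{slot}, {meters}\nRecommend an action:",
            "completion": action,
            "label": 1 if response == "ACCEPTED" else 0,
        })
    return examples
```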

---

## Does RL Pretraining Help Personalize Faster?

Yes. This is the most important architectural property of the system.

### The Core Insight

The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.

```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile — likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```

The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem β€” what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.
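The head start can be made concrete as a posterior over archetypes. The likelihood table below is a hypothetical stand-in for what the RL pretraining encodes; the point is that with informative likelihoods, three observations already concentrate the posterior:

```python
# Hypothetical accept probabilities P(accept | action, slot, archetype),
# standing in for what the RL-pretrained model learned in simulation.
ACCEPT_PROB = {
    "morning_lark": {("DEEP_WORK", "Morning"): 0.85, ("DEEP_WORK", "Evening"): 0.30},
    "night_owl":    {("DEEP_WORK", "Morning"): 0.15, ("DEEP_WORK", "Evening"): 0.80},
    "extrovert":    {("DEEP_WORK", "Morning"): 0.35, ("SOCIALIZE", "Evening"): 0.90},
}

def archetype_posterior(observations):
    """Bayes update over archetypes from (action, slot, accepted) logs,
    starting from a uniform prior. Unseen (action, slot) pairs fall back
    to an uninformative 0.5."""
    probs = {a: 1.0 / len(ACCEPT_PROB) for a in ACCEPT_PROB}
    for action, slot, accepted in observations:
        for a in probs:
            p = ACCEPT_PROB[a].get((action, slot), 0.5)
            probs[a] *= p if accepted else (1.0 - p)
    z = sum(probs.values())
    return {a: v / z for a, v in probs.items()}

# Three ignored morning DEEP_WORK recommendations already single out
# the late-chronotype hypothesis.
post = archetype_posterior([("DEEP_WORK", "Morning", False)] * 3)
```

A base model with no such likelihood structure is effectively doing this update with flat likelihoods, which is why it needs an order of magnitude more samples.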

### The Formal Analogy

This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or fine-tune) is the inner-loop adaptation.
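In standard notation (not taken from the source), MAML's bi-level objective is:

```
\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
    \mathcal{L}_{\mathcal{T}_i}\bigl(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)\bigr)
```

Here p(T) is the meta-training distribution (the simulated archetypes), each task T_i is one user, and the inner step (theta minus a small gradient on that user's data) is the few-shot adaptation, whether realized as in-context learning or a small LoRA update.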

The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.

---

## Full Architecture: Three Phases

```
Phase 1 — Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 — Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 — Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```

The RL pretraining writes the chapter on "how to read a person" into the model's
weights. When it meets a real user, it is not starting from zero — it already knows what
patterns to look for and what they mean.