# From Simulation to Real Product — Deployment Architecture

## The Question

The training environment teaches a model to discover hidden personality profiles through reward signals alone. But once training is done, how does that actually connect to a real person's calendar, wearables, and daily life? And when the model meets a new real user, does it retrain from scratch — or does the RL pretraining give it a head start?

---

## The Real-World Pipeline

The trained model sits at the center of a three-part bridge:

```
Real world              Bridge layer             Trained model
──────────────────      ─────────────────        ──────────────────
Apple Watch / Whoop  →  meter proxy mapping  →   State observation
  HRV, resting HR         sleep score              (same 5 meters the
  steps, activity    →    Vitality, Serenity       model trained on)
  stress score

Google Calendar      →  task → action type   →   Model infers profile
  "Q3 planning doc"       DEEP_WORK                from Accept/Ignore
  "Gym - 7am"             EXERCISE                 history, adapts
  "1:1 with Sarah"        SOCIALIZE                recommendations
  "Date night"            FAMILY_TIME
  "No meeting block"      ME_TIME / SLEEP

User taps            →  +1 / -1 reward       →   Model refines its
  Accept                                           model of this person
  Ignore / Reschedule
```

### Meter Proxies from Devices

Each of the 5 simulated meters has a real-world proxy that devices already measure:

| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), meeting density inverse |
| Connection | Social calendar events in past 7 days, message activity |

The agent never asks how you feel. The devices already know.

### Calendar → Action Types

A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps calendar events to the 10 action types the model understands:

```
"Q3 planning doc"      → DEEP_WORK
"Team standup"         → ADMIN_WORK
"Learn Python course"  → LEARN
"Gym block"            → EXERCISE
"Journaling"           → MEDITATE / ME_TIME
"Dinner with family"   → FAMILY_TIME
"Team social event"    → SOCIALIZE
"No meeting block"     → ME_TIME
"Sleep / recovery day" → SLEEP
```
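The keyword-rule version of that classifier can be a few dozen lines. A minimal sketch, assuming hypothetical names (`ACTION_KEYWORDS`, `classify_event`) and illustrative patterns rather than the production rule set:

```python
# Hypothetical sketch of the keyword-rule path; names and patterns are
# illustrative, not taken from the actual codebase.
import re

# Ordered (regex, action type) rules; the first match wins.
ACTION_KEYWORDS = [
    (r"standup|status|admin|email",       "ADMIN_WORK"),
    (r"planning|design|write|doc|focus",  "DEEP_WORK"),
    (r"course|learn|tutorial|study",      "LEARN"),
    (r"gym|run|workout|yoga",             "EXERCISE"),
    (r"journal|meditat",                  "MEDITATE"),
    (r"family|date night|dinner with",    "FAMILY_TIME"),
    (r"social|1:1|coffee with|party",     "SOCIALIZE"),
    (r"no meeting|me time|recharge",      "ME_TIME"),
    (r"sleep|recovery",                   "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title to one of the model's action types."""
    lowered = title.lower()
    for pattern, action in ACTION_KEYWORDS:
        if re.search(pattern, lowered):
            return action
    return "ADMIN_WORK"  # fallback when no rule matches

print(classify_event("Q3 planning doc"))  # DEEP_WORK
print(classify_event("Gym block"))        # EXERCISE
```

Because the first match wins, more specific patterns should sit above generic ones, and anything unmatched falls through to a default bucket.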
### Accept / Ignore as the Reward Signal

Every time the assistant makes a recommendation and the user acts on it:

- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"

Over hundreds of these micro-interactions, the model builds a precise picture of who this person is — not the person they describe themselves to be, but the person revealed by their actual responses.

---

## Are Weights Updated in Real Time?

No. In training, GRPO runs gradient updates after every batch — the model's weights change continuously. In production, the deployed model is **frozen**. Personalization happens in the context window, not in the weights.

### Phase 2 — In-Context Adaptation (no gradient updates)

The frozen model reads the user's history as part of each prompt:

```
System: You are a life management agent...

Recent interactions (last 30 days):
  Mon Morning,   Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE  → user: ACCEPTED
  Thu Morning,   Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening,   Serenity=0.45 → recommended MEDITATE  → user: ACCEPTED
  ...

Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```

The model reads the pattern — morning deep work keeps getting rejected, evening recovery keeps getting accepted — infers a "morning penalty" profile, and shifts its recommendations. No gradient update is required. This is in-context learning over the Accept/Ignore history.

### Phase 3 — Periodic Offline Fine-Tuning (background, weekly/monthly)

After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background on that specific user's interactions. The updated adapter encodes their preferences into the weights. This is not real-time — it is a scheduled job, much like a weekly model refresh. The resulting model is more accurate from the first inference step, without needing to re-read 30 days of context every time.

---

## Does RL Pretraining Help Personalize Faster?

Yes. This is the most important architectural property of the system.

### The Core Insight

The RL training in simulation does not teach the model facts about specific people. It teaches a **skill**: how to detect a person's hidden preferences from their differential responses to actions.

```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile —
  likely extrovert or late chronotype."
  Probes with evening DEEP_WORK. Confirms in 5–8 interactions.
```

The simulation trained the model on thousands of episodes across three wildly different profiles. It learned the *structure* of the inference problem — what the patterns look like when someone has a morning penalty, when someone's social tolerance is low, when someone's vitality recovers through work rather than being drained by it. When it meets a real user, it is not starting from zero. It has strong priors about what the data means.

### The Formal Analogy

This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it that. MAML trains a model specifically so it can adapt to new tasks with very few gradient steps. Our RL pretraining does the same thing: the simulation is the meta-training distribution, each real user is a new task, and the few-shot personalization (in-context or via fine-tuning) is the inner-loop adaptation.

The sim-to-real transfer works because the skill is structural, not numerical. Whether "vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect that "positive actions are underperforming expectations here, and here is how to probe for the hidden reason" remains valid.

---

## Full Architecture: Three Phases

```
Phase 1 — Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 — Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 — Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```

The RL pretraining writes the chapter on "how to read a person" into the model's weights. When the model meets a real user, it is not starting from zero — it already knows what patterns to look for and what they mean.
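To make Phase 2 concrete, here is a minimal sketch of the in-context adaptation step. The names (`Interaction`, `build_prompt`, `recommend`) and the logging schema are assumptions for illustration, and `frozen_model` stands in for whatever endpoint serves the frozen weights:

```python
# Hypothetical sketch of Phase 2: personalization via the prompt, not the weights.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Interaction:
    day: str          # e.g. "Mon"
    period: str       # e.g. "Morning"
    meters: dict      # e.g. {"Vitality": 0.70}
    recommended: str  # action type the agent suggested
    outcome: str      # "ACCEPTED" or "IGNORED"

def build_prompt(history: list, current_state: str) -> str:
    """Assemble the frozen model's context from the Accept/Ignore log."""
    lines = ["System: You are a life management agent...",
             "", "Recent interactions (last 30 days):"]
    for it in history:
        meters = ", ".join(f"{k}={v:.2f}" for k, v in it.meters.items())
        lines.append(f"  {it.day} {it.period}, {meters} → "
                     f"recommended {it.recommended} → user: {it.outcome}")
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)

def recommend(history, current_state, frozen_model: Callable[[str], str]) -> str:
    # No gradient updates: the user's history is read in-context on every call.
    return frozen_model(build_prompt(history, current_state))
```

The Phase 3 job would then fine-tune a small LoRA adapter on the same logged interactions, so the prompt no longer has to carry the full 30-day history.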