# From Simulation to Real Product – Deployment Architecture

## The Question

The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch, or does the RL pretraining give it a head start?

---

## The Real-World Pipeline

The trained model sits at the center of a three-part bridge:

```
Real world                 Bridge layer              Trained model
───────────────────        ──────────────────        ───────────────────
Apple Watch / Whoop   →    meter proxy mapping   →
  HRV, resting HR                                    State observation
  sleep score                                        (same 5 meters the
  steps, activity     →    Vitality, Serenity        model trained on)
  stress score

Google Calendar       →    task → action type    →
  "Q3 planning doc"          DEEP_WORK                Model infers profile
  "Gym - 7am"                EXERCISE                 from Accept/Ignore
  "1:1 with Sarah"           SOCIALIZE                history, adapts
  "Date night"               FAMILY_TIME              recommendations
  "No meeting block"         ME_TIME / SLEEP

User taps             →    +1 / -1 reward        →   Model refines its
  Accept                                             model of this person
  Ignore / Reschedule
```

### Meter Proxies from Devices

Each of the 5 simulated meters has a real-world proxy that devices already measure:

| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), inverse of meeting density |
| Connection | Social calendar events in past 7 days, message activity |

The agent never asks how you feel. The devices already know.
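
As a concrete illustration of the proxy mapping, here is a minimal sketch of how two of the
meters might be computed from a wearable snapshot. Everything in it is an assumption for
illustration: the field names, normalization constants, and blend weights are placeholders,
not the product's actual schema.

```python
from dataclasses import dataclass

@dataclass
class WearableSnapshot:
    sleep_score: float   # 0-100 sleep score from the device
    hrv_ms: float        # heart-rate variability, milliseconds
    resting_hr: float    # resting heart rate, beats per minute
    steps: int           # step count for the day
    stress_score: float  # 0-100 stress score (Whoop/Garmin style)

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def vitality_proxy(w: WearableSnapshot) -> float:
    """Blend sleep, HRV, resting HR, and steps into one 0-1 Vitality meter."""
    sleep = w.sleep_score / 100.0
    hrv = clamp01(w.hrv_ms / 100.0)              # treat ~100 ms as the ceiling
    rhr = clamp01((80.0 - w.resting_hr) / 30.0)  # lower resting HR scores higher
    steps = clamp01(w.steps / 10_000)
    return clamp01(0.40 * sleep + 0.25 * hrv + 0.20 * rhr + 0.15 * steps)

def serenity_proxy(w: WearableSnapshot, meetings_today: int) -> float:
    """Invert the stress score, discounted by calendar meeting density."""
    calm = 1.0 - w.stress_score / 100.0
    meeting_load = clamp01(meetings_today / 8.0)  # 8+ meetings = fully loaded day
    return clamp01(0.7 * calm + 0.3 * (1.0 - meeting_load))
```

Whatever the exact weights, the important constraint is that the proxies land on the same
0-1 scale and move with roughly the same dynamics as the simulator's meters; otherwise the
frozen model's priors would not transfer.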

### Calendar → Action Types

A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:

```
"Q3 planning doc"      → DEEP_WORK
"Team standup"         → ADMIN_WORK
"Learn Python course"  → LEARN
"Gym block"            → EXERCISE
"Journaling"           → MEDITATE / ME_TIME
"Dinner with family"   → FAMILY_TIME
"Team social event"    → SOCIALIZE
"No meeting block"     → ME_TIME
"Sleep / recovery day" → SLEEP
```
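
That classifier can be as small as a keyword table. A minimal sketch, with hypothetical
keyword lists; real calendar titles would need fuzzier matching or the cheap LLM call
mentioned above:

```python
# Ordered keyword rules; first match wins. Keywords are illustrative.
KEYWORD_RULES = [
    (("gym", "run", "workout"), "EXERCISE"),
    (("standup", "status", "email"), "ADMIN_WORK"),
    (("learn", "course", "tutorial"), "LEARN"),
    (("journal", "meditat"), "MEDITATE"),
    (("family", "date night", "dinner with"), "FAMILY_TIME"),
    (("social", "1:1", "coffee with"), "SOCIALIZE"),
    (("no meeting",), "ME_TIME"),
    (("sleep", "recovery"), "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title to one of the model's action types."""
    t = title.lower()
    for keywords, action_type in KEYWORD_RULES:
        if any(k in t for k in keywords):
            return action_type
    return "DEEP_WORK"  # default: an untagged solo block is treated as focus work
```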

### Accept / Ignore as the Reward Signal

Every time the assistant makes a recommendation and the user acts on it:

- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"

Over hundreds of these micro-interactions, the model builds a precise picture of who this
person is: not the person they describe themselves to be, but the person revealed by
their actual responses.
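
A sketch of the feedback record this produces, one per recommendation; the names are
assumptions, but the +1 / -1 mapping is the reward signal described above:

```python
from dataclasses import dataclass
from enum import Enum

class UserResponse(Enum):
    ACCEPT = "accept"
    IGNORE = "ignore"
    RESCHEDULE = "reschedule"

@dataclass
class FeedbackEvent:
    timestamp: str            # e.g. "Mon Morning"
    meters: dict[str, float]  # meter proxies at recommendation time
    recommended: str          # action type, e.g. "DEEP_WORK"
    response: UserResponse

    def reward(self) -> float:
        # Accept confirms the model's read; Ignore/Reschedule contradicts it.
        return 1.0 if self.response is UserResponse.ACCEPT else -1.0
```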

---

## Are Weights Updated in Real Time?

No. In training, GRPO runs gradient updates after every batch, so the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.

### Phase 1 – In-Context Adaptation (no gradient updates)

The frozen model reads the user's history as part of each prompt:

```
System: You are a life management agent...

Recent interactions (last 30 days):
  Mon Morning, Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE → user: ACCEPTED
  Thu Morning, Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening, Serenity=0.45 → recommended MEDITATE → user: ACCEPTED
  ...

Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```

The model reads the pattern (morning deep work keeps getting rejected, evening recovery
keeps getting accepted), infers a "morning penalty" profile, and shifts its recommendations.
No gradient update required. This is in-context learning over the Accept/Ignore history.
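
A hypothetical helper showing how that prompt could be assembled from stored history; the
entry format mirrors the example above, and the 30-entry cap is an illustrative choice:

```python
def build_prompt(history: list[tuple[str, str, str, bool]],
                 current_state: str) -> str:
    """history entries: (timestamp, meters_string, action_type, accepted)."""
    lines = [
        "System: You are a life management agent...",
        "",
        "Recent interactions (last 30 days):",
    ]
    for timestamp, meters, action, accepted in history[-30:]:
        outcome = "ACCEPTED" if accepted else "IGNORED"
        lines.append(f"  {timestamp}, {meters} → recommended {action} → user: {outcome}")
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)
```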

### Phase 2 – Periodic Offline Fine-Tuning (background, weekly/monthly)

After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time; it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.
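
A minimal sketch of that background job, assuming the Hugging Face PEFT library; the model
path, adapter rank, and target modules are placeholders:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

user_id = "user_0042"  # placeholder
base = AutoModelForCausalLM.from_pretrained("path/to/rl-pretrained-model")

config = LoraConfig(
    r=8,                                  # small rank: encoding one user's preferences
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # base weights stay frozen; only adapters train

# ... train on this user's Accept/Ignore transcripts, then:
model.save_pretrained(f"adapters/{user_id}")  # the adapter is a few MB, not a full model
```

Keeping the base frozen and shipping per-user adapters is what makes the weekly job cheap:
one shared model in memory, one small adapter per user.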

---

## Does RL Pretraining Help Personalize Faster?

Yes. This is the most important architectural property of the system.

### The Core Insight

The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.

```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```

The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem: what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.

### The Formal Analogy

This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or fine-tune) is the inner-loop adaptation.
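
For reference, this is the standard MAML objective (notation added here, not from the
original): the outer expectation plays the role of simulation training over the profile
distribution, and the inner gradient step stands in for the few-shot personalization,
which in our system is in-context adaptation or a LoRA pass rather than a literal
gradient step.

```latex
\min_{\theta}\;
  \mathbb{E}_{u \sim p(\mathrm{users})}
  \Big[\, \mathcal{L}_u\big(\theta - \alpha \nabla_{\theta} \mathcal{L}_u(\theta)\big) \,\Big]
% theta : RL-pretrained weights      alpha : inner-loop step size
% u     : one user (a "task")        L_u   : loss on that user's Accept/Ignore data
```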

The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.

---

## Full Architecture: Three Phases

```
Phase 1 – Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 – Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 – Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```

The RL pretraining is writing the chapter on "how to read a person" into the model's
weights. When it meets a real user, it is not starting from zero: it already knows what
patterns to look for and what they mean.