
# From Simulation to Real Product: Deployment Architecture

## The Question

The training environment teaches a model to discover hidden personality profiles through reward signals alone. But once training is done, how does that actually connect to a real person's calendar, wearables, and daily life? And when the model meets a new real user, does it retrain from scratch, or does the RL pretraining give it a head start?


## The Real-World Pipeline

The trained model sits at the center of a three-part bridge:

```
Real world                  Bridge layer               Trained model
──────────────────          ──────────────────         ──────────────────
Apple Watch / Whoop    →    meter proxy mapping    →
  HRV, resting HR                                      State observation
  sleep score                                          (same 5 meters the
  steps, activity      →    Vitality, Serenity         model trained on)
  stress score

Google Calendar        →    task → action type     →
  "Q3 planning doc"         DEEP_WORK                  Model infers profile
  "Gym - 7am"               EXERCISE                   from Accept/Ignore
  "1:1 with Sarah"          SOCIALIZE                  history, adapts
  "Date night"              FAMILY_TIME                recommendations
  "No meeting block"        ME_TIME / SLEEP

User taps              →    +1 / -1 reward         →   Model refines its
  Accept                                               model of this person
  Ignore / Reschedule
```
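
Concretely, the bridge layer reduces to three small functions. The sketch below is illustrative only: the names (`Observation`, `meters_from_wearables`, `action_type_from_event`, `reward_from_tap`) are assumptions rather than the project's actual API, and each function is expanded in the sections that follow.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Observation:
    """What the frozen model sees at each step: the same 5 meters it was
    trained on, plus the current time slot."""
    meters: Dict[str, float]   # vitality, cognition, progress, serenity, connection (0..1)
    day: str                   # e.g. "Tuesday"
    slot: str                  # e.g. "Morning"


def meters_from_wearables(wearable_metrics: Dict[str, float]) -> Dict[str, float]:
    """Map raw device metrics (HRV, sleep score, steps, ...) onto the 5 meters."""
    ...  # see "Meter Proxies from Devices"


def action_type_from_event(event_title: str) -> str:
    """Classify a calendar event title into one of the 10 action types."""
    ...  # see "Calendar → Action Types"


def reward_from_tap(user_response: str) -> int:
    """Convert Accept / Ignore / Reschedule taps into the +1 / -1 signal."""
    ...  # see "Accept / Ignore as the Reward Signal"
```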

## Meter Proxies from Devices

Each of the 5 simulated meters has a real-world proxy that devices already measure:

| Meter      | Proxy sources                                                    |
|------------|------------------------------------------------------------------|
| Vitality   | Sleep score, HRV, resting heart rate, step count                 |
| Cognition  | Deep sleep %, time since last break, calendar density            |
| Progress   | Task completion rate, deadline proximity, focus time logged      |
| Serenity   | HRV trend, stress score (Whoop/Garmin), meeting density inverse  |
| Connection | Social calendar events in past 7 days, message activity          |

The agent never asks how you feel. The devices already know.
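
A minimal sketch of one possible proxy mapping, assuming the device fields named in the docstring are available; the weights and normalization ranges are illustrative placeholders, not calibrated formulas.

```python
from typing import Dict


def _clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def meters_from_wearables(w: Dict[str, float]) -> Dict[str, float]:
    """Collapse raw device metrics into the 5 meters the model trained on.

    Assumed input keys: sleep_score (0-100), hrv_ms, steps,
    deep_sleep_pct (0-1), stress_score (0-100, higher is worse),
    task_completion_rate (0-1), meeting_density (0-1), social_events_7d.
    """
    return {
        "vitality":   _clamp01(0.4 * w["sleep_score"] / 100
                               + 0.3 * w["hrv_ms"] / 100
                               + 0.3 * w["steps"] / 10_000),
        "cognition":  _clamp01(0.6 * w["deep_sleep_pct"]
                               + 0.4 * (1 - w["meeting_density"])),
        "progress":   _clamp01(w["task_completion_rate"]),
        "serenity":   _clamp01(1 - w["stress_score"] / 100),
        "connection": _clamp01(w["social_events_7d"] / 5),
    }
```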

## Calendar → Action Types

A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps calendar events to the 10 action types the model understands:

"Q3 planning doc"        β†’  DEEP_WORK
"Team standup"           β†’  ADMIN_WORK
"Learn Python course"    β†’  LEARN
"Gym block"              β†’  EXERCISE
"Journaling"             β†’  MEDITATE / ME_TIME
"Dinner with family"     β†’  FAMILY_TIME
"Team social event"      β†’  SOCIALIZE
"No meeting block"       β†’  ME_TIME
"Sleep / recovery day"   β†’  SLEEP

## Accept / Ignore as the Reward Signal

Every time the assistant makes a recommendation and the user acts on it:

  • Accept β†’ "you read me right"
  • Ignore / Reschedule β†’ "you got something wrong about me"

Over hundreds of these micro-interactions, the model builds a precise picture of who this person is: not the person they describe themselves to be, but the person revealed by their actual responses.
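
One way those micro-interactions could be logged, with the ±1 signal attached at read time; the record and field names below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Interaction:
    """A single recommendation/response pair, as logged by the assistant."""
    day: str                  # "Mon"
    slot: str                 # "Morning"
    meters: Dict[str, float]  # the 5 meters at recommendation time
    recommended: str          # action type the model suggested
    response: str             # "accept", "ignore", or "reschedule"

    @property
    def reward(self) -> int:
        # Accept means "you read me right"; Ignore / Reschedule means
        # "you got something wrong about me".
        return 1 if self.response == "accept" else -1
```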


## Are Weights Updated in Real Time?

No. In training, GRPO runs gradient updates after every batch; the model's weights change continuously. In production, the deployed model is frozen. Personalization happens in the context window, not in the weights.

### Phase 1: In-Context Adaptation (no gradient updates)

The frozen model reads the user's history as part of each prompt:

```
System: You are a life management agent...

Recent interactions (last 30 days):
  Mon Morning, Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE → user: ACCEPTED
  Thu Morning, Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening, Serenity=0.45 → recommended MEDITATE → user: ACCEPTED
  ...

Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```

The model reads the pattern (morning deep work keeps getting rejected, evening recovery keeps getting accepted), infers a "morning penalty" profile, and shifts its recommendations. No gradient update required. This is in-context learning over the Accept/Ignore history.
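
A sketch of how that prompt could be assembled from the logged Accept/Ignore history; the dict keys and line format are assumptions chosen to match the example above.

```python
from typing import Dict, List


def build_prompt(history: List[Dict], current_state: str) -> str:
    """history entries look like:
    {"day": "Mon", "slot": "Morning", "vitality": 0.70,
     "recommended": "DEEP_WORK", "accepted": False}
    """
    lines = [
        "System: You are a life management agent...",
        "",
        "Recent interactions (last 30 days):",
    ]
    for h in history[-30:]:  # keep the context window bounded
        outcome = "ACCEPTED" if h["accepted"] else "IGNORED"
        lines.append(
            f"  {h['day']} {h['slot']}, Vitality={h['vitality']:.2f} "
            f"-> recommended {h['recommended']} -> user: {outcome}"
        )
    lines += ["", f"Current state: {current_state}", "Recommend an action:"]
    return "\n".join(lines)
```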

### Phase 2: Periodic Offline Fine-Tuning (background, weekly/monthly)

After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background on that specific user's interactions. The updated adapter encodes their preferences into the weights. This is not real-time; it is a scheduled job, much like a weekly model refresh. The resulting model is then more accurate from inference step 1, without needing to re-read 30 days of context every time.
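
A sketch of what that scheduled job might set up, assuming the Hugging Face `peft` and `transformers` libraries; the checkpoint name, rank, and target modules are placeholders.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Start from the frozen Phase 1 weights (placeholder checkpoint name).
base = AutoModelForCausalLM.from_pretrained("rl-pretrained-checkpoint")

lora = LoraConfig(
    r=8,                       # small rank: the adapter only stores a per-user preference delta
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# ...train on this user's Accept/Ignore records in the background job,
# then deploy only the adapter; the shared base weights never change.
```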


## Does RL Pretraining Help Personalize Faster?

Yes. This is the most important architectural property of the system.

### The Core Insight

The RL training in simulation does not teach the model facts about specific people. It teaches a skill: how to detect a person's hidden preferences from differential responses to actions.

```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```

The simulation trained the model on thousands of episodes across three wildly different profiles. It learned the structure of the inference problem: what patterns look like when someone has a morning penalty, when someone's social tolerance is low, when someone's vitality recovers from work rather than draining. When it meets a real user, it is not starting from zero. It has strong priors about what the data means.

### The Formal Analogy

This is essentially MAML (Model-Agnostic Meta-Learning) without explicitly calling it that. MAML trains a model specifically so it can adapt to new tasks with very few gradient steps. Our RL pretraining does the same thing: the simulation is the meta-training distribution, each real user is a new task, and the few-shot personalization (in-context or fine-tune) is the inner-loop adaptation.

The sim-to-real transfer works because the skill is structural, not numerical. Whether "vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect that "positive actions are underperforming expectations here, and here is how to probe for the hidden reason" remains valid.


## Full Architecture: Three Phases

```
Phase 1: Foundation (simulation)
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2: Deployment (in-context adaptation, day 1)
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3: Personalization (periodic fine-tuning, week 4+)
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```

The RL pretraining writes the chapter on "how to read a person" into the model's weights. When it meets a real user, it is not starting from zero; it already knows what patterns to look for and what they mean.