# From Simulation to Real Product – Deployment Architecture
## The Question
The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch, or does the RL pretraining give it a head start?
---
## The Real-World Pipeline
The trained model sits at the center of a three-part bridge:
```
Real world                Bridge layer              Trained model
────────────────────      ────────────────────      ────────────────────
Apple Watch / Whoop   →   meter proxy mapping   →   State observation
  HRV, resting HR           Vitality, Serenity        (same 5 meters the
  sleep score                                          model trained on)
  steps, activity
  stress score

Google Calendar       →   task → action type    →   Model infers profile
  "Q3 planning doc"         DEEP_WORK                 from Accept/Ignore
  "Gym - 7am"               EXERCISE                  history, adapts
  "1:1 with Sarah"          SOCIALIZE                 recommendations
  "Date night"              FAMILY_TIME
  "No meeting block"        ME_TIME / SLEEP

User taps             →   +1 / -1 reward        →   Model refines its
  Accept                                              model of this person
  Ignore / Reschedule
```
### Meter Proxies from Devices
Each of the 5 simulated meters has a real-world proxy that devices already measure:
| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), inverse of meeting density |
| Connection | Social calendar events in past 7 days, message activity |
The agent never asks how you feel. The devices already know.
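A minimal sketch of that mapping, assuming illustrative field names (`sleep_score`, `hrv_ms`, `stress_score`) and placeholder weights rather than the formulas used in the training environment, might look like this:
```python
from dataclasses import dataclass


@dataclass
class DeviceSnapshot:
    """Raw readings pulled from wearable and calendar APIs (field names are illustrative)."""
    sleep_score: float          # 0-100, Apple Watch / Whoop sleep score
    hrv_ms: float               # heart-rate variability in milliseconds
    resting_hr: float           # beats per minute
    steps: int                  # steps in the past 24 h
    stress_score: float         # 0-100, Whoop / Garmin stress metric
    meeting_hours_today: float  # from the calendar
    social_events_past_7d: int  # from the calendar


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def meters_from_devices(d: DeviceSnapshot) -> dict:
    """Map raw readings onto the simulated meters (0..1); the weights are placeholders."""
    vitality = clamp01(0.5 * d.sleep_score / 100
                       + 0.3 * clamp01(d.hrv_ms / 100)
                       + 0.2 * clamp01(d.steps / 10_000))
    serenity = clamp01(1.0
                       - 0.6 * d.stress_score / 100
                       - 0.4 * clamp01(d.meeting_hours_today / 8))
    connection = clamp01(d.social_events_past_7d / 5)
    # Cognition and Progress would be derived the same way from calendar density
    # and task-completion data (omitted here for brevity).
    return {"vitality": vitality, "serenity": serenity, "connection": connection}
```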
### Calendar → Action Types
A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:
```
"Q3 planning doc" β DEEP_WORK
"Team standup" β ADMIN_WORK
"Learn Python course" β LEARN
"Gym block" β EXERCISE
"Journaling" β MEDITATE / ME_TIME
"Dinner with family" β FAMILY_TIME
"Team social event" β SOCIALIZE
"No meeting block" β ME_TIME
"Sleep / recovery day" β SLEEP
```
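As a rough illustration, the keyword-rule version of that classifier can be very small; the keyword lists below are assumptions, and the cheap LLM call would replace the fallback for titles the rules miss:
```python
# Illustrative keyword rules; a cheap LLM call would handle titles these rules miss.
KEYWORD_RULES = [
    ({"standup", "email", "admin", "expenses"}, "ADMIN_WORK"),
    ({"gym", "run", "workout", "yoga"}, "EXERCISE"),
    ({"learn", "course", "tutorial"}, "LEARN"),
    ({"journal", "meditat"}, "MEDITATE"),
    ({"family", "date night", "dinner with"}, "FAMILY_TIME"),
    ({"social", "drinks", "coffee with", "1:1"}, "SOCIALIZE"),
    ({"no meeting", "me time"}, "ME_TIME"),
    ({"sleep", "recovery"}, "SLEEP"),
    ({"planning", "design doc", "deep work"}, "DEEP_WORK"),
]


def classify_event(title: str) -> str:
    """Map a calendar event title to one of the action types the model understands."""
    lowered = title.lower()
    for keywords, action in KEYWORD_RULES:
        if any(keyword in lowered for keyword in keywords):
            return action
    return "ADMIN_WORK"  # conservative default for unrecognised events
```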
### Accept / Ignore as the Reward Signal
Every time the assistant makes a recommendation and the user acts on it:
- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"
Over hundreds of these micro-interactions, the model builds a precise model of who this
person is: not the person they describe themselves to be, but the person revealed by
their actual responses.
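In code terms, each micro-interaction collapses to a single logged record; the field names and the plain-list storage below are assumptions:
```python
from datetime import datetime, timezone


def log_interaction(store: list, slot: str, state: dict, action: str, response: str) -> float:
    """Turn the user's tap into the same +1 / -1 reward shape the model saw in training."""
    reward = 1.0 if response == "ACCEPT" else -1.0  # IGNORE and RESCHEDULE both read as -1
    store.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "slot": slot,          # e.g. "Mon Morning"
        "state": state,        # the 5 meter values at recommendation time
        "action": action,      # e.g. "DEEP_WORK"
        "response": response,  # "ACCEPT" | "IGNORE" | "RESCHEDULE"
        "reward": reward,
    })
    return reward
```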
---
## Are Weights Updated in Real Time?
No. In training, GRPO runs gradient updates after every batch, so the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.
### Phase 1 – In-Context Adaptation (no gradient updates)
The frozen model reads the user's history as part of each prompt:
```
System: You are a life management agent...
Recent interactions (last 30 days):
Mon Morning, Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
Mon Afternoon, Vitality=0.70 → recommended EXERCISE → user: ACCEPTED
Thu Morning, Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
Sat Evening, Serenity=0.45 → recommended MEDITATE → user: ACCEPTED
...
Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```
The model reads the pattern (morning deep work keeps getting rejected, evening recovery
keeps getting accepted), infers "morning penalty profile", and shifts its recommendations.
No gradient update required. This is in-context learning over the Accept/Ignore history.
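A sketch of how the bridge layer could assemble that prompt from the logged interactions above; the `slot` field, the 200-entry cap, and the exact formatting are assumptions, and the model itself stays frozen:
```python
def build_prompt(history: list, current_state: dict, days: int = 30) -> str:
    """Render recent Accept/Ignore history into the frozen model's context window."""
    lines = [
        "System: You are a life management agent...",
        f"Recent interactions (last {days} days):",
    ]
    for h in history[-200:]:  # cap how much history goes into the context window
        meters = ", ".join(f"{k.capitalize()}={v:.2f}" for k, v in h["state"].items())
        lines.append(f"  {h['slot']}, {meters} -> recommended {h['action']} -> user: {h['response']}")
    meters_now = ", ".join(f"{k.capitalize()}={v:.2f}" for k, v in current_state.items())
    lines.append(f"Current state: {meters_now}")
    lines.append("Recommend an action:")
    return "\n".join(lines)
```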
### Phase 2 – Periodic Offline Fine-Tuning (background, weekly/monthly)
After 3β4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time β it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.
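A sketch of what that background job might look like with Hugging Face `peft`; the model path, hyperparameters, and dataset interface are placeholders, and the supervised fine-tuning loop itself is elided:
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model


def weekly_personalization_job(base_model_path: str, user_id: str, user_dataset) -> None:
    """Fit a small user-specific LoRA adapter on that user's Accept/Ignore history."""
    model = AutoModelForCausalLM.from_pretrained(base_model_path)
    lora_cfg = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # adapting only attention projections keeps the adapter small
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    # Supervised fine-tuning on (prompt, accepted-action) pairs from user_dataset goes here:
    # a few epochs with a standard causal-LM loss, because the RL pretraining already built
    # the prior. The training loop is omitted for brevity.
    model.save_pretrained(f"adapters/{user_id}")  # writes only the adapter weights, not the base model
```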
---
## Does RL Pretraining Help Personalize Faster?
Yes. This is the most important architectural property of the system.
### The Core Insight
The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.
```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40-50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5-8 interactions.
```
The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem: what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.
### The Formal Analogy
This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or fine-tune) is the inner-loop adaptation.
The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.
---
## Full Architecture: Three Phases
```
Phase 1 – Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 – Deployment (in-context adaptation, day 1):
  New user → 5-10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 – Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```
The RL pretraining writes the chapter on "how to read a person" into the model's weights.
When the model meets a real user, it is not starting from zero; it already knows what
patterns to look for and what they mean.