# From Simulation to Real Product: Deployment Architecture
## The Question
The training environment teaches a model to discover hidden personality profiles through reward
signals alone. But once training is done, how does that actually connect to a real person's
calendar, wearables, and daily life? And when the model meets a new real user, does it
retrain from scratch, or does the RL pretraining give it a head start?
---
## The Real-World Pipeline
The trained model sits at the center of a three-part bridge:
```
Real world               Bridge layer               Trained model
──────────────────       ─────────────────          ──────────────────
Apple Watch / Whoop  →   meter proxy mapping   →    State observation
  HRV, resting HR                                    (same 5 meters the
  sleep score                                        model trained on)
  steps, activity    →   Vitality, Serenity
  stress score

Google Calendar      →   task → action type    →    Model infers profile
  "Q3 planning doc"        DEEP_WORK                 from Accept/Ignore
  "Gym - 7am"              EXERCISE                  history, adapts
  "1:1 with Sarah"         SOCIALIZE                 recommendations
  "Date night"             FAMILY_TIME
  "No meeting block"       ME_TIME / SLEEP

User taps            →   +1 / -1 reward        →    Model refines its
  Accept                                             model of this person
  Ignore / Reschedule
```
### Meter Proxies from Devices
Each of the 5 simulated meters has a real-world proxy that devices already measure:
| Meter | Proxy sources |
|---|---|
| Vitality | Sleep score, HRV, resting heart rate, step count |
| Cognition | Deep sleep %, time since last break, calendar density |
| Progress | Task completion rate, deadline proximity, focus time logged |
| Serenity | HRV trend, stress score (Whoop/Garmin), meeting density inverse |
| Connection | Social calendar events in past 7 days, message activity |
The agent never asks how you feel. The devices already know.
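As a sketch of what this mapping could look like in code (the blend weights, field names, and
function names below are illustrative assumptions, not values taken from the training
environment):

```python
# Illustrative meter-proxy mapping, assuming device readings already
# normalized to [0, 1]. Weights and field names are placeholders.
def vitality_proxy(sleep_score: float, hrv: float, resting_hr: float, steps: float) -> float:
    # A lower resting heart rate is better, so that signal is inverted.
    return 0.4 * sleep_score + 0.25 * hrv + 0.2 * (1.0 - resting_hr) + 0.15 * steps

def serenity_proxy(hrv_trend: float, stress: float, meeting_density: float) -> float:
    # Stress and a packed calendar pull Serenity down.
    return 0.4 * hrv_trend + 0.3 * (1.0 - stress) + 0.3 * (1.0 - meeting_density)

def observation(device: dict, calendar: dict) -> dict:
    """Build the state observation the frozen model expects: the 5 meters from simulation."""
    return {
        "vitality": vitality_proxy(device["sleep_score"], device["hrv"],
                                   device["resting_hr"], device["steps"]),
        "serenity": serenity_proxy(device["hrv_trend"], device["stress"],
                                   calendar["meeting_density"]),
        # Cognition, Progress, and Connection follow the proxy table above
        # in the same way.
    }
```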
### Calendar → Action Types
A lightweight classifier (keyword rules or a cheap LLM call, not the trained model) maps
calendar events to the 10 action types the model understands:
```
"Q3 planning doc" β†’ DEEP_WORK
"Team standup" β†’ ADMIN_WORK
"Learn Python course" β†’ LEARN
"Gym block" β†’ EXERCISE
"Journaling" β†’ MEDITATE / ME_TIME
"Dinner with family" β†’ FAMILY_TIME
"Team social event" β†’ SOCIALIZE
"No meeting block" β†’ ME_TIME
"Sleep / recovery day" β†’ SLEEP
```
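A minimal sketch of such a keyword-rule classifier (the rule table, fallback, and function name
are illustrative, not the deployed logic):

```python
# Keyword rules mapping calendar event titles to action types; ambiguous
# titles could fall back to a cheap LLM call instead of the default below.
KEYWORD_RULES = [
    (("gym", "run", "workout"), "EXERCISE"),
    (("standup", "sync", "email"), "ADMIN_WORK"),
    (("course", "learn", "tutorial"), "LEARN"),
    (("journal", "meditat"), "MEDITATE"),
    (("family", "date night", "dinner with"), "FAMILY_TIME"),
    (("team social", "drinks", "party"), "SOCIALIZE"),
    (("no meeting",), "ME_TIME"),
    (("sleep", "recovery day"), "SLEEP"),
]

def classify_event(title: str) -> str:
    """Map a calendar event title to one of the action types the model understands."""
    lowered = title.lower()
    for keywords, action in KEYWORD_RULES:
        if any(k in lowered for k in keywords):
            return action
    return "DEEP_WORK"  # untitled or unmatched work blocks default to deep work
```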
### Accept / Ignore as the Reward Signal
Every time the assistant makes a recommendation and the user acts on it:
- **Accept** → "you read me right"
- **Ignore / Reschedule** → "you got something wrong about me"
Over hundreds of these micro-interactions, the model builds a precise picture of who this
person is: not the person they describe themselves to be, but the person revealed by
their actual responses.
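In code, this is just a scalar mapping back onto the reward the model saw in training (the
names here are placeholders, not the app's actual interface):

```python
# Reduce each user response to the same +1 / -1 reward used during RL training.
RESPONSE_REWARD = {
    "accept": +1.0,      # "you read me right"
    "ignore": -1.0,      # "you got something wrong about me"
    "reschedule": -1.0,  # treated like ignore: the type or timing was off
}

def reward_from_response(response: str) -> float:
    """Map an Accept / Ignore / Reschedule tap to the scalar reward signal."""
    return RESPONSE_REWARD[response.lower()]
```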
---
## Are Weights Updated in Real Time?
No. In training, GRPO runs gradient updates after every batch, so the model's weights change
continuously. In production, the deployed model is **frozen**. Personalization happens
in the context window, not in the weights.
### Phase 1: In-Context Adaptation (no gradient updates)
The frozen model reads the user's history as part of each prompt:
```
System: You are a life management agent...
Recent interactions (last 30 days):
  Mon Morning,   Vitality=0.70 → recommended DEEP_WORK → user: IGNORED
  Mon Afternoon, Vitality=0.70 → recommended EXERCISE  → user: ACCEPTED
  Thu Morning,   Vitality=0.65 → recommended DEEP_WORK → user: IGNORED
  Sat Evening,   Serenity=0.45 → recommended MEDITATE  → user: ACCEPTED
  ...
Current state: Tuesday Morning, Vitality=0.68, Serenity=0.71...
Recommend an action:
```
The model reads the pattern (morning deep work keeps getting rejected, evening recovery
keeps getting accepted), infers a "morning penalty profile", and shifts its recommendations.
No gradient update required. This is in-context learning over the Accept/Ignore history.
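A minimal sketch of how that prompt can be assembled from the logged history (the dataclass
and function names are assumptions for illustration, not the project's actual interfaces):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    when: str         # e.g. "Mon Morning"
    meters: dict      # e.g. {"Vitality": 0.70}
    recommended: str  # e.g. "DEEP_WORK"
    response: str     # "ACCEPTED" or "IGNORED"

def build_prompt(history: list[Interaction], current_state: str) -> str:
    """Render the Accept/Ignore history into the frozen model's context window."""
    lines = [
        "System: You are a life management agent...",
        "Recent interactions (last 30 days):",
    ]
    for it in history:
        meters = ", ".join(f"{k}={v:.2f}" for k, v in it.meters.items())
        lines.append(f"  {it.when}, {meters} -> recommended {it.recommended} -> user: {it.response}")
    lines.append(f"Current state: {current_state}")
    lines.append("Recommend an action:")
    return "\n".join(lines)
```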
### Phase 2: Periodic Offline Fine-Tuning (background, weekly/monthly)
After 3–4 weeks of real data, a lightweight LoRA fine-tuning pass runs in the background
on that specific user's interactions. The updated adapter encodes their preferences into
the weights. This is not real-time; it is a scheduled job, much like a weekly model
refresh. The resulting model is then more accurate from inference step 1, without needing
to re-read 30 days of context every time.
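A rough sketch of what that background job could look like, assuming a Hugging Face causal LM
and the `peft` library; the model name, rank, and output path are placeholders, not the
project's actual training recipe:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen RL-pretrained base model (placeholder name).
base = AutoModelForCausalLM.from_pretrained("rl-pretrained-rhythm-model")

lora_cfg = LoraConfig(
    r=8,                 # small rank: a few weeks of one user's data is tiny
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# ... train `model` on this user's Accept/Ignore interaction log here,
# then save only the per-user adapter, leaving the base weights untouched.
model.save_pretrained("adapters/user_1234")
```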
---
## Does RL Pretraining Help Personalize Faster?
Yes. This is the most important architectural property of the system.
### The Core Insight
The RL training in simulation does not teach the model facts about specific people. It
teaches a **skill**: how to detect a person's hidden preferences from differential responses
to actions.
```
Base model (no RL pretraining):
  User ignores DEEP_WORK 3 mornings in a row.
  Model has no prior. Needs 40–50 samples to converge on a hypothesis.

RL-pretrained model:
  User ignores DEEP_WORK 3 mornings in a row.
  Model immediately forms a hypothesis: "morning penalty profile, likely
  extrovert or late chronotype." Probes with evening DEEP_WORK. Confirms
  in 5–8 interactions.
```
The simulation trained the model on thousands of episodes across three wildly different
profiles. It learned the *structure* of the inference problem: what patterns look like
when someone has a morning penalty, when someone's social tolerance is low, when someone's
vitality recovers from work rather than draining. When it meets a real user, it is not
starting from zero. It has strong priors about what the data means.
### The Formal Analogy
This is essentially **MAML** (Model-Agnostic Meta-Learning) without explicitly calling it
that. MAML trains a model specifically so it can adapt to new tasks with very few gradient
steps. Our RL pretraining does the same thing: the simulation is the meta-training
distribution, each real user is a new task, and the few-shot personalization (in-context
or fine-tune) is the inner-loop adaptation.
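For reference, MAML's meta-objective (Finn et al., 2017) makes the analogy concrete, with θ as
the RL-pretrained weights, each task T_i a real user, and the inner gradient step standing in
for the few-shot personalization:

```latex
% Standard MAML meta-objective; alpha is the inner-loop learning rate.
\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})}
  \mathcal{L}_{\mathcal{T}_i}\bigl(\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta)\bigr)
```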
The sim-to-real transfer works because the skill is structural, not numerical. Whether
"vitality" comes from an HRV formula or a 4am workout log, the model's ability to detect
that "positive actions are underperforming expectations here, and here is how to probe for
the hidden reason" remains valid.
---
## Full Architecture: Three Phases
```
Phase 1 – Foundation (simulation):
  Train on 3 archetypes × 28-step weeks → internalize the inference skill
  Output: frozen RL-pretrained model weights

Phase 2 – Deployment (in-context adaptation, day 1):
  New user → 5–10 Accept/Ignore interactions
  Model infers closest archetype from history in context window
  No weight updates, immediate personalization

Phase 3 – Personalization (periodic fine-tuning, week 4+):
  Collect real Accept/Ignore data
  Small LoRA update on user-specific data
  Far fewer samples needed because Phase 1 built the prior
  User-specific adapter deployed; context window requirement shrinks
```
The RL pretraining writes the chapter on "how to read a person" into the model's weights.
When it meets a real user, it is not starting from zero: it already knows what patterns to
look for and what they mean.