Spaces: InosLihka/rhythm_env
InosLihka Claude Opus 4.7 (1M context) committed 14 days ago

Commit 63216a8 (1 parent: 4c69214)

docs: add explicit sim-to-real mapping diagram (vision alignment)

The architecture doc was purely technical. Added a new section 10
'Sim -> Real mapping' that shows side-by-side how each simulated signal
maps to its production analog:

- meters <-> wearable HRV, calendar, screen time, voice sentiment
- profile <-> the real user's actual personality (also hidden)
- belief <-> agent's internal user model (visible in POC only)
- action <-> recommendation to user
- reward <-> meter trend + accept/ignore taps
- episode <-> rolling weekly window
- OOD eval <-> cold start with new user

Plus an explicit list of design choices that preserve the passive-only
constraint (no ASK action, no profile feature at eval, belief is internal).

The KEY connection: continuous-OOD eval = production cold-start scenario.
If the agent personalizes to unseen profiles in sim, it has acquired the
skill the real product needs when meeting a new user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)

docs/architecture.md +63 -1


@@ -464,7 +464,69 @@ This is the only direct gradient signal pointing at the actual episode quality.
 ---
-## 10. Spend & timing (concrete)
 ```
 HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)

 ---
+## 10. Sim → Real mapping (why this POC matters)
+The whole point of training in this env is that the SKILL transfers. The
+mapping from simulated signals to production signals is direct: same
+information, different source.
+```
+┌─────────────────────── SIMULATION (training) ───────┐  ┌──── REAL PRODUCT (deployment) ───────┐
+│                                                     │  │                                       │
+│  meter: Vitality = 0.62                             │──┼─►  HRV from watch + sleep score      │
+│  meter: Cognition = 0.51                            │──┼─►  focus app metrics + screen time   │
+│  meter: Progress = 0.24                             │──┼─►  calendar density + task completion│
+│  meter: Serenity = 0.71                             │──┼─►  HRV variability + voice sentiment │
+│  meter: Connection = 0.38                           │──┼─►  call/message freq + calendar      │
+│                                                     │  │                                       │
+│  hidden profile (sampled per episode)               │──┼─►  the real user's actual personality│
+│    └─ never visible to agent                        │  │     └─ also never visible             │
+│                                                     │  │                                       │
+│  agent emits: "3 7 5 DEEP_WORK"                     │  │                                       │
+│    ├─ "3 7 5" = belief about user (hidden internal) │──┼─►  agent's internal user model       │
+│    └─ "DEEP_WORK" = action choice                   │──┼─►  recommendation to user            │
+│                                                     │  │     ("I suggest a focus block now")  │
+│                                                     │  │                                       │
+│  per-step env_reward                                │──┼─►  meter trend + accept/ignore tap   │
+│    └─ "did meters improve under profile weights?"   │  │     └─ "did user accept and benefit?"│
+│                                                     │  │                                       │
+│  episode = 1 week (28 steps)                        │──┼─►  rolling weekly window             │
+│                                                     │  │                                       │
+│  3 eval conditions                                  │  │   USER ARRIVES (cold start)          │
+│    ├─ discrete-3 (memorization check)               │  │   ├─ Agent has zero info about them  │
+│    ├─ continuous-in-dist (training-distrib)         │  │   ├─ ~5-10 interactions to converge  │
+│    └─ continuous-OOD ◄── KEY ──►                    │──┼─►   on a confident belief vector     │
+│       agent must infer profile NEVER seen in        │  │   └─ Acts on belief, learns from     │
+│       training. THIS is the production scenario.    │  │      tap responses                   │
+└─────────────────────────────────────────────────────┘  └───────────────────────────────────────┘
+SUCCESS METRIC IN BOTH WORLDS: how fast does the agent personalize?
+Sim:  belief_accuracy curve over a single episode (does it climb?)
+      adaptation_score (late-half mean reward > early-half mean reward?)
+Real: how many interactions until user's accept-rate stabilizes high?
+      How quickly does the agent stop suggesting things the user always ignores?
+Both reduce to the same skill: form a model of this person from limited
+observation, then act on it.
+```
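A minimal sketch of the sim-side `adaptation_score` named in the diagram, assuming it is computed as the late-half mean reward minus the early-half mean reward over one episode. The diagram only states the comparison (late-half mean > early-half mean); the signed difference, the function signature, and the example rewards are assumptions made for illustration.

```python
from statistics import mean


def adaptation_score(rewards):
    """Late-half mean reward minus early-half mean reward.

    Assumed definition: a positive value means the agent earned more in the
    second half of the episode than in the first, i.e. it adapted.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two per-step rewards")
    mid = len(rewards) // 2  # for a 28-step episode: steps 0-13 vs 14-27
    return mean(rewards[mid:]) - mean(rewards[:mid])


# Hypothetical 28-step episode: flat early rewards, higher late rewards.
episode_rewards = [0.2] * 14 + [0.6] * 14
score = adaptation_score(episode_rewards)
adapted = score > 0  # the yes/no question the diagram asks
```

The real-product analog would replace `rewards` with a per-interaction signal such as accept/ignore taps, which is the substitution the mapping above describes.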
+**Design choices that explicitly preserve the passive-only constraint:**
+- ❌ No "ASK USER" action — agent never quizzes
+- ❌ No profile feature in observation at eval time — agent must infer
+- ❌ No explicit "what's your goal?" prompt — only sensor-equivalent meters
+- ✅ Belief output is INTERNAL (visible in POC only for training reward)
+- ✅ Reward formula uses only quantities computable from passive signals
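The split between internal belief and user-facing action can be sketched as follows. This assumes the agent's output is whitespace-separated, with integer belief components followed by a single action token, as in the `"3 7 5 DEEP_WORK"` example from the diagram. `AgentOutput`, `parse_agent_output`, and `user_facing` are hypothetical names, not identifiers from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentOutput:
    belief: tuple[int, ...]  # internal user model; never shown to the user
    action: str              # the only part surfaced as a recommendation


def parse_agent_output(text: str) -> AgentOutput:
    """Parse e.g. "3 7 5 DEEP_WORK" into belief (3, 7, 5) and action DEEP_WORK."""
    *belief_tokens, action = text.split()  # raises ValueError on empty input
    if not belief_tokens:
        raise ValueError("expected belief values before the action token")
    # int() raises ValueError if a belief token is not an integer
    return AgentOutput(tuple(int(t) for t in belief_tokens), action)


def user_facing(output: AgentOutput) -> str:
    """Return only what the user would see; the belief stays internal."""
    return output.action


out = parse_agent_output("3 7 5 DEEP_WORK")
```

In the POC the `belief` field would additionally feed the training reward; at deployment only `user_facing(out)` would leave the agent.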
+If the agent learns to do well on continuous-OOD (profiles never seen in
+training), it has acquired the skill of "figure out a new person from
+observation alone" — exactly the skill the production assistant needs
+when meeting a real user for the first time.
+---
+## 11. Spend & timing (concrete)
 ```
 HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)