docs: add explicit sim-to-real mapping diagram (vision alignment)
The architecture doc was purely technical. Added a new section 10
'Sim -> Real mapping' that shows side-by-side how each simulated signal
maps to its production analog:
- meters <-> wearable HRV, calendar, screen time, voice sentiment
- profile <-> the real user's actual personality (also hidden)
- belief <-> agent's internal user model (visible in POC only)
- action <-> recommendation to user
- reward <-> meter trend + accept/ignore taps
- episode <-> rolling weekly window
- OOD eval <-> cold start with new user
Plus an explicit list of design choices that preserve the passive-only
constraint (no ASK action, no profile feature at eval, belief is internal).
The KEY connection: continuous-OOD eval = production cold-start scenario.
If the agent personalizes to unseen profiles in sim, it has acquired the
skill the real product needs when meeting a new user.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
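
The belief and action rows in the mapping above correspond to a single line of agent output, which the diff's diagram shows as "3 7 5 DEEP_WORK". The following is a minimal sketch of splitting such a line, assuming the format is a fixed number of integer belief values followed by one action token. The names `AgentOutput` and `parse_agent_output`, and the default belief length of 3, are hypothetical and not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentOutput:
    """Hypothetical container for one agent emission."""
    belief: tuple[int, ...]  # e.g. (3, 7, 5); internal user model, hidden in production
    action: str              # e.g. "DEEP_WORK"; surfaced as a recommendation


def parse_agent_output(text: str, belief_len: int = 3) -> AgentOutput:
    """Split a line like '3 7 5 DEEP_WORK' into belief values and an action name.

    Assumes whitespace-separated tokens: `belief_len` integers, then one action.
    """
    tokens = text.split()
    if len(tokens) != belief_len + 1:
        raise ValueError(
            f"expected {belief_len} belief values and one action, got {text!r}"
        )
    # int() raises ValueError on a non-numeric belief token
    belief = tuple(int(tok) for tok in tokens[:belief_len])
    return AgentOutput(belief=belief, action=tokens[belief_len])
```

Under these assumptions, `parse_agent_output("3 7 5 DEEP_WORK")` yields a belief of `(3, 7, 5)` and the action `"DEEP_WORK"`; only the action would be shown to a real user.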
- docs/architecture.md +63 -1
@@ -464,7 +464,69 @@ This is the only direct gradient signal pointing at the actual episode quality.

 ---

-## 10.
+## 10. Sim → Real mapping (why this POC matters)
+
+The whole point of training in this env is that the SKILL transfers. The
+mapping from simulated signals to production signals is direct: same
+information, different source.
+
+```
+┌───────────────────────── SIMULATION (training) ────────┐ ┌──── REAL PRODUCT (deployment) ───────┐
+│                                                        │ │                                      │
+│ meter: Vitality   = 0.62                            ───┼─► HRV from watch + sleep score         │
+│ meter: Cognition  = 0.51                            ───┼─► focus app metrics + screen time      │
+│ meter: Progress   = 0.24                            ───┼─► calendar density + task completion   │
+│ meter: Serenity   = 0.71                            ───┼─► HRV variability + voice sentiment    │
+│ meter: Connection = 0.38                            ───┼─► call/message freq + calendar         │
+│                                                        │ │                                      │
+│ hidden profile (sampled per episode)                ───┼─► the real user's actual personality   │
+│   └─ never visible to agent                            │ │   └─ also never visible              │
+│                                                        │ │                                      │
+│ agent emits: "3 7 5 DEEP_WORK"                         │ │                                      │
+│   ├─ "3 7 5" = belief about user (hidden internal)  ───┼─► agent's internal user model          │
+│   └─ "DEEP_WORK" = action choice                    ───┼─► recommendation to user               │
+│                                                        │ │   ("I suggest a focus block now")    │
+│                                                        │ │                                      │
+│ per-step env_reward                                 ───┼─► meter trend + accept/ignore tap      │
+│   └─ "did meters improve under profile weights?"       │ │   └─ "did user accept and benefit?"  │
+│                                                        │ │                                      │
+│ episode = 1 week (28 steps)                         ───┼─► rolling weekly window                │
+│                                                        │ │                                      │
+│ 3 eval conditions                                      │ │ USER ARRIVES (cold start)            │
+│   ├─ discrete-3 (memorization check)                   │ │   ├─ Agent has zero info about them  │
+│   ├─ continuous-in-dist (training-distrib)             │ │   ├─ ~5-10 interactions to converge  │
+│   └─ continuous-OOD ─── KEY ──►                     ───┼─►      on a confident belief vector    │
+│      agent must infer profile NEVER seen in            │ │   └─ Acts on belief, learns from     │
+│      training. THIS is the production scenario.        │ │      tap responses                   │
+└────────────────────────────────────────────────────────┘ └──────────────────────────────────────┘
+
+SUCCESS METRIC IN BOTH WORLDS: how fast does the agent personalize?
+
+  Sim:  belief_accuracy curve over a single episode (does it climb?)
+        adaptation_score (late-half mean reward > early-half mean reward?)
+
+  Real: how many interactions until user's accept-rate stabilizes high?
+        How quickly does the agent stop suggesting things the user always ignores?
+
+Both reduce to the same skill: form a model of this person from limited
+observation, then act on it.
+```
+
+**Design choices that explicitly preserve the passive-only constraint:**
+- ❌ No "ASK USER" action → agent never quizzes
+- ❌ No profile feature in observation at eval time → agent must infer
+- ❌ No explicit "what's your goal?" prompt → only sensor-equivalent meters
+- ✅ Belief output is INTERNAL (visible in POC only for training reward)
+- ✅ Reward formula uses only quantities computable from passive signals
+
+If the agent learns to do well on continuous-OOD (profiles never seen in
+training), it has acquired the skill of "figure out a new person from
+observation alone" - exactly the skill the production assistant needs
+when meeting a real user for the first time.
+
+---
+
+## 11. Spend & timing (concrete)

 ```
 HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)
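
The section 10 diagram in this diff describes `adaptation_score` only as a comparison of late-half against early-half mean reward within an episode. The sketch below is one way to read that description, not the repository's implementation: the function signature, the midpoint split, and returning the difference rather than a boolean are all assumptions.

```python
from statistics import fmean
from typing import Sequence


def adaptation_score(step_rewards: Sequence[float]) -> float:
    """Late-half mean reward minus early-half mean reward for one episode.

    A positive value means the agent earned more per step late in the episode
    than early, which is the signal the diagram's "late-half > early-half?"
    question asks about. Splitting at the midpoint is an assumption.
    """
    if len(step_rewards) < 2:
        raise ValueError("need at least two steps to compare halves")
    mid = len(step_rewards) // 2  # 14 for the 28-step episode in the diagram
    return fmean(step_rewards[mid:]) - fmean(step_rewards[:mid])


# Hypothetical 28-step episode: reward 0.0 for the first week-half, 1.0 after.
episode = [0.0] * 14 + [1.0] * 14
score = adaptation_score(episode)  # positive: reward improved over the episode
```

A flat reward trace gives a score of zero under this definition, so a positive score on continuous-OOD profiles would indicate within-episode personalization rather than a good prior alone.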