InosLihka Claude Opus 4.7 (1M context) committed on
Commit 63216a8 · 1 Parent(s): 4c69214

docs: add explicit sim-to-real mapping diagram (vision alignment)

The architecture doc was purely technical. Added a new section 10
'Sim -> Real mapping' that shows side-by-side how each simulated signal
maps to its production analog:

- meters <-> wearable HRV, calendar, screen time, voice sentiment
- profile <-> the real user's actual personality (also hidden)
- belief <-> agent's internal user model (visible in POC only)
- action <-> recommendation to user
- reward <-> meter trend + accept/ignore taps
- episode <-> rolling weekly window
- OOD eval <-> cold start with new user

Plus an explicit list of design choices that preserve the passive-only
constraint (no ASK action, no profile feature at eval, belief is internal).

The KEY connection: continuous-OOD eval = production cold-start scenario.
If the agent personalizes to unseen profiles in sim, it has acquired the
skill the real product needs when meeting a new user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
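
As a rough illustration of the sim-side quantities this commit describes, the sketch below parses an agent emission such as `"3 7 5 DEEP_WORK"` into a belief vector and an action, and compares late-half and early-half mean reward over one episode. The function names, the difference-of-means form of the score, and the sample rewards are assumptions made for illustration. The diff only describes `adaptation_score` as "late-half mean reward > early-half mean reward?" and does not show the project's actual implementation.

```python
from __future__ import annotations

from statistics import mean


def parse_agent_output(text: str) -> tuple[list[int], str]:
    """Split an emission like '3 7 5 DEEP_WORK' into (belief, action).

    Assumed format: whitespace-separated integer belief values followed
    by a single action token. The real parser may differ.
    """
    *belief_tokens, action = text.split()
    if not belief_tokens:
        raise ValueError(f"no belief values found in {text!r}")
    return [int(tok) for tok in belief_tokens], action


def adaptation_score(rewards: list[float]) -> float:
    """Late-half mean reward minus early-half mean reward.

    A positive value means the agent did better in the second half of
    the episode. The subtraction form is an assumption; the source only
    states the comparison.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two per-step rewards")
    mid = len(rewards) // 2
    return mean(rewards[mid:]) - mean(rewards[:mid])


# Hypothetical 28-step episode: flat rewards early, higher rewards late.
example_rewards = [0.0] * 14 + [1.0] * 14
belief, action = parse_agent_output("3 7 5 DEEP_WORK")
print(belief, action, adaptation_score(example_rewards))
```

In production there is no per-step `env_reward`, so the same comparison would presumably run over whatever proxy the mapping names (meter trend plus accept/ignore taps).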

Files changed (1)
  1. docs/architecture.md +63 -1
docs/architecture.md CHANGED
@@ -464,7 +464,69 @@ This is the only direct gradient signal pointing at the actual episode quality.
 
  ---
 
- ## 10. Spend & timing (concrete)
+ ## 10. Sim → Real mapping (why this POC matters)
+
+ The whole point of training in this env is that the SKILL transfers. The
+ mapping from simulated signals to production signals is direct: same
+ information, different source.
+
+ ```
474
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ SIMULATION (training) ───────┐ β”Œβ”€β”€β”€β”€ REAL PRODUCT (deployment) ───────┐
475
+ β”‚ β”‚ β”‚ β”‚
476
+ β”‚ meter: Vitality = 0.62 │──┼─► HRV from watch + sleep score β”‚
477
+ β”‚ meter: Cognition = 0.51 │──┼─► focus app metrics + screen time β”‚
478
+ β”‚ meter: Progress = 0.24 │──┼─► calendar density + task completionβ”‚
479
+ β”‚ meter: Serenity = 0.71 │──┼─► HRV variability + voice sentiment β”‚
480
+ β”‚ meter: Connection = 0.38 │──┼─► call/message freq + calendar β”‚
481
+ β”‚ β”‚ β”‚ β”‚
482
+ β”‚ hidden profile (sampled per episode) │──┼─► the real user's actual personalityβ”‚
483
+ β”‚ └─ never visible to agent β”‚ β”‚ └─ also never visible β”‚
484
+ β”‚ β”‚ β”‚ β”‚
485
+ β”‚ agent emits: "3 7 5 DEEP_WORK" β”‚ β”‚ β”‚
486
+ β”‚ β”œβ”€ "3 7 5" = belief about user (hidden internal) │──┼─► agent's internal user model β”‚
487
+ β”‚ └─ "DEEP_WORK" = action choice │──┼─► recommendation to user β”‚
488
+ β”‚ β”‚ β”‚ ("I suggest a focus block now") β”‚
489
+ β”‚ β”‚ β”‚ β”‚
490
+ β”‚ per-step env_reward │──┼─► meter trend + accept/ignore tap β”‚
491
+ β”‚ └─ "did meters improve under profile weights?" β”‚ β”‚ └─ "did user accept and benefit?"β”‚
492
+ β”‚ β”‚ β”‚ β”‚
493
+ β”‚ episode = 1 week (28 steps) │──┼─► rolling weekly window β”‚
494
+ β”‚ β”‚ β”‚ β”‚
495
+ β”‚ 3 eval conditions β”‚ β”‚ USER ARRIVES (cold start) β”‚
496
+ β”‚ β”œβ”€ discrete-3 (memorization check) β”‚ β”‚ β”œβ”€ Agent has zero info about them β”‚
497
+ β”‚ β”œβ”€ continuous-in-dist (training-distrib) β”‚ β”‚ β”œβ”€ ~5-10 interactions to converge β”‚
498
+ β”‚ └─ continuous-OOD ◄── KEY ──► │──┼─► on a confident belief vector β”‚
499
+ β”‚ agent must infer profile NEVER seen in β”‚ β”‚ └─ Acts on belief, learns from β”‚
500
+ β”‚ training. THIS is the production scenario. β”‚ β”‚ tap responses β”‚
501
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
502
+
503
+ SUCCESS METRIC IN BOTH WORLDS: how fast does the agent personalize?
504
+
505
+ Sim: belief_accuracy curve over a single episode (does it climb?)
506
+ adaptation_score (late-half mean reward > early-half mean reward?)
507
+
508
+ Real: how many interactions until user's accept-rate stabilizes high?
509
+ How quickly does the agent stop suggesting things the user always ignores?
510
+
511
+ Both reduce to the same skill: form a model of this person from limited
512
+ observation, then act on it.
513
+ ```
514
+
515
+ **Design choices that explicitly preserve the passive-only constraint:**
516
+ - ❌ No "ASK USER" action β€” agent never quizzes
517
+ - ❌ No profile feature in observation at eval time β€” agent must infer
518
+ - ❌ No explicit "what's your goal?" prompt β€” only sensor-equivalent meters
519
+ - βœ… Belief output is INTERNAL (visible in POC only for training reward)
520
+ - βœ… Reward formula uses only quantities computable from passive signals
521
+
522
+ If the agent learns to do well on continuous-OOD (profiles never seen in
523
+ training), it has acquired the skill of "figure out a new person from
524
+ observation alone" β€” exactly the skill the production assistant needs
525
+ when meeting a real user for the first time.
526
+
527
+ ---
528
+
529
+ ## 11. Spend & timing (concrete)
530
 
531
  ```
532
  HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)