# RhythmEnv: Architecture & Training Flow

Visual deep dive into how the env is structured and how the LLM agent
learns from it. All examples use concrete values (seed=42, sampled
profile, real numbers from the reward calculation).

---
## 1. System: three-layer separation

```
AGENT (Qwen 2.5-3B + LoRA r=8, 4-bit)
  Input:  prompt (state + history)
  Output: "3 7 5 DEEP_WORK"
           │ │ │ └─ action
           │ │ └─── work pref belief
           │ └───── morning pref belief
           └─────── social pref belief
        │
        │  N=8 completions per prompt
        │  (sampling temp=1.5)
        ▼
ORCHESTRATION (TRL GRPOTrainer + Unsloth)
  • Picks 1 prompt from dataset (~3000 rows)
  • Generates 8 completions
  • Calls 4 reward functions in parallel
  • Computes group-relative advantages
  • Backprop on LoRA weights only (~30M)
  • KL constraint to base Qwen (β=0.04)
        │
        │  env.reset(seed) → step(action)
        │  (replay-based reward)
        ▼
ENVIRONMENT (RhythmEnvironment / FastAPI)
  reset(seed=42) → samples profile
  step(action)   → updates 5 meters,
                   returns observation + per-step reward
  Lives at: huggingface.co/spaces/...
```

The agent never imports env code. They communicate via the OpenEnv
HTTP/WebSocket contract: `POST /reset`, `POST /step`, `GET /state`.
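
For reference, a minimal agent-side client against this contract might look like the sketch below. Only the three routes come from this section; the JSON field names, the timeout, and the placeholder URL are illustrative assumptions, not the exact OpenEnv schema.

```python
# Minimal client sketch for the env contract above. The /reset and /step routes are
# from this doc; payload/response field names and the URL are assumptions.
import requests

BASE_URL = "https://<your-space-host>"  # placeholder for the deployed Space endpoint


def reset(seed: int) -> dict:
    """Start a new episode; the env samples its hidden profile from the seed."""
    resp = requests.post(f"{BASE_URL}/reset", json={"seed": seed}, timeout=30)
    resp.raise_for_status()
    return resp.json()


def step(action: str) -> dict:
    """Apply one of the 10 actions and get back an observation + per-step reward."""
    resp = requests.post(f"{BASE_URL}/step", json={"action": action}, timeout=30)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    obs = reset(seed=42)
    for _ in range(28):                 # one episode = one simulated week
        result = step("DEEP_WORK")      # a real agent picks per step from the 10 actions
        if result.get("done"):
            break
```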
---

## 2. One episode = one week (28 steps)

```
ONE EPISODE = 1 WEEK = 28 STEPS

 Day 1 (Mon)             Day 2 (Tue)                 Day 7 (Sun)
┌────┬────┬────┬────┐   ┌────┬────┬────┬────┐      ┌────┬────┬────┬────┐
│ M  │ A  │ E  │ N  │   │ M  │ A  │ E  │ N  │  ..  │ M  │ A  │ E  │ N  │
│ 0  │ 1  │ 2  │ 3  │   │ 4  │ 5  │ 6  │ 7  │      │ 24 │ 25 │ 26 │ 27 │
└────┴────┴────┴────┘   └────┴────┴────┴────┘      └────┴────┴────┴────┘
   │                       │                          │
 reset()                 random event roll           done=True
 meters: V=0.7           (8% per step)               grader runs
         C=0.7 P=0       meditate? deep_work?        final_score [0,1]
         S=0.7 Cn=0.5    sleep? socialize?

At each step the agent picks 1 of 10 actions:

┌───────────────┬────────────┬───────────────┬───────────────┐
│ PRODUCTIVITY  │ RECOVERY   │ SOCIAL        │ LEISURE       │
├───────────────┼────────────┼───────────────┼───────────────┤
│ DEEP_WORK     │ SLEEP      │ FAMILY_TIME   │ ME_TIME       │
│ ADMIN_WORK    │ EXERCISE   │ SOCIALIZE     │ BINGE_WATCH   │
│ LEARN         │ MEDITATE   │               │               │
└───────────────┴────────────┴───────────────┴───────────────┘

Time-of-day multipliers (apply to action effects):
  Morning   (M):  cognition gains × 1.2   vitality drain × 0.8
  Afternoon (A):  cognition gains × 1.0   vitality drain × 1.0
  Evening   (E):  cognition gains × 0.8   vitality drain × 1.1
  Night     (N):  cognition gains × 0.6   vitality drain × 1.3
  (sleep BYPASSES these multipliers)

Critical thresholds:
  Any meter < 0.1 at end of step → -0.30 reward penalty (one per meter)
  Connection decays passively every step (-0.01 to -0.02 per profile)
```
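
A minimal sketch of how the time-of-day multipliers above could be applied, assuming per-action meter deltas are plain dicts. The multiplier values, the gains-vs-drains split, and the SLEEP bypass come from the table; the function and dict names are illustrative.

```python
# Sketch of the time-of-day modifier using the multipliers from the table above.
# The "gains vs drains" split and the SLEEP bypass follow the text; the delta-dict
# layout and the function name are assumptions.
TIME_MULTIPLIERS = {
    "morning":   {"cognition_gain": 1.2, "vitality_drain": 0.8},
    "afternoon": {"cognition_gain": 1.0, "vitality_drain": 1.0},
    "evening":   {"cognition_gain": 0.8, "vitality_drain": 1.1},
    "night":     {"cognition_gain": 0.6, "vitality_drain": 1.3},
}


def apply_time_of_day(action: str, deltas: dict, slot: str) -> dict:
    """Scale an action's raw meter deltas for the current time-of-day slot."""
    if action == "SLEEP":                      # sleep bypasses the multipliers
        return dict(deltas)
    mult = TIME_MULTIPLIERS[slot]
    scaled = dict(deltas)
    if scaled.get("cognition", 0.0) > 0:       # only cognition *gains* are scaled
        scaled["cognition"] *= mult["cognition_gain"]
    if scaled.get("vitality", 0.0) < 0:        # only vitality *drains* are scaled
        scaled["vitality"] *= mult["vitality_drain"]
    return scaled


# Example: the same DEEP_WORK drains less vitality in the morning than at night.
raw = {"vitality": -0.10, "cognition": -0.12, "progress": 0.18}
print(apply_time_of_day("DEEP_WORK", raw, "morning"))   # vitality ≈ -0.08
print(apply_time_of_day("DEEP_WORK", raw, "night"))     # vitality ≈ -0.13
```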
---

## 3. State tree: what the env tracks (and what's hidden)

```
RhythmEnvironment instance state
│
├── _profile   ◄── HIDDEN from agent
│   ├── name: "sampled_42"
│   ├── reward_weights: {V: 0.05, C: 0.05, P: 0.30, S: 0.20, Cn: 0.40}
│   ├── social_vitality_multiplier: 1.8
│   ├── morning_cognition_bonus: 1.5
│   ├── evening_night_cognition_bonus: None
│   ├── morning_penalty: None
│   ├── binge_shame: False
│   ├── progress_serenity_bonus: 0.04
│   ├── idle_serenity_decay: 0.02
│   ├── vitality_decay_rate: 0.01
│   ├── stress_tolerance: 0.22
│   ├── event_impact_multiplier: 0.7
│   ├── connection_decay_rate: 0.012
│   ├── solo_serenity_bonus: 0.05
│   ├── social_connection_multiplier: 1.4
│   ├── social_serenity_bonus: 0.03
│   ├── work_vitality_recovery: 0.03
│   └── (14 hidden parameters total)
│
│   Used INTERNALLY to compute reward and modify action effects.
│   The belief reward rewards INFERRING this profile.
│
│   Reduces to a 3-D BELIEF VECTOR (the inference target):
│     profile_to_belief_vector(profile) → [0.30, 0.70, 0.50]
│                                          = [social_pref, morning_pref, work_pref]
│
├── meters   ◄── visible to agent (observed in the prompt)
│   ├── _vitality:   0.62   (range 0-1)
│   ├── _cognition:  0.51
│   ├── _progress:   0.24
│   ├── _serenity:   0.71
│   └── _connection: 0.38
│
├── _timestep: 5
│
├── _step_history: list[StepRecord]   ◄── visible to agent (last 7 steps)
│     step 0: deep_work → +0.42
│       deltas:    V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00
│       anomalies: V+0.00 C+0.00 P+0.06 S+0.00 Cn+0.00   ◄─ profile fingerprint
│     step 1: sleep → +0.18
│       deltas:    V+0.20 C+0.10 P+0.00 S+0.05 Cn+0.00
│       anomalies: V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00
│     ...
│
└── _step_rewards: [+0.42, +0.18, +0.31, +0.55, ...]
      └── used by the grader to compute adaptation_score (late-half - early-half)
```
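
The per-step `anomalies` row is the observable trace of the hidden profile: the delta the action actually produced minus the delta a neutral baseline profile would have produced. A sketch of that computation follows; the `NEUTRAL_BASELINE` numbers are back-solved from the step 0 / step 2 examples and are purely illustrative, since the env's real baseline table is not shown in this doc.

```python
# Sketch of the "anomalies" row: observed meter delta minus the neutral-baseline
# delta for the same action. NEUTRAL_BASELINE values are illustrative assumptions.
METERS = ["V", "C", "P", "S", "Cn"]

NEUTRAL_BASELINE = {
    "deep_work": {"V": -0.10, "C": -0.12, "P": 0.12, "S": -0.05, "Cn": 0.00},
    "socialize": {"V": -0.06, "C": -0.03, "P": 0.00, "S": 0.04, "Cn": 0.12},
}


def anomalies(action: str, observed: dict) -> dict:
    """Deviation of this person's response from the neutral baseline (profile fingerprint)."""
    base = NEUTRAL_BASELINE[action]
    return {m: round(observed[m] - base[m], 2) for m in METERS}


# Step 0 above: the +0.06 progress anomaly on deep_work is the fingerprint of this
# profile's hidden action-effect parameters.
print(anomalies("deep_work", {"V": -0.10, "C": -0.12, "P": 0.18, "S": -0.05, "Cn": 0.00}))
# -> {'V': 0.0, 'C': 0.0, 'P': 0.06, 'S': 0.0, 'Cn': 0.0}
```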
---

## 4. What the agent sees (concrete prompt)

```
──────────────────────────────── SYSTEM PROMPT ────────────────────────────────
You are a life-management agent helping a person whose preferences are HIDDEN.
Each step, output ONE LINE in this exact format:
    S M W ACTION_NAME
S = social pref (0=hates, 9=loves), M = morning, W = work
ACTION_NAME ∈ {DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,
               FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH}
Example: 3 8 7 DEEP_WORK
Tactics: probe early, exploit late; don't repeat actions; ...
────────────────────────────────────────────────────────────────────────────────

───────────────────────────────── USER PROMPT ─────────────────────────────────
Step: 5/28 (Tuesday Afternoon)
Remaining steps: 22

Meters:
  Vitality:   0.62
  Cognition:  0.51
  Progress:   0.24
  Serenity:   0.71
  Connection: 0.38

Recent history (anom = how this person deviated from neutral baseline):
  step 0: deep_work -> reward +0.42 (V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00)
          [anom V+0.00 C+0.00 P+0.06 S+0.00 Cn+0.00]
  step 1: sleep -> reward +0.18 (V+0.20 C+0.10 P+0.00 S+0.05 Cn+0.00)
          [anom V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00]
  step 2: socialize -> reward -0.05 (V-0.11 C-0.03 P+0.00 S+0.04 Cn+0.17)
          [anom V-0.05 C+0.00 P+0.00 S+0.00 Cn+0.05]   ◄─ strong profile signal
  step 3: meditate -> reward +0.30 (V+0.03 C+0.08 P+0.00 S+0.20 Cn+0.00)
          [anom V+0.00 C+0.00 P+0.00 S+0.05 Cn+0.00]   ◄─ solo bonus visible
  step 4: deep_work -> reward +0.18 (V-0.06 C-0.06 P+0.18 S+0.04 Cn+0.00)
          [anom V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00]

Output your belief, then your action (format: S M W ACTION_NAME):
────────────────────────────────────────────────────────────────────────────────

LLM completion (1 of 8 sampled at temp=1.5):

    "3 7 5 DEEP_WORK"
          │
          ▼
    Parsed: belief = [0.33, 0.78, 0.56]
            action = DEEP_WORK
```
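
Parsing the completion is a one-line regex plus a digit-to-[0,1] mapping. The /9 normalisation below is inferred from the example (3/9 ≈ 0.33, 7/9 ≈ 0.78, 5/9 ≈ 0.56 match the parsed belief); the function name is illustrative.

```python
# Sketch of parsing a completion like "3 7 5 DEEP_WORK" into (belief, action).
import re

ACTIONS = {"DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE", "MEDITATE",
           "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}


def parse_completion(text: str):
    """Return (belief_vector, action) or (None, None) if the line doesn't parse."""
    match = re.search(r"\b([0-9])\s+([0-9])\s+([0-9])\s+([A-Z_]+)\b", text)
    if not match:
        return None, None
    s, m, w, action = match.groups()
    if action not in ACTIONS:
        return None, None
    belief = [round(int(d) / 9, 2) for d in (s, m, w)]   # digits 0..9 -> [0, 1]
    return belief, action


print(parse_completion('"3 7 5 DEEP_WORK"'))
# -> ([0.33, 0.78, 0.56], 'DEEP_WORK')
```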
---

## 5. The reward stack (4 layers, with concrete values for the example above)

```
Completion: "3 7 5 DEEP_WORK"   (for the prompt above: seed=42, step 5)

Layer 1: format_valid
  parses? has belief?   ✓ → +1.0
  × wt 0.05             → +0.05

Layer 2: action_legal
  action ∈ the 10?   DEEP_WORK ✓ → reward 0.0
  × wt 0.05          → 0.00

Layer 3: env_reward   (replay: env.replay(seed=42, step 5, action))
  deltas: V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00

  profile_reward = sum(deltas × prof_weights) × 15
                 = (-0.10×0.05 + -0.12×0.05 + 0.18×0.30 + -0.05×0.20 + 0×0.40) × 15
                 = +0.65
  + grader_bias:              0.5×0.18 + 0.4×0 = +0.09
  + new-action bonus:         +0.07
  + belief-action coupling:   work=0.56 (mid), no bonus = 0
  - cycle penalty:            0
  - repetition penalty:       0

  env_reward = +0.81
  × wt 1.5   = +1.21

Layer 4: belief_accuracy
  true belief (sampled_42) = [0.30, 0.70, 0.50]
  agent belief             = [0.33, 0.78, 0.56]

  MAE = (0.03 + 0.08 + 0.06) / 3 = 0.057   → similarity = 0.943
  baseline (constant 0.5):
  MAE = (0.20 + 0.20 + 0.00) / 3 = 0.133   → similarity = 0.867

  reward   = 0.943 - 0.867 = +0.076
  × wt 3.0 = +0.23

Σ TOTAL REWARD (this completion)
  = 0.05 + 0.00 + 1.21 + 0.23
  = +1.49
```
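
The two numeric workhorses of this stack, the profile-weighted delta reward and the belief-accuracy reward against the constant-0.5 baseline, can be written in a few lines. The sketch below omits the grader bias, new-action bonus, coupling bonus, and the cycle/repetition penalties from the env_reward layer; the weights and the ×15 / ×3.0 scales are the ones shown above, and the function names are illustrative.

```python
# Sketch of the profile-weighted delta reward and the belief-accuracy reward.
METERS = ["V", "C", "P", "S", "Cn"]


def profile_weighted_reward(deltas: dict, profile_weights: dict, scale: float = 15.0) -> float:
    """Core of env_reward: meter deltas weighted by the hidden profile, scaled."""
    return scale * sum(deltas[m] * profile_weights[m] for m in METERS)


def belief_accuracy_reward(agent_belief, true_belief, weight: float = 3.0) -> float:
    """(agent similarity - constant-0.5 baseline similarity) × reward weight."""
    def similarity(belief):
        mae = sum(abs(a - b) for a, b in zip(belief, true_belief)) / len(true_belief)
        return 1.0 - mae
    return weight * (similarity(agent_belief) - similarity([0.5, 0.5, 0.5]))


deltas = {"V": -0.10, "C": -0.12, "P": 0.18, "S": -0.05, "Cn": 0.00}
weights = {"V": 0.05, "C": 0.05, "P": 0.30, "S": 0.20, "Cn": 0.40}
print(profile_weighted_reward(deltas, weights))   # weighted-delta core, before bias/bonus/penalty terms
print(round(belief_accuracy_reward([0.33, 0.78, 0.56], [0.30, 0.70, 0.50]), 2))   # ≈ +0.23
```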
---

## 6. GRPO update step (8 completions → 1 gradient update)

```
ONE TRAINING STEP (one row from the dataset → one gradient update)

pick prompt
  prompt#1247 from dataset:
    state: step 5, seed=42, history=[deep_work, sleep, ...]
        │
        ▼  generate 8 completions @ temp=1.5
  c1: "3 7 5 DEEP_WORK"    → reward +1.49
  c2: "5 5 5 SLEEP"        → reward -0.21  (constant belief: -0.03
                                            + sleep replay: -0.18)
  c3: "3 6 4 ADMIN_WORK"   → reward +1.10
  c4: "4 7 6 LEARN"        → reward +1.32
  c5: "2 8 3 MEDITATE"     → reward +0.45  (repetition penalty: -0.10 since
                                            meditate at step 3 too)
  c6: "3 7 7 DEEP_WORK"    → reward +1.55  (belief slightly better)
  c7: "5 5 5 EXERCISE"     → reward -0.15
  c8: "4 6 5 FAMILY_TIME"  → reward +0.92

  group_mean = +0.81
        │
        ▼  advantages = reward - group_mean
  ADVANTAGE (this is what GRPO actually backprops on)
  c1: +0.68   ◄─ strongly preferred
  c2: -1.02   ◄─ strongly discouraged (constant belief)
  c3: +0.29
  c4: +0.51
  c5: -0.36
  c6: +0.74   ◄─ most preferred
  c7: -0.96
  c8: +0.11

  KEY INSIGHT: only the SPREAD matters. If all 8 had the
  same reward, advantages would all be 0 → no gradient.
  This is why iter 1 mode-collapsed: format_valid +1.0 for
  every completion meant zero variance from that layer.
        │
        ▼  policy loss = -E[ adv × log_prob(completion) ]
                         + β × KL(policy || base_qwen)
        │
        ▼  backprop (only LoRA weights, ~30M params)
  Model nudge:
    push: "3 7 5 DEEP_WORK"-like outputs UP
    push: "3 7 7 DEEP_WORK"-like outputs UP
    pull: "5 5 5 *"-like outputs DOWN

  KL constraint: prevents the policy from diverging too
  far from base Qwen (avoids gibberish drift).

Repeat 800-2000 times. Each step ~3-5 sec on A100.
```
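
A standalone sketch of the group-relative advantage step, using the eight rewards from the box above, just to make the "only the spread matters" point concrete (the real computation happens inside TRL's GRPOTrainer):

```python
# Group-relative advantages: each completion's reward minus the group mean.
rewards = [1.49, -0.21, 1.10, 1.32, 0.45, 1.55, -0.15, 0.92]

group_mean = sum(rewards) / len(rewards)                 # ≈ +0.81
advantages = [round(r - group_mean, 2) for r in rewards]
print(advantages)   # [0.68, -1.02, 0.29, 0.51, -0.36, 0.74, -0.96, 0.11]

# Degenerate case: identical rewards -> zero advantages -> no gradient from this
# prompt. This is the iter-1 failure mode when format_valid paid +1.0 to everyone.
flat = [1.0] * 8
print([r - sum(flat) / len(flat) for r in flat])         # all zeros
```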
---

## 7. The dataset is just starting positions (not labels)

```
DATASET (~3000 rows, generated ONCE before training)

For 200-300 episodes:
    env.reset(seed=N)
    for step in range(28):
        record {
            prompt: [system, user_for_this_state],
            seed: N,                                  ◄── replay metadata
            step_index: <current step>,
            action_history: <actions taken so far>,
            profile_mode: "continuous",
        }
        env.step(rollout_policy(obs))   ◄─ rollout = heuristic OR random
                                           (only matters for STATE diversity;
                                            the agent's training generations
                                            REPLACE these actions)

A row from the dataset:
  {
    prompt: [...full state at step 5 of episode seed=42...],
    seed: 42,
    step_index: 5,
    action_history: ["deep_work", "sleep", "socialize",
                     "meditate", "deep_work"],
    profile_mode: "continuous"
  }

NOTE: NO "correct action" or "label" anywhere.
The reward function reconstructs the env from this metadata
and scores whatever action the LLM picks.
```

This is fundamentally different from supervised learning:
- Supervised: (input, target_output) → model learns to mimic the target
- GRPO: (prompt, replay_metadata) → model learns to maximize reward
---

## 8. Final episode grader (only fires at step 28)

```
_grade_episode()  → runs at done=True, produces final_score ∈ [0, 1]

 component         weight   what it measures                  example   contribution
 ────────────────  ──────   ────────────────────────────────  ───────   ──────────────
 crash_free        × 0.15   1 - crashes/total_ck              0.95      × 0.15 = 0.14
 progress          × 0.20   final P value                     0.42      × 0.20 = 0.084
 connection        × 0.10   final Cn value                    0.51      × 0.10 = 0.05
 adaptation        × 0.25   late-half mean reward - early     +0.18     × 0.25 = 0.045
 efficiency        × 0.10   avg_reward normalized to [0,1]    0.55      × 0.10 = 0.055
 belief_accuracy   × 0.20   1 - MAE vs true profile           0.80      × 0.20 = 0.16

 Σ = 0.14 + 0.084 + 0.05 + 0.045 + 0.055 + 0.16
   = 0.534  → final_score (with inference)

Heuristic / random baselines never call env.record_belief(), so the belief
component scores 0 for them. This is by design: the meta-RL skill is INFERENCE,
and only agents that actually try get credit on this axis.

Plus a sparse terminal reward (added to step 27's per-step reward):
  terminal_bonus = (final_score - 0.5) × 5   → e.g. (0.534 - 0.5) × 5 = +0.17

This means: at step 27, the agent gets its last per-step reward + the bonus from
the grader. This is the only direct gradient signal pointing at actual episode
quality.
```
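
The grade itself is a plain weighted sum plus the sparse terminal bonus. A sketch with the example component values from the table; how each raw component is computed (crash counting, normalisation, MAE) is not reproduced here, and note the table's 0.534 comes from rounding each contribution before summing.

```python
# Weighted-sum grader sketch using the weights and example values from the table above.
GRADER_WEIGHTS = {
    "crash_free": 0.15, "progress": 0.20, "connection": 0.10,
    "adaptation": 0.25, "efficiency": 0.10, "belief_accuracy": 0.20,
}


def grade_episode(components: dict) -> tuple[float, float]:
    """Return (final_score, terminal_bonus); the bonus is added to step 27's reward."""
    final_score = sum(components[name] * w for name, w in GRADER_WEIGHTS.items())
    terminal_bonus = (final_score - 0.5) * 5
    return final_score, terminal_bonus


example = {"crash_free": 0.95, "progress": 0.42, "connection": 0.51,
           "adaptation": 0.18, "efficiency": 0.55, "belief_accuracy": 0.80}
score, bonus = grade_episode(example)
print(score, bonus)   # ≈ 0.5375 and ≈ 0.19 (the table's 0.534 / +0.17 round each contribution first)
```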
---

## 9. Three eval conditions (post-training)

```
inference_eval.py runs ALL THREE:

1) discrete-3-profiles   (3 reference profiles)
   env.reset(seed=N, profile="introvert_morning")
   → 3 hardcoded profiles (from the PROFILES list)
   Each strategy plays each profile 5x = 15 episodes

2) continuous-in-distribution   (was the agent able to learn the meta-policy?)
   env.reset(seed=N) → samples from the training distribution
   → ~10 sampled profiles from seeds 100..110
   Each strategy plays each seed 1x = 10 episodes

3) continuous-OOD   (does the meta-policy generalize?)
   env.reset(seed=10000+N) → samples from a region never seen in training
   → ~10 sampled profiles from seeds 10000..10010
   Each strategy plays each seed 1x = 10 episodes

Strategies tested in every condition:
  • random
  • heuristic
  • model (trained Qwen)

THE KEY METRIC: the trained model's score on continuous-OOD vs the heuristic baseline.
  Heuristic baseline (profile-blind hand rules): score 0.580 on OOD.
  Trained meta-RL agent's target: > 0.580 on OOD.

If the trained agent beats the heuristic on OOD (profiles never seen in
training), that's direct proof it learned the SKILL of profile inference,
not just memorized training profiles.
```
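
A sketch of the evaluation driver this implies. Seed ranges, repeat counts, and strategy names come from the list above; the env and strategy interfaces, plus the two unnamed reference profiles, are placeholders rather than the real `inference_eval.py` code.

```python
# Sketch of the three-condition eval loop; interfaces are assumed, not actual project code.
EVAL_CONDITIONS = {
    "discrete-3-profiles":        {"profiles": ["introvert_morning", "<profile 2>", "<profile 3>"],
                                   "repeats": 5},
    "continuous-in-distribution": {"seeds": list(range(100, 110)),     "repeats": 1},
    "continuous-OOD":             {"seeds": list(range(10000, 10010)), "repeats": 1},
}
STRATEGIES = ["random", "heuristic", "model"]            # model = the trained Qwen policy


def run_condition(env, strategy, config: dict) -> list[float]:
    """Play every configured episode for one (condition, strategy) pair; return final scores."""
    episodes = ([(None, p) for p in config["profiles"]] if "profiles" in config
                else [(s, None) for s in config["seeds"]]) * config["repeats"]
    scores = []
    for seed, profile in episodes:
        obs = env.reset(seed=seed, profile=profile)      # assumed reset signature
        done, info = False, {}
        while not done:
            obs, reward, done, info = env.step(strategy.act(obs))   # assumed step signature
        scores.append(info.get("final_score", 0.0))      # grader output at done=True
    return scores
```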
---

## 10. Sim → Real mapping (why this POC matters)

The whole point of training in this env is that the SKILL transfers. The
mapping from simulated signals to production signals is direct: same
information, different source.

```
SIMULATION (training)                               REAL PRODUCT (deployment)

meter: Vitality   = 0.62   ──────────────────────►  HRV from watch + sleep score
meter: Cognition  = 0.51   ──────────────────────►  focus app metrics + screen time
meter: Progress   = 0.24   ──────────────────────►  calendar density + task completion
meter: Serenity   = 0.71   ──────────────────────►  HRV variability + voice sentiment
meter: Connection = 0.38   ──────────────────────►  call/message freq + calendar

hidden profile (sampled per episode)  ────────────►  the real user's actual personality
  └─ never visible to agent                            └─ also never visible

agent emits: "3 7 5 DEEP_WORK"
  ├─ "3 7 5"     = belief about user (hidden internal) ──►  agent's internal user model
  └─ "DEEP_WORK" = action choice  ────────────────────────►  recommendation to user
                                                             ("I suggest a focus block now")

per-step env_reward  ─────────────────────────────►  meter trend + accept/ignore tap
  └─ "did meters improve under profile weights?"       └─ "did user accept and benefit?"

episode = 1 week (28 steps)  ─────────────────────►  rolling weekly window

3 eval conditions                                    USER ARRIVES (cold start)
  ├─ discrete-3 (memorization check)                   ├─ agent has zero info about them
  ├─ continuous-in-dist (training distribution)        ├─ ~5-10 interactions to converge
  └─ continuous-OOD ───── KEY ────────────────────►    │    on a confident belief vector
       agent must infer a profile NEVER seen in        └─ acts on belief, learns from
       training. THIS is the production scenario.          tap responses

SUCCESS METRIC IN BOTH WORLDS: how fast does the agent personalize?
  Sim:  belief_accuracy curve over a single episode (does it climb?)
        adaptation_score (late-half mean reward > early-half mean reward?)
  Real: how many interactions until the user's accept-rate stabilizes high?
        how quickly does the agent stop suggesting things the user always ignores?

Both reduce to the same skill: form a model of this person from limited
observation, then act on it.
```

**Design choices that explicitly preserve the passive-only constraint:**

- No "ASK USER" action → the agent never quizzes the user
- No profile feature in the observation at eval time → the agent must infer it
- No explicit "what's your goal?" prompt → only sensor-equivalent meters
- Belief output is INTERNAL (visible in the POC only for the training reward)
- The reward formula uses only quantities computable from passive signals

If the agent learns to do well on continuous-OOD (profiles never seen in
training), it has acquired the skill of "figure out a new person from
observation alone", which is exactly the skill the production assistant needs
when meeting a real user for the first time.
---

## 11. Spend & timing (concrete)

```
HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)

FAST_MODE (200-800 steps):
  dataset gen:    ~2 min
  model load:     ~3 min
  training:       ~10-25 min (depends on steps)
  eval:           ~3 min
  plot + upload:  ~2 min
  ──────────────────────────
  total: 20-35 min ($0.80-1.50 per iter)

FULL RUN (2000 steps):
  dataset gen:    ~3 min
  model load:     ~3 min
  training:       ~60-90 min
  eval:           ~3 min
  plot + upload:  ~2 min
  ──────────────────────────
  total: 70-100 min ($3-4)

Iter 1 (200 steps):   $0.50   → mode collapse (single action)
Iter 2 (400 steps):   $1.50   → mode collapse (2-cycle)
Iter 3 (800 steps):   $5      ⏳ in flight (control)
Iter 4 (800 steps):   $5      ⏳ in flight (with full fixes)
Final (2000 steps):   $4      ⏳ pending iter 3+4 results
                      ─────
                      ~$16 of $30 budget
```