InosLihka Claude Opus 4.7 (1M context) committed on
Commit
4c69214
·
1 Parent(s): 105973d

docs: detailed architecture diagrams with concrete values


10 visual diagrams (mostly ASCII trees and box layouts) walking through:

1. Three-layer system separation (agent / orchestration / env)
2. Episode timeline (28 steps = 7 days x 4 slots)
3. State tree (hidden profile + visible meters + history with anomalies)
4. What the agent sees (concrete prompt + completion example)
5. Reward stack with worked example: '3 7 5 DEEP_WORK' for seed=42 step 5
- Layer-by-layer reward computation with actual numbers
- Final reward = +1.49 (sum of 4 weighted layers)
6. GRPO update step (8 completions -> advantages -> backprop)
- Shows why mode collapse happens when reward_std=0
7. Dataset structure (replay metadata, NOT supervised labels)
8. Final episode grader (5 components weighted 0.20/0.25/0.15/0.30/0.10)
9. Three eval conditions (discrete-3, in-dist, OOD)
10. Spend & timing per iter

Every diagram uses concrete values from a real example (sampled_42 profile,
specific belief vector, real meter values, real reward calculations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. docs/architecture.md +497 -0
docs/architecture.md ADDED
@@ -0,0 +1,497 @@
1
+ # RhythmEnv β€” Architecture & Training Flow
2
+
3
+ Visual deep dive into how the env is structured and how the LLM agent
4
+ learns from it. All examples use concrete values (seed=42, sampled
5
+ profile, real numbers from the reward calculation).
6
+
7
+ ---
8
+
9
+ ## 1. System: three-layer separation
10
+
11
+ ```
12
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
13
+ β”‚ AGENT (Qwen 2.5-3B + LoRA r=16, 4-bit) β”‚
14
+ β”‚ β”‚
15
+ β”‚ Input: prompt (state + history) β”‚
16
+ β”‚ Output: "3 7 5 DEEP_WORK" β”‚
17
+ β”‚ ↑ ↑ ↑ ↑ β”‚
18
+ β”‚ social|morn|work action β”‚
19
+ β”‚ belief|pref|pref β”‚
20
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
21
+ β”‚
22
+ β”‚ N=8 completions per prompt
23
+ β”‚ (sampling temp=1.5)
24
+ β–Ό
25
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
26
+ β”‚ ORCHESTRATION (TRL GRPOTrainer + Unsloth) β”‚
27
+ β”‚ β”‚
28
+ β”‚ β€’ Picks 1 prompt from dataset (~3000 rows) β”‚
29
+ β”‚ β€’ Generates 8 completions β”‚
30
+ β”‚ β€’ Calls 4 reward functions in parallel β”‚
31
+ β”‚ β€’ Computes group-relative advantages β”‚
32
+ β”‚ β€’ Backprop on LoRA weights only (~30M) β”‚
33
+ β”‚ β€’ KL constraint to base Qwen (Ξ²=0.04) β”‚
34
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
35
+ β”‚
36
+ β”‚ env.reset(seed) β†’ step(action)
37
+ β”‚ (replay-based reward)
38
+ β–Ό
39
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
40
+ β”‚ ENVIRONMENT (RhythmEnvironment / FastAPI) β”‚
41
+ β”‚ β”‚
42
+ β”‚ reset(seed=42) β†’ samples profile β”‚
43
+ β”‚ step(action) β†’ updates 5 meters, β”‚
44
+ β”‚ returns observation β”‚
45
+ β”‚ + per-step reward β”‚
46
+ β”‚ β”‚
47
+ β”‚ Lives at: huggingface.co/spaces/... β”‚
48
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
49
+ ```
50
+
51
+ The agent never imports env code; agent and environment communicate via the OpenEnv
52
+ HTTP/WebSocket contract: `POST /reset`, `POST /step`, `GET /state`.
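+
+ A minimal client sketch of that contract (Python + `requests`). The three routes come
+ from this doc; the payload fields, the placeholder URL, and the helper names are
+ illustrative assumptions, not the repo's actual client.
+
+ ```python
+ import requests
+
+ BASE = "https://huggingface.co/spaces/..."   # hypothetical; the real Space path is elided above
+
+ def reset(seed: int) -> dict:
+     # Start a new episode; the env samples a hidden profile from the seed.
+     return requests.post(f"{BASE}/reset", json={"seed": seed}).json()
+
+ def step(action: str) -> dict:
+     # Apply one action; the env returns the next observation and per-step reward.
+     return requests.post(f"{BASE}/step", json={"action": action}).json()
+
+ obs = reset(seed=42)
+ for _ in range(28):                  # one episode = 28 steps
+     result = step("DEEP_WORK")       # a real agent would pick the action from obs
+     if result.get("done"):
+         break
+ ```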
53
+
54
+ ---
55
+
56
+ ## 2. One episode = one week (28 steps)
57
+
58
+ ```
59
+ ONE EPISODE = 1 WEEK = 28 STEPS
60
+
61
+ Day 1 (Mon) Day 2 (Tue) ... Day 7 (Sun)
62
+ β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
63
+ β”‚ M β”‚ A β”‚ E β”‚ N β”‚ β”‚ M β”‚ A β”‚ E β”‚ N β”‚ .. β”‚ M β”‚ A β”‚ E β”‚ N β”‚
64
+ β”‚ 0 β”‚ 1 β”‚ 2 β”‚ 3 β”‚ β”‚ 4 β”‚ 5 β”‚ 6 β”‚ 7 β”‚ β”‚ 24 β”‚ 25 β”‚ 26 β”‚ 27 β”‚
65
+ β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
66
+ ↑ ↑ ↑
67
+ reset() random event roll (8% per step) done=True
68
+ meters: V=0.7 meditate? deep_work? grader runs
69
+ C=0.7 P=0 S=0.7 sleep? socialize? final_score [0,1]
70
+ Cn=0.5
71
+
72
+ At each step the agent picks 1 of 10 actions:
73
+
74
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
75
+ β”‚ PRODUCTIVITY β”‚ RECOVERY β”‚ SOCIAL β”‚ LEISURE β”‚
76
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
77
+ β”‚ DEEP_WORK β”‚ SLEEP β”‚ FAMILY_TIME β”‚ ME_TIME β”‚
78
+ β”‚ ADMIN_WORK β”‚ EXERCISE β”‚ SOCIALIZE β”‚ BINGE_WATCH β”‚
79
+ β”‚ LEARN β”‚ MEDITATE β”‚ β”‚ β”‚
80
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
81
+
82
+ Time-of-day multipliers (apply to action effects):
83
+ Morning (M): cognition gains Γ— 1.2 vitality drain Γ— 0.8
84
+ Afternoon (A): cognition gains Γ— 1.0 vitality drain Γ— 1.0
85
+ Evening (E): cognition gains Γ— 0.8 vitality drain Γ— 1.1
86
+ Night (N): cognition gains Γ— 0.6 vitality drain Γ— 1.3
87
+ (sleep BYPASSES these multipliers)
88
+
89
+ Critical thresholds:
90
+ Any meter < 0.1 at end of step β†’ -0.30 reward penalty (one per meter)
91
+ Connection decays passively every step (-0.01 to -0.02 per profile)
92
+ ```
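+
+ A small sketch of how the multiplier table and the critical-threshold penalty could be
+ applied. The numbers are from the table above; the function names and the delta
+ dictionary format are illustrative assumptions.
+
+ ```python
+ SLOT_MULTIPLIERS = {            # slot -> (cognition_gain_mult, vitality_drain_mult)
+     "M": (1.2, 0.8),
+     "A": (1.0, 1.0),
+     "E": (0.8, 1.1),
+     "N": (0.6, 1.3),
+ }
+
+ def apply_slot(deltas: dict, slot: str, action: str) -> dict:
+     """Scale an action's raw meter deltas by the current slot (SLEEP bypasses this)."""
+     if action == "SLEEP":
+         return dict(deltas)
+     cog_mult, vit_mult = SLOT_MULTIPLIERS[slot]
+     out = dict(deltas)
+     if out.get("C", 0) > 0:
+         out["C"] *= cog_mult     # cognition gains scaled up in the morning, down at night
+     if out.get("V", 0) < 0:
+         out["V"] *= vit_mult     # vitality drain cheaper in the morning, harsher at night
+     return out
+
+ def crash_penalty(meters: dict) -> float:
+     # -0.30 for every meter that ends the step below the 0.1 critical threshold.
+     return -0.30 * sum(1 for v in meters.values() if v < 0.1)
+ ```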
93
+
94
+ ---
95
+
96
+ ## 3. State tree: what the env tracks (and what's hidden)
97
+
98
+ ```
99
+ RhythmEnvironment instance state
100
+ β”‚
101
+ β”œβ”€β”€ _profile ◄──── HIDDEN from agent ─────────┐
102
+ β”‚ β”‚ β”‚
103
+ β”‚ β”œβ”€β”€ name: "sampled_42" β”‚
104
+ β”‚ β”œβ”€β”€ reward_weights: β”‚
105
+ β”‚ β”‚ {V: 0.05, C: 0.05, P: 0.30, β”‚
106
+ β”‚ β”‚ S: 0.20, Cn: 0.40} β”‚
107
+ β”‚ β”‚ β”‚ Used INTERNALLY
108
+ β”‚ β”œβ”€β”€ social_vitality_multiplier: 1.8 β”‚ to compute reward
109
+ β”‚ β”œβ”€β”€ morning_cognition_bonus: 1.5 β”‚ and modify action
110
+ β”‚ β”œβ”€β”€ evening_night_cognition_bonus: None β”‚ effects.
111
+ β”‚ β”œβ”€β”€ morning_penalty: None β”‚ Belief reward
112
+ β”‚ β”œβ”€β”€ binge_shame: False β”‚ rewards INFERRING
113
+ β”‚ β”œβ”€β”€ progress_serenity_bonus: 0.04 β”‚ this profile.
114
+ β”‚ β”œβ”€β”€ idle_serenity_decay: 0.02 β”‚
115
+ β”‚ β”œβ”€β”€ vitality_decay_rate: 0.01 β”‚
116
+ β”‚ β”œβ”€β”€ stress_tolerance: 0.22 β”‚
117
+ β”‚ β”œβ”€β”€ event_impact_multiplier: 0.7 β”‚
118
+ β”‚ β”œβ”€β”€ connection_decay_rate: 0.012 β”‚
119
+ β”‚ β”œβ”€β”€ solo_serenity_bonus: 0.05 β”‚
120
+ β”‚ β”œβ”€β”€ social_connection_multiplier: 1.4 β”‚
121
+ β”‚ β”œβ”€β”€ social_serenity_bonus: 0.03 β”‚
122
+ β”‚ β”œβ”€β”€ work_vitality_recovery: 0.03 β”‚
123
+ β”‚ └── (14 hidden parameters total) β”‚
124
+ β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
125
+ β”‚
126
+ β”‚ Reduces to a 3-D BELIEF VECTOR (the inference target):
127
+ β”‚ profile_to_belief_vector(profile) β†’ [0.30, 0.70, 0.50]
128
+ β”‚ β”‚ β”‚ β”‚
129
+ β”‚ β”‚ β”‚ work_pref
130
+ β”‚ β”‚ morning_pref
131
+ β”‚ social_pref
132
+ β”‚
133
+ β”œβ”€β”€ meters ◄────── visible to agent ────────────────────────────┐
134
+ β”‚ β”œβ”€β”€ _vitality: 0.62 (range 0–1) β”‚
135
+ β”‚ β”œβ”€β”€ _cognition: 0.51 β”‚
136
+ β”‚ β”œβ”€β”€ _progress: 0.24 β”‚ Agent
137
+ β”‚ β”œβ”€β”€ _serenity: 0.71 β”‚ observes
138
+ β”‚ └── _connection: 0.38 β”‚ in prompt
139
+ β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
140
+ β”‚
141
+ β”œβ”€β”€ _timestep: 5
142
+ β”‚
143
+ β”œβ”€β”€ _step_history: list[StepRecord] ◄── visible to agent ──────┐
144
+ β”‚ β”œβ”€β”€ (last 7 steps) β”‚
145
+ β”‚ β”‚ step 0: deep_work β†’ +0.42 β”‚
146
+ β”‚ β”‚ deltas: V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00 β”‚
147
+ β”‚ β”‚ anomalies: V+0.00 C+0.00 P+0.06 S+0.00 Cn+0.00 ◄─ profile fingerprint
148
+ β”‚ β”‚ step 1: sleep β†’ +0.18 β”‚
149
+ β”‚ β”‚ deltas: V+0.20 C+0.10 P+0.00 S+0.05 Cn+0.00 β”‚
150
+ β”‚ β”‚ anomalies: V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00 β”‚
151
+ β”‚ β”‚ ... β”‚
152
+ β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
153
+ β”‚
154
+ └── _step_rewards: [+0.42, +0.18, +0.31, +0.55, ...]
155
+ └── used by grader to compute adaptation_score (late-half - early-half)
156
+ ```
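+
+ The agent-visible part of that tree, sketched as Python dataclasses. Field names follow
+ the diagram; the exact types and names in the real env may differ.
+
+ ```python
+ from dataclasses import dataclass, field
+
+ METERS = ("V", "C", "P", "S", "Cn")   # vitality, cognition, progress, serenity, connection
+
+ @dataclass
+ class StepRecord:
+     action: str
+     reward: float
+     deltas: dict      # how each meter moved this step
+     anomalies: dict   # delta minus the neutral-baseline delta -> the profile fingerprint
+
+ @dataclass
+ class Observation:
+     timestep: int
+     meters: dict                                  # the 5 visible meters
+     history: list = field(default_factory=list)   # last 7 StepRecords only
+
+ # The hidden profile and its 3-D belief target never appear here: the agent has to infer
+ # [social_pref, morning_pref, work_pref] from the anomaly columns alone.
+ ```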
157
+
158
+ ---
159
+
160
+ ## 4. What the agent sees (concrete prompt)
161
+
162
+ ```
163
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ SYSTEM PROMPT ─────────────────────────────────┐
164
+ β”‚ You are a life-management agent helping a person whose preferences are HIDDEN. β”‚
165
+ β”‚ Each step, output ONE LINE in this exact format: β”‚
166
+ β”‚ S M W ACTION_NAME β”‚
167
+ β”‚ S = social pref (0=hates, 9=loves), M = morning, W = work β”‚
168
+ β”‚ ACTION_NAME ∈ {DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, β”‚
169
+ β”‚ FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH} β”‚
170
+ β”‚ Example: 3 8 7 DEEP_WORK β”‚
171
+ β”‚ Tactics: probe early, exploit late; don't repeat actions; ... β”‚
172
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
173
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ USER PROMPT ────────────────────────────────────┐
174
+ β”‚ Step: 5/28 (Tuesday Afternoon) β”‚
175
+ β”‚ Remaining steps: 22 β”‚
176
+ β”‚ β”‚
177
+ β”‚ Meters: β”‚
178
+ β”‚ Vitality: 0.62 β”‚
179
+ β”‚ Cognition: 0.51 β”‚
180
+ β”‚ Progress: 0.24 β”‚
181
+ β”‚ Serenity: 0.71 β”‚
182
+ β”‚ Connection: 0.38 β”‚
183
+ β”‚ β”‚
184
+ β”‚ Recent history (anom = how this person deviated from neutral baseline): β”‚
185
+ β”‚ step 0: deep_work -> reward +0.42 (V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00) β”‚
186
+ β”‚ [anom V+0.00 C+0.00 P+0.06 S+0.00 Cn+0.00] β”‚
187
+ β”‚ step 1: sleep -> reward +0.18 (V+0.20 C+0.10 P+0.00 S+0.05 Cn+0.00) β”‚
188
+ β”‚ [anom V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00] β”‚
189
+ β”‚ step 2: socialize -> reward -0.05 (V-0.11 C-0.03 P+0.00 S+0.04 Cn+0.17) β”‚
190
+ β”‚ [anom V-0.05 C+0.00 P+0.00 S+0.00 Cn+0.05] ← strong profile signal β”‚
191
+ β”‚ step 3: meditate -> reward +0.30 (V+0.03 C+0.08 P+0.00 S+0.20 Cn+0.00) β”‚
192
+ β”‚ [anom V+0.00 C+0.00 P+0.00 S+0.05 Cn+0.00] ← solo bonus visible β”‚
193
+ β”‚ step 4: deep_work -> reward +0.18 (V-0.06 C-0.06 P+0.18 S+0.04 Cn+0.00) β”‚
194
+ β”‚ [anom V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00] β”‚
195
+ β”‚ β”‚
196
+ β”‚ Output your belief, then your action (format: S M W ACTION_NAME): β”‚
197
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
198
+
199
+ LLM completion (1 of 8 sampled at temp=1.5):
200
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
201
+ β”‚ "3 7 5 DEEP_WORK" β”‚
202
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
203
+ β–Ό
204
+ Parsed: belief = [0.33, 0.78, 0.56]
205
+ action = DEEP_WORK
206
+ ```
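+
+ A parser sketch that reproduces the example ("3 7 5 DEEP_WORK" -> belief [0.33, 0.78, 0.56],
+ action DEEP_WORK). Dividing the 0-9 digits by 9 matches the numbers shown above, but the
+ repo's actual parsing code may differ in details.
+
+ ```python
+ import re
+
+ ACTIONS = {"DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE", "MEDITATE",
+            "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}
+
+ def parse_completion(text: str):
+     """Return (belief, action), or None if the completion is not format-valid."""
+     m = re.search(r"\b(\d)\s+(\d)\s+(\d)\s+([A-Z_]+)", text)
+     if not m:
+         return None
+     belief = [int(d) / 9.0 for d in m.groups()[:3]]   # social, morning, work prefs in [0, 1]
+     action = m.group(4)
+     if action not in ACTIONS:
+         return None
+     return belief, action
+
+ print(parse_completion("3 7 5 DEEP_WORK"))
+ # ([0.333..., 0.777..., 0.555...], 'DEEP_WORK')
+ ```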
207
+
208
+ ---
209
+
210
+ ## 5. The reward stack (4 layers, with concrete values for the example above)
211
+
212
+ ```
213
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
214
+ β”‚ Completion: "3 7 5 DEEP_WORK" β”‚
215
+ β”‚ for prompt above (seed=42, step 5) β”‚
216
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
217
+ β”‚
218
+              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
219
+ β”‚ β”‚ β”‚
220
+ β–Ό β–Ό β–Ό
221
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
222
+ β”‚ format_valid β”‚ β”‚ action_legal β”‚ β”‚ env_reward β”‚
223
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
224
+ β”‚ parses? β”‚ β”‚ action ∈ 10 ? β”‚ β”‚ env.replay(seed=42,β”‚
225
+ β”‚ has belief? β”‚ β”‚ β”‚ β”‚ step5, action) β”‚
226
+ β”‚ β”‚ β”‚ DEEP_WORK βœ“ β”‚ β”‚ β”‚
227
+ β”‚ βœ“ +1.0 β”‚ β”‚ reward 0.0 β”‚ β”‚ deltas: β”‚
228
+ β”‚ Γ— wt 0.05 β”‚ β”‚ Γ— wt 0.05 β”‚ β”‚ V-0.10 C-0.12 β”‚
229
+ β”‚ β”‚ β”‚ β”‚ β”‚ P+0.18 S-0.05 β”‚
230
+ β”‚ β†’ +0.05 β”‚ β”‚ β†’ 0.00 β”‚ β”‚ Cn+0.00 β”‚
231
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
232
+ β”‚ profile_reward = β”‚
233
+ β”‚ sum(deltas Γ— β”‚
234
+ β”‚ prof_weights) β”‚
235
+ β”‚ Γ— 15 β”‚
236
+ β”‚ = (-0.10Γ—0.05 β”‚
237
+ β”‚ + -0.12Γ—0.05 β”‚
238
+ β”‚ + 0.18Γ—0.30 β”‚
239
+ β”‚ + -0.05Γ—0.20 β”‚
240
+ β”‚ + 0Γ—0.40) Γ— 15 β”‚
241
+ β”‚ = +0.65 β”‚
242
+ β”‚ β”‚
243
+ β”‚ + grader_bias: β”‚
244
+ β”‚ 0.5Γ—0.18+0.4Γ—0 β”‚
245
+ β”‚ = +0.09 β”‚
246
+ β”‚ + new-action: +0.07β”‚
247
+ β”‚ + b-act coupling: β”‚
248
+ β”‚ work=0.56 (mid), β”‚
249
+ β”‚ no bonus = 0 β”‚
250
+ β”‚ - cycle pen: 0 β”‚
251
+ β”‚ - rep pen: 0 β”‚
252
+ β”‚ β”‚
253
+ β”‚ env_reward = +0.81 β”‚
254
+ β”‚ Γ— wt 1.5 = +1.21 β”‚
255
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
256
+ β”‚
257
+ β”‚
258
+ β–Ό
259
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
260
+ β”‚ belief_accuracy β”‚
261
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
262
+ β”‚ true belief (sampled_42) β”‚
263
+ β”‚ = [0.30, 0.70, 0.50] β”‚
264
+ β”‚ agent belief β”‚
265
+ β”‚ = [0.33, 0.78, 0.56] β”‚
266
+ β”‚ β”‚
267
+ β”‚ MAE = (0.03 + 0.08 + 0.06) β”‚
268
+ β”‚ / 3 = 0.057 β”‚
269
+ β”‚ similarity = 0.943 β”‚
270
+ β”‚ β”‚
271
+ β”‚ baseline (constant 0.5): β”‚
272
+ β”‚ MAE = (0.20+0.20+0.00)/3 β”‚
273
+ β”‚ = 0.133 β”‚
274
+ β”‚ similarity = 0.867 β”‚
275
+ β”‚ β”‚
276
+ β”‚ reward = 0.943 - 0.867 β”‚
277
+ β”‚ = +0.076 β”‚
278
+ β”‚ Γ— wt 3.0 = +0.23 β”‚
279
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
280
+
281
+ Ξ£ TOTAL REWARD (this completion)
282
+ = 0.05 + 0.00 + 1.21 + 0.23
283
+ = +1.49
284
+ ```
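+
+ The two fully specified pieces of this stack, reproduced in code: the belief-accuracy layer
+ (similarity vs. a constant-0.5 baseline) and the final weighted sum. The layer weights
+ (0.05 / 0.05 / 1.5 / 3.0) are read off the diagram; the env-reward internals (profile reward,
+ grader bias, repetition penalties) are not re-implemented here.
+
+ ```python
+ def belief_accuracy_reward(agent_belief, true_belief):
+     def similarity(belief):
+         mae = sum(abs(a - t) for a, t in zip(belief, true_belief)) / len(true_belief)
+         return 1.0 - mae
+     return similarity(agent_belief) - similarity([0.5, 0.5, 0.5])   # beat "always guess the middle"
+
+ def total_reward(format_valid, action_legal, env_reward, belief_reward):
+     return 0.05 * format_valid + 0.05 * action_legal + 1.5 * env_reward + 3.0 * belief_reward
+
+ b = belief_accuracy_reward([0.33, 0.78, 0.56], [0.30, 0.70, 0.50])
+ print(b)                                   # ~= +0.077 (the diagram rounds intermediates to +0.076)
+ print(total_reward(1.0, 0.0, 0.81, b))     # ~= +1.5, matching the diagram's +1.49 up to rounding
+ ```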
285
+
286
+ ---
287
+
288
+ ## 6. GRPO update step (8 completions β†’ 1 gradient update)
289
+
290
+ ```
291
+ ONE TRAINING STEP (one row from the dataset β†’ one gradient update)
292
+
293
+ β”Œβ”€ pick prompt ────────────────────────────────────────────────────────┐
294
+ β”‚ prompt#1247 from dataset: β”‚
295
+ β”‚ state: step 5, seed=42, history=[deep_work, sleep, ...] β”‚
296
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
297
+ β”‚
298
+ β–Ό generate 8 completions @ temp=1.5
299
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
300
+ β”‚ c1: "3 7 5 DEEP_WORK" β†’ reward +1.49 β”‚
301
+ β”‚ c2: "5 5 5 SLEEP" β†’ reward -0.21 (constant belief: -0.03 β”‚
302
+ β”‚ + sleep replay: -0.18) β”‚
303
+ β”‚ c3: "3 6 4 ADMIN_WORK" β†’ reward +1.10 β”‚
304
+ β”‚ c4: "4 7 6 LEARN" β†’ reward +1.32 β”‚
305
+ β”‚ c5: "2 8 3 MEDITATE" β†’ reward +0.45 (rep pen: -0.10 since β”‚
306
+ β”‚ meditate at step 3 too) β”‚
307
+ β”‚ c6: "3 7 7 DEEP_WORK" β†’ reward +1.55 (belief slightly better) β”‚
308
+ β”‚ c7: "5 5 5 EXERCISE" β†’ reward -0.15 β”‚
309
+ β”‚ c8: "4 6 5 FAMILY_TIME" β†’ reward +0.92 β”‚
310
+ β”‚ β”‚
311
+ β”‚ group_mean = +0.81 β”‚
312
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
313
+ β”‚
314
+ β–Ό advantages = reward - group_mean
315
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
316
+ β”‚ ADVANTAGE (this is what GRPO actually backprops on) β”‚
317
+ β”‚ c1: +0.68 ← strongly preferred β”‚
318
+ β”‚ c2: -1.02 ← strongly discouraged (constant belief) β”‚
319
+ β”‚ c3: +0.29 β”‚
320
+ β”‚ c4: +0.51 β”‚
321
+ β”‚ c5: -0.36 β”‚
322
+ β”‚ c6: +0.74 ← most preferred β”‚
323
+ β”‚ c7: -0.96 β”‚
324
+ β”‚ c8: +0.11 β”‚
325
+ β”‚ β”‚
326
+ β”‚ KEY INSIGHT: only the SPREAD matters. If all 8 had the β”‚
327
+ β”‚ same reward, advantages would all be 0 β†’ no gradient. β”‚
328
+ β”‚ This is why iter 1 mode-collapsed: format_valid +1.0 for β”‚
329
+ β”‚ every completion meant zero variance from that layer. β”‚
330
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
331
+ β”‚
332
+ β–Ό policy loss = -E[ adv Γ— log_prob(completion) ]
333
+ β”‚ + Ξ² Γ— KL(policy || base_qwen)
334
+ β”‚
335
+ β–Ό backprop (only LoRA weights, ~30M params)
336
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
337
+ β”‚ Model nudge: β”‚
338
+ β”‚ push: "3 7 5 DEEP_WORK"-like outputs UP β”‚
339
+ β”‚ push: "3 7 7 DEEP_WORK"-like outputs UP β”‚
340
+ β”‚ pull: "5 5 5 *"-like outputs DOWN β”‚
341
+ β”‚ β”‚
342
+ β”‚ KL constraint: prevents the policy from diverging too β”‚
343
+ β”‚ far from base Qwen (avoids gibberish drift). β”‚
344
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
345
+
346
+ Repeat 800-2000 times. Each step ~3-5 sec on A100.
347
+ ```
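+
+ A minimal sketch of the advantage computation this step relies on. Subtracting the group
+ mean (optionally dividing by the group std, as GRPO implementations commonly do) is all
+ that turns raw rewards into a learning signal, which is why a zero-spread group produces
+ zero gradient.
+
+ ```python
+ def group_advantages(rewards, normalize=False, eps=1e-6):
+     mean = sum(rewards) / len(rewards)
+     adv = [r - mean for r in rewards]
+     if normalize:
+         std = (sum(a * a for a in adv) / len(adv)) ** 0.5
+         adv = [a / (std + eps) for a in adv]
+     return adv
+
+ rewards = [1.49, -0.21, 1.10, 1.32, 0.45, 1.55, -0.15, 0.92]   # the 8 completions above
+ print([round(a, 2) for a in group_advantages(rewards)])
+ # [0.68, -1.02, 0.29, 0.51, -0.36, 0.74, -0.96, 0.11]
+
+ print(group_advantages([1.0] * 8))   # identical rewards -> all zeros -> no gradient (mode collapse)
+ ```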
348
+
349
+ ---
350
+
351
+ ## 7. The dataset is just starting positions (not labels)
352
+
353
+ ```
354
+ DATASET (~3000 rows, generated ONCE before training)
355
+
356
+ For 200-300 episodes:
357
+ env.reset(seed=N)
358
+ for step in range(28):
359
+ record {
360
+ prompt: [system, user_for_this_state],
361
+ seed: N, ◄─── replay metadata
362
+ step_index: <current step>,
363
+ action_history: <actions taken so far>,
364
+ profile_mode: "continuous",
365
+ }
366
+ env.step(rollout_policy(obs)) ◄─ rollout=heuristic OR random
367
+ (only matters for STATE diversity,
368
+ the agent's training generations
369
+ REPLACE these actions)
370
+
371
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
372
+ β”‚ A row from the dataset: β”‚
373
+ β”‚ { β”‚
374
+ β”‚ prompt: [...full state at step 5 of episode seed=42...] β”‚
375
+ β”‚ seed: 42, β”‚
376
+ β”‚ step_index: 5, β”‚
377
+ β”‚ action_history: ["deep_work", "sleep", "socialize", β”‚
378
+ β”‚ "meditate", "deep_work"], β”‚
379
+ β”‚ profile_mode: "continuous" β”‚
380
+ β”‚ } β”‚
381
+ β”‚ β”‚
382
+ β”‚ NOTE: NO "correct action" or "label" anywhere. β”‚
383
+ β”‚ The reward function reconstructs the env from this metadata β”‚
384
+ β”‚ and scores whatever action the LLM picks. β”‚
385
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
386
+
387
+ This is fundamentally different from supervised learning:
388
+ - Supervised: (input, target_output) β€” model learns to mimic target
389
+ - GRPO: (prompt, replay_metadata) β€” model learns to maximize reward
390
+ ```
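+
+ A sketch of that generation loop. `env`, `build_prompt`, and `rollout_policy` are assumed
+ helpers standing in for whatever the repo actually uses; the row schema is the one shown above.
+
+ ```python
+ import random
+
+ def generate_rows(env, build_prompt, rollout_policy, n_episodes=250, horizon=28):
+     rows = []
+     for _ in range(n_episodes):
+         seed = random.randrange(10_000)          # training seeds stay below the OOD range
+         obs = env.reset(seed=seed)
+         history = []
+         for step_index in range(horizon):
+             rows.append({
+                 "prompt": build_prompt(obs, history),   # [system, user] for this state
+                 "seed": seed,                           # replay metadata, not a label
+                 "step_index": step_index,
+                 "action_history": list(history),
+                 "profile_mode": "continuous",
+             })
+             action = rollout_policy(obs)                # heuristic or random; only for state diversity
+             obs = env.step(action)
+             history.append(action)
+     return rows
+ ```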
391
+
392
+ ---
393
+
394
+ ## 8. Final episode grader (only fires at step 28)
395
+
396
+ ```
397
+ _grade_episode() β€” runs at done=True
398
+
399
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
400
+ β”‚ final_score β”‚
401
+ β”‚ ∈ [0, 1] β”‚
402
+ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
403
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
404
+ β”‚ β”‚ β”‚ β”‚ β”‚
405
+ β–Ό β–Ό β–Ό β–Ό β–Ό
406
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
407
+ β”‚ crash_free β”‚ β”‚ progress β”‚ β”‚ connectionβ”‚ β”‚ adaptation β”‚ β”‚ efficiency β”‚
408
+ β”‚ Γ— 0.20 β”‚ β”‚ Γ— 0.25 β”‚ β”‚ Γ— 0.15 β”‚ β”‚ Γ— 0.30 β”‚ β”‚ Γ— 0.10 β”‚
409
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
410
+ β”‚ 1 - crashesβ”‚ β”‚ final P β”‚ β”‚ final Cn β”‚ β”‚ late-half β”‚ β”‚ avg_reward β”‚
411
+ β”‚ /total_ck β”‚ β”‚ value β”‚ β”‚ value β”‚ β”‚ mean rewardβ”‚ β”‚ normalized β”‚
412
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ - early β”‚ β”‚ to [0,1] β”‚
413
+ β”‚ e.g. 0.95 β”‚ β”‚ e.g. 0.42 β”‚ β”‚ e.g. 0.51β”‚ β”‚ e.g. +0.18 β”‚ β”‚ e.g. 0.55 β”‚
414
+ β”‚ Γ—0.20=0.19 β”‚ β”‚ Γ—0.25=0.105β”‚ β”‚ Γ—0.15=0.08β”‚ β”‚ Γ—0.30=0.054β”‚ β”‚ Γ—0.10=0.055β”‚
415
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ └──────────┘ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
416
+
417
+ Ξ£ = 0.19 + 0.105 + 0.08 + 0.054 + 0.055
418
+ = 0.484 ← final_score
419
+
420
+ Plus the sparse terminal reward introduced in iter 4 (added to step 27's per-step reward):
421
+ terminal_bonus = (0.484 - 0.5) Γ— 5 = -0.08
422
+
423
+ This means that at step 27 the agent gets its last per-step reward plus the grader bonus.
424
+ This is the only direct gradient signal pointing at the actual episode quality.
425
+ ```
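+
+ The grader reduced to code, using only what the diagram states: the five weights and the
+ (score - 0.5) x 5 terminal bonus. The component values themselves (crash-free rate, final
+ meters, adaptation, efficiency) are passed in rather than recomputed.
+
+ ```python
+ def grade_episode(crash_free, final_progress, final_connection, adaptation, efficiency):
+     score = (0.20 * crash_free
+              + 0.25 * final_progress
+              + 0.15 * final_connection
+              + 0.30 * adaptation
+              + 0.10 * efficiency)
+     terminal_bonus = (score - 0.5) * 5     # added onto step 27's per-step reward (iter 4+)
+     return score, terminal_bonus
+
+ print(grade_episode(0.95, 0.42, 0.51, 0.18, 0.55))
+ # ~= (0.48, -0.10); the diagram rounds each term first and reports 0.484 / -0.08
+ ```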
426
+
427
+ ---
428
+
429
+ ## 9. Three eval conditions (post-training)
430
+
431
+ ```
432
+ inference_eval.py runs ALL THREE
433
+
434
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
435
+ β”‚ discrete-3-profiles β”‚ continuous-in-distributionβ”‚ continuous-OOD β”‚
436
+ β”‚ (legacy comparison) β”‚ (was the agent able to β”‚ (does meta-policy β”‚
437
+ β”‚ β”‚ learn the meta-policy?) β”‚ generalize?) β”‚
438
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
439
+ β”‚ env.reset(seed=N, β”‚ env.reset(seed=N) β”‚ env.reset(seed=10000+N) β”‚
440
+ β”‚ profile="introvert_ β”‚ ← samples from training β”‚ ← samples from a region β”‚
441
+ β”‚ morning") β”‚ distribution β”‚ never seen in trainingβ”‚
442
+ β”‚ β”‚ β”‚ β”‚
443
+ β”‚ β†’ 3 hardcoded profiles β”‚ β†’ ~10 sampled profiles β”‚ β†’ ~10 sampled profiles β”‚
444
+ β”‚ (from PROFILES list) β”‚ from seeds 100..110 β”‚ from seeds 10000..10010 β”‚
445
+ β”‚ β”‚ β”‚ β”‚
446
+ β”‚ Each strategy plays each β”‚ Each strategy plays each β”‚ Each strategy plays each β”‚
447
+ β”‚ profile 5x = 15 episodes β”‚ seed 1x = 10 episodes β”‚ seed 1x = 10 episodes β”‚
448
+ β”‚ β”‚ β”‚ β”‚
449
+ β”‚ Strategies tested: β”‚ Strategies tested: β”‚ Strategies tested: β”‚
450
+ β”‚ β€’ random β”‚ β€’ random β”‚ β€’ random β”‚
451
+ β”‚ β€’ heuristic β”‚ β€’ heuristic β”‚ β€’ heuristic β”‚
452
+ β”‚ β€’ model (trained Qwen) β”‚ β€’ model (trained Qwen) β”‚ β€’ model (trained Qwen) β”‚
453
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
454
+
455
+ THE KEY METRIC: trained model's score on continuous-OOD vs heuristic baseline.
456
+
457
+ Heuristic baseline (profile-blind hand rules): score 0.580 on OOD.
458
+ Trained meta-RL agent's target: > 0.580 on OOD.
459
+
460
+ If the trained agent beats the heuristic on OOD (profiles never seen in
461
+ training), that's strong evidence it learned the SKILL of profile inference,
462
+ not just memorized training profiles.
463
+ ```
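+
+ How the three conditions translate into reset() calls, as a sketch. The seed ranges and the
+ one named discrete profile follow the table; the other two hardcoded profile names, the
+ step/reset return shapes, and the `final_score()` accessor are assumptions.
+
+ ```python
+ DISCRETE_PROFILES = ["introvert_morning"]    # plus 2 more hardcoded profiles not named in this doc
+
+ EVAL_CONDITIONS = {
+     "discrete-3":         [{"seed": n, "profile": p} for p in DISCRETE_PROFILES for n in range(5)],
+     "continuous-in-dist": [{"seed": n} for n in range(100, 110)],
+     "continuous-ood":     [{"seed": 10_000 + n} for n in range(10)],
+ }
+
+ def run_condition(env, strategy, episodes):
+     scores = []
+     for kwargs in episodes:
+         obs, done = env.reset(**kwargs), False
+         while not done:
+             obs, done = env.step(strategy(obs))
+         scores.append(env.final_score())     # assumed accessor for the grader's final_score
+     return sum(scores) / len(scores)         # compare model vs. random vs. heuristic per condition
+ ```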
464
+
465
+ ---
466
+
467
+ ## 10. Spend & timing (concrete)
468
+
469
+ ```
470
+ HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)
471
+
472
+ FAST_MODE (200-800 steps):
473
+ dataset gen: ~2 min
474
+ model load: ~3 min
475
+ training: ~10-25 min (depends on steps)
476
+ eval: ~3 min
477
+ plot + upload: ~2 min
478
+ ────────────────────────────
479
+ total: 20-35 min ($0.80-1.50 per iter)
480
+
481
+ FULL RUN (2000 steps):
482
+ dataset gen: ~3 min
483
+ model load: ~3 min
484
+ training: ~60-90 min
485
+ eval: ~3 min
486
+ plot + upload: ~2 min
487
+ ────────────────────────────
488
+ total: 70-100 min ($3-4)
489
+
490
+ Iter 1 (200 steps): $0.50 ❌ mode collapse (single action)
491
+ Iter 2 (400 steps): $1.50 ❌ mode collapse (2-cycle)
492
+ Iter 3 (800 steps): $5 ⏳ in flight (control)
493
+ Iter 4 (800 steps): $5 ⏳ in flight (with full fixes)
494
+ Final (2000 steps): $4 ⏳ pending iter 3+4 results
495
+ ──────
496
+ ~$16 of $30 budget
497
+ ```