InosLihka Claude Opus 4.7 (1M context) committed on
Commit
4c69214
·
1 Parent(s): 105973d

docs: detailed architecture diagrams with concrete values


10 visual diagrams (mostly ASCII trees and box layouts) walking through:

1. Three-layer system separation (agent / orchestration / env)
2. Episode timeline (28 steps = 7 days x 4 slots)
3. State tree (hidden profile + visible meters + history with anomalies)
4. What the agent sees (concrete prompt + completion example)
5. Reward stack with worked example: '3 7 5 DEEP_WORK' for seed=42 step 5
- Layer-by-layer reward computation with actual numbers
- Final reward = +1.49 (sum of 4 weighted layers)
6. GRPO update step (8 completions -> advantages -> backprop)
- Shows why mode collapse happens when reward_std=0
7. Dataset structure (replay metadata, NOT supervised labels)
8. Final episode grader (5 components weighted 0.20/0.25/0.15/0.30/0.10)
9. Three eval conditions (discrete-3, in-dist, OOD)
10. Spend & timing per iter

Every diagram uses concrete values from a real example (sampled_42 profile,
specific belief vector, real meter values, real reward calculations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. docs/architecture.md +497 -0
docs/architecture.md ADDED
@@ -0,0 +1,497 @@
1
+ # RhythmEnv β€” Architecture & Training Flow
2
+
3
+ Visual deep dive into how the env is structured and how the LLM agent
4
+ learns from it. All examples use concrete values (seed=42, sampled
5
+ profile, real numbers from the reward calculation).
6
+
7
+ ---
8
+
9
+ ## 1. System: three-layer separation
10
+
11
+ ```
12
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
13
+ β”‚ AGENT (Qwen 2.5-3B + LoRA r=16, 4-bit) β”‚
14
+ β”‚ β”‚
15
+ β”‚ Input: prompt (state + history) β”‚
16
+ β”‚ Output: "3 7 5 DEEP_WORK" β”‚
17
+ β”‚ ↑ ↑ ↑ ↑ β”‚
18
+ β”‚ social|morn|work action β”‚
19
+ β”‚ belief|pref|pref β”‚
20
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
21
+ β”‚
22
+ β”‚ N=8 completions per prompt
23
+ β”‚ (sampling temp=1.5)
24
+ β–Ό
25
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
26
+ β”‚ ORCHESTRATION (TRL GRPOTrainer + Unsloth) β”‚
27
+ β”‚ β”‚
28
+ β”‚ β€’ Picks 1 prompt from dataset (~3000 rows) β”‚
29
+ β”‚ β€’ Generates 8 completions β”‚
30
+ β”‚ β€’ Calls 4 reward functions in parallel β”‚
31
+ β”‚ β€’ Computes group-relative advantages β”‚
32
+ β”‚ β€’ Backprop on LoRA weights only (~30M) β”‚
33
+ β”‚ β€’ KL constraint to base Qwen (Ξ²=0.04) β”‚
34
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
35
+ β”‚
36
+ β”‚ env.reset(seed) β†’ step(action)
37
+ β”‚ (replay-based reward)
38
+ β–Ό
39
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
40
+ β”‚ ENVIRONMENT (RhythmEnvironment / FastAPI) β”‚
41
+ β”‚ β”‚
42
+ β”‚ reset(seed=42) β†’ samples profile β”‚
43
+ β”‚ step(action) β†’ updates 5 meters, β”‚
44
+ β”‚ returns observation β”‚
45
+ β”‚ + per-step reward β”‚
46
+ β”‚ β”‚
47
+ β”‚ Lives at: huggingface.co/spaces/... β”‚
48
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
49
+ ```
50
+
51
+ The agent never imports env code; agent and environment communicate via the OpenEnv
52
+ HTTP/WebSocket contract: `POST /reset`, `POST /step`, `GET /state`.
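+
+ A minimal client sketch of that contract (Python + `requests`). The three routes come
+ from this doc; the payload fields, the placeholder URL, and the helper names are
+ illustrative assumptions, not the repo's actual client.
+
+ ```python
+ import requests
+
+ BASE = "https://huggingface.co/spaces/..."   # hypothetical; the real Space path is elided above
+
+ def reset(seed: int) -> dict:
+     # Start a new episode; the env samples a hidden profile from the seed.
+     return requests.post(f"{BASE}/reset", json={"seed": seed}).json()
+
+ def step(action: str) -> dict:
+     # Apply one action; the env returns the next observation and per-step reward.
+     return requests.post(f"{BASE}/step", json={"action": action}).json()
+
+ obs = reset(seed=42)
+ for _ in range(28):                  # one episode = 28 steps
+     result = step("DEEP_WORK")       # a real agent would pick the action from obs
+     if result.get("done"):
+         break
+ ```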
53
+
54
+ ---
55
+
56
+ ## 2. One episode = one week (28 steps)
57
+
58
+ ```
59
+ ONE EPISODE = 1 WEEK = 28 STEPS
60
+
61
+ Day 1 (Mon) Day 2 (Tue) ... Day 7 (Sun)
62
+ β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
63
+ β”‚ M β”‚ A β”‚ E β”‚ N β”‚ β”‚ M β”‚ A β”‚ E β”‚ N β”‚ .. β”‚ M β”‚ A β”‚ E β”‚ N β”‚
64
+ β”‚ 0 β”‚ 1 β”‚ 2 β”‚ 3 β”‚ β”‚ 4 β”‚ 5 β”‚ 6 β”‚ 7 β”‚ β”‚ 24 β”‚ 25 β”‚ 26 β”‚ 27 β”‚
65
+ β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
66
+ ↑ ↑ ↑
67
+ reset() random event roll (8% per step) done=True
68
+ meters: V=0.7 meditate? deep_work? grader runs
69
+ C=0.7 P=0 S=0.7 sleep? socialize? final_score [0,1]
70
+ Cn=0.5
71
+
72
+ At each step the agent picks 1 of 10 actions:
73
+
74
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
75
+ β”‚ PRODUCTIVITY β”‚ RECOVERY β”‚ SOCIAL β”‚ LEISURE β”‚
76
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
77
+ β”‚ DEEP_WORK β”‚ SLEEP β”‚ FAMILY_TIME β”‚ ME_TIME β”‚
78
+ β”‚ ADMIN_WORK β”‚ EXERCISE β”‚ SOCIALIZE β”‚ BINGE_WATCH β”‚
79
+ β”‚ LEARN β”‚ MEDITATE β”‚ β”‚ β”‚
80
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
81
+
82
+ Time-of-day multipliers (apply to action effects):
83
+ Morning (M): cognition gains Γ— 1.2 vitality drain Γ— 0.8
84
+ Afternoon (A): cognition gains Γ— 1.0 vitality drain Γ— 1.0
85
+ Evening (E): cognition gains Γ— 0.8 vitality drain Γ— 1.1
86
+ Night (N): cognition gains Γ— 0.6 vitality drain Γ— 1.3
87
+ (sleep BYPASSES these multipliers)
88
+
89
+ Critical thresholds:
90
+ Any meter < 0.1 at end of step β†’ -0.30 reward penalty (one per meter)
91
+ Connection decays passively every step (-0.01 to -0.02 per profile)
92
+ ```
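+
+ A small sketch of how the multiplier table and the critical-threshold penalty could be
+ applied. The numbers are from the table above; the function names and the delta
+ dictionary format are illustrative assumptions.
+
+ ```python
+ SLOT_MULTIPLIERS = {            # slot -> (cognition_gain_mult, vitality_drain_mult)
+     "M": (1.2, 0.8),
+     "A": (1.0, 1.0),
+     "E": (0.8, 1.1),
+     "N": (0.6, 1.3),
+ }
+
+ def apply_slot(deltas: dict, slot: str, action: str) -> dict:
+     """Scale an action's raw meter deltas by the current slot (SLEEP bypasses this)."""
+     if action == "SLEEP":
+         return dict(deltas)
+     cog_mult, vit_mult = SLOT_MULTIPLIERS[slot]
+     out = dict(deltas)
+     if out.get("C", 0) > 0:
+         out["C"] *= cog_mult     # cognition gains scaled up in the morning, down at night
+     if out.get("V", 0) < 0:
+         out["V"] *= vit_mult     # vitality drain cheaper in the morning, harsher at night
+     return out
+
+ def crash_penalty(meters: dict) -> float:
+     # -0.30 for every meter that ends the step below the 0.1 critical threshold.
+     return -0.30 * sum(1 for v in meters.values() if v < 0.1)
+ ```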
93
+
94
+ ---
95
+
96
+ ## 3. State tree: what the env tracks (and what's hidden)
97
+
98
+ ```
99
+ RhythmEnvironment instance state
100
+ β”‚
101
+ β”œβ”€β”€ _profile ◄──── HIDDEN from agent ─────────┐
102
+ β”‚ β”‚ β”‚
103
+ β”‚ β”œβ”€β”€ name: "sampled_42" β”‚
104
+ β”‚ β”œβ”€β”€ reward_weights: β”‚
105
+ β”‚ β”‚ {V: 0.05, C: 0.05, P: 0.30, β”‚
106
+ β”‚ β”‚ S: 0.20, Cn: 0.40} β”‚
107
+ β”‚ β”‚ β”‚ Used INTERNALLY
108
+ β”‚ β”œβ”€β”€ social_vitality_multiplier: 1.8 β”‚ to compute reward
109
+ β”‚ β”œβ”€β”€ morning_cognition_bonus: 1.5 β”‚ and modify action
110
+ β”‚ β”œβ”€β”€ evening_night_cognition_bonus: None β”‚ effects.
111
+ β”‚ β”œβ”€β”€ morning_penalty: None β”‚ Belief reward
112
+ β”‚ β”œβ”€β”€ binge_shame: False β”‚ rewards INFERRING
113
+ β”‚ β”œβ”€β”€ progress_serenity_bonus: 0.04 β”‚ this profile.
114
+ β”‚ β”œβ”€β”€ idle_serenity_decay: 0.02 β”‚
115
+ β”‚ β”œβ”€β”€ vitality_decay_rate: 0.01 β”‚
116
+ β”‚ β”œβ”€β”€ stress_tolerance: 0.22 β”‚
117
+ β”‚ β”œβ”€β”€ event_impact_multiplier: 0.7 β”‚
118
+ β”‚ β”œβ”€β”€ connection_decay_rate: 0.012 β”‚
119
+ β”‚ β”œβ”€β”€ solo_serenity_bonus: 0.05 β”‚
120
+ β”‚ β”œβ”€β”€ social_connection_multiplier: 1.4 β”‚
121
+ β”‚ β”œβ”€β”€ social_serenity_bonus: 0.03 β”‚
122
+ β”‚ β”œβ”€β”€ work_vitality_recovery: 0.03 β”‚
123
+ β”‚ └── (14 hidden parameters total) β”‚
124
+ β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
125
+ β”‚
126
+ β”‚ Reduces to a 3-D BELIEF VECTOR (the inference target):
127
+ β”‚ profile_to_belief_vector(profile) β†’ [0.30, 0.70, 0.50]
128
+ β”‚ β”‚ β”‚ β”‚
129
+ β”‚ β”‚ β”‚ work_pref
130
+ β”‚ β”‚ morning_pref
131
+ β”‚ social_pref
132
+ β”‚
133
+ β”œβ”€β”€ meters ◄────── visible to agent ────────────────────────────┐
134
+ β”‚ β”œβ”€β”€ _vitality: 0.62 (range 0–1) β”‚
135
+ β”‚ β”œβ”€β”€ _cognition: 0.51 β”‚
136
+ β”‚ β”œβ”€β”€ _progress: 0.24 β”‚ Agent
137
+ β”‚ β”œβ”€β”€ _serenity: 0.71 β”‚ observes
138
+ β”‚ └── _connection: 0.38 β”‚ in prompt
139
+ β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
140
+ β”‚
141
+ β”œβ”€β”€ _timestep: 5
142
+ β”‚
143
+ β”œβ”€β”€ _step_history: list[StepRecord] ◄── visible to agent ──────┐
144
+ β”‚ β”œβ”€β”€ (last 7 steps) β”‚
145
+ β”‚ β”‚ step 0: deep_work β†’ +0.42 β”‚
146
+ β”‚ β”‚ deltas: V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00 β”‚
147
+ β”‚ β”‚ anomalies: V+0.00 C+0.00 P+0.06 S+0.00 Cn+0.00 ◄─ profile fingerprint
148
+ β”‚ β”‚ step 1: sleep β†’ +0.18 β”‚
149
+ β”‚ β”‚ deltas: V+0.20 C+0.10 P+0.00 S+0.05 Cn+0.00 β”‚
150
+ β”‚ β”‚ anomalies: V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00 β”‚
151
+ β”‚ β”‚ ... β”‚
152
+ β”‚ β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
153
+ β”‚
154
+ └── _step_rewards: [+0.42, +0.18, +0.31, +0.55, ...]
155
+ └── used by grader to compute adaptation_score (late-half - early-half)
156
+ ```
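+
+ The agent-visible part of that tree, sketched as Python dataclasses. Field names follow
+ the diagram; the exact types and names in the real env may differ.
+
+ ```python
+ from dataclasses import dataclass, field
+
+ METERS = ("V", "C", "P", "S", "Cn")   # vitality, cognition, progress, serenity, connection
+
+ @dataclass
+ class StepRecord:
+     action: str
+     reward: float
+     deltas: dict      # how each meter moved this step
+     anomalies: dict   # delta minus the neutral-baseline delta -> the profile fingerprint
+
+ @dataclass
+ class Observation:
+     timestep: int
+     meters: dict                                  # the 5 visible meters
+     history: list = field(default_factory=list)   # last 7 StepRecords only
+
+ # The hidden profile and its 3-D belief target never appear here: the agent has to infer
+ # [social_pref, morning_pref, work_pref] from the anomaly columns alone.
+ ```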
157
+
158
+ ---
159
+
160
+ ## 4. What the agent sees (concrete prompt)
161
+
162
+ ```
163
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ SYSTEM PROMPT ─────────────────────────────────┐
164
+ β”‚ You are a life-management agent helping a person whose preferences are HIDDEN. β”‚
165
+ β”‚ Each step, output ONE LINE in this exact format: β”‚
166
+ β”‚ S M W ACTION_NAME β”‚
167
+ β”‚ S = social pref (0=hates, 9=loves), M = morning, W = work β”‚
168
+ β”‚ ACTION_NAME ∈ {DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, β”‚
169
+ β”‚ FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH} β”‚
170
+ β”‚ Example: 3 8 7 DEEP_WORK β”‚
171
+ β”‚ Tactics: probe early, exploit late; don't repeat actions; ... β”‚
172
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
173
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ USER PROMPT ────────────────────────────────────┐
174
+ β”‚ Step: 5/28 (Tuesday Afternoon) β”‚
175
+ β”‚ Remaining steps: 22 β”‚
176
+ β”‚ β”‚
177
+ β”‚ Meters: β”‚
178
+ β”‚ Vitality: 0.62 β”‚
179
+ β”‚ Cognition: 0.51 β”‚
180
+ β”‚ Progress: 0.24 β”‚
181
+ β”‚ Serenity: 0.71 β”‚
182
+ β”‚ Connection: 0.38 β”‚
183
+ β”‚ β”‚
184
+ β”‚ Recent history (anom = how this person deviated from neutral baseline): β”‚
185
+ β”‚ step 0: deep_work -> reward +0.42 (V-0.10 C-0.12 P+0.18 S-0.05 Cn+0.00) β”‚
186
+ β”‚ [anom V+0.00 C+0.00 P+0.06 S+0.00 Cn+0.00] β”‚
187
+ β”‚ step 1: sleep -> reward +0.18 (V+0.20 C+0.10 P+0.00 S+0.05 Cn+0.00) β”‚
188
+ β”‚ [anom V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00] β”‚
189
+ β”‚ step 2: socialize -> reward -0.05 (V-0.11 C-0.03 P+0.00 S+0.04 Cn+0.17) β”‚
190
+ β”‚ [anom V-0.05 C+0.00 P+0.00 S+0.00 Cn+0.05] ← strong profile signal β”‚
191
+ β”‚ step 3: meditate -> reward +0.30 (V+0.03 C+0.08 P+0.00 S+0.20 Cn+0.00) β”‚
192
+ β”‚ [anom V+0.00 C+0.00 P+0.00 S+0.05 Cn+0.00] ← solo bonus visible β”‚
193
+ β”‚ step 4: deep_work -> reward +0.18 (V-0.06 C-0.06 P+0.18 S+0.04 Cn+0.00) β”‚
194
+ β”‚ [anom V+0.00 C+0.00 P+0.00 S+0.00 Cn+0.00] β”‚
195
+ β”‚ β”‚
196
+ β”‚ Output your belief, then your action (format: S M W ACTION_NAME): β”‚
197
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
198
+
199
+ LLM completion (1 of 8 sampled at temp=1.5):
200
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
201
+ β”‚ "3 7 5 DEEP_WORK" β”‚
202
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
203
+ β–Ό
204
+ Parsed: belief = [0.33, 0.78, 0.56]
205
+ action = DEEP_WORK
206
+ ```
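+
+ A parser sketch that reproduces the example ("3 7 5 DEEP_WORK" -> belief [0.33, 0.78, 0.56],
+ action DEEP_WORK). Dividing the 0-9 digits by 9 matches the numbers shown above, but the
+ repo's actual parsing code may differ in details.
+
+ ```python
+ import re
+
+ ACTIONS = {"DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE", "MEDITATE",
+            "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH"}
+
+ def parse_completion(text: str):
+     """Return (belief, action), or None if the completion is not format-valid."""
+     m = re.search(r"\b(\d)\s+(\d)\s+(\d)\s+([A-Z_]+)", text)
+     if not m:
+         return None
+     belief = [int(d) / 9.0 for d in m.groups()[:3]]   # social, morning, work prefs in [0, 1]
+     action = m.group(4)
+     if action not in ACTIONS:
+         return None
+     return belief, action
+
+ print(parse_completion("3 7 5 DEEP_WORK"))
+ # ([0.333..., 0.777..., 0.555...], 'DEEP_WORK')
+ ```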
207
+
208
+ ---
209
+
210
+ ## 5. The reward stack (4 layers, with concrete values for the example above)
211
+
212
+ ```
213
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
214
+ β”‚ Completion: "3 7 5 DEEP_WORK" β”‚
215
+ β”‚ for prompt above (seed=42, step 5) β”‚
216
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
217
+ β”‚
218
+              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
219
+ β”‚ β”‚ β”‚
220
+ β–Ό β–Ό β–Ό
221
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
222
+ β”‚ format_valid β”‚ β”‚ action_legal β”‚ β”‚ env_reward β”‚
223
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
224
+ β”‚ parses? β”‚ β”‚ action ∈ 10 ? β”‚ β”‚ env.replay(seed=42,β”‚
225
+ β”‚ has belief? β”‚ β”‚ β”‚ β”‚ step5, action) β”‚
226
+ β”‚ β”‚ β”‚ DEEP_WORK βœ“ β”‚ β”‚ β”‚
227
+ β”‚ βœ“ +1.0 β”‚ β”‚ reward 0.0 β”‚ β”‚ deltas: β”‚
228
+ β”‚ Γ— wt 0.05 β”‚ β”‚ Γ— wt 0.05 β”‚ β”‚ V-0.10 C-0.12 β”‚
229
+ β”‚ β”‚ β”‚ β”‚ β”‚ P+0.18 S-0.05 β”‚
230
+ β”‚ β†’ +0.05 β”‚ β”‚ β†’ 0.00 β”‚ β”‚ Cn+0.00 β”‚
231
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
232
+ β”‚ profile_reward = β”‚
233
+ β”‚ sum(deltas Γ— β”‚
234
+ β”‚ prof_weights) β”‚
235
+ β”‚ Γ— 15 β”‚
236
+ β”‚ = (-0.10Γ—0.05 β”‚
237
+ β”‚ + -0.12Γ—0.05 β”‚
238
+ β”‚ + 0.18Γ—0.30 β”‚
239
+ β”‚ + -0.05Γ—0.20 β”‚
240
+ β”‚ + 0Γ—0.40) Γ— 15 β”‚
241
+ β”‚ = +0.65 β”‚
242
+ β”‚ β”‚
243
+ β”‚ + grader_bias: β”‚
244
+ β”‚ 0.5Γ—0.18+0.4Γ—0 β”‚
245
+ β”‚ = +0.09 β”‚
246
+ β”‚ + new-action: +0.07β”‚
247
+ β”‚ + b-act coupling: β”‚
248
+ β”‚ work=0.56 (mid), β”‚
249
+ β”‚ no bonus = 0 β”‚
250
+ β”‚ - cycle pen: 0 β”‚
251
+ β”‚ - rep pen: 0 β”‚
252
+ β”‚ β”‚
253
+ β”‚ env_reward = +0.81 β”‚
254
+ β”‚ Γ— wt 1.5 = +1.21 β”‚
255
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
256
+ β”‚
257
+ β”‚
258
+ β–Ό
259
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
260
+ β”‚ belief_accuracy β”‚
261
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
262
+ β”‚ true belief (sampled_42) β”‚
263
+ β”‚ = [0.30, 0.70, 0.50] β”‚
264
+ β”‚ agent belief β”‚
265
+ β”‚ = [0.33, 0.78, 0.56] β”‚
266
+ β”‚ β”‚
267
+ β”‚ MAE = (0.03 + 0.08 + 0.06) β”‚
268
+ β”‚ / 3 = 0.057 β”‚
269
+ β”‚ similarity = 0.943 β”‚
270
+ β”‚ β”‚
271
+ β”‚ baseline (constant 0.5): β”‚
272
+ β”‚ MAE = (0.20+0.20+0.00)/3 β”‚
273
+ β”‚ = 0.133 β”‚
274
+ β”‚ similarity = 0.867 β”‚
275
+ β”‚ β”‚
276
+ β”‚ reward = 0.943 - 0.867 β”‚
277
+ β”‚ = +0.076 β”‚
278
+ β”‚ Γ— wt 3.0 = +0.23 β”‚
279
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
280
+
281
+ Ξ£ TOTAL REWARD (this completion)
282
+ = 0.05 + 0.00 + 1.21 + 0.23
283
+ = +1.49
284
+ ```
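+
+ The two fully specified pieces of this stack, reproduced in code: the belief-accuracy layer
+ (similarity vs. a constant-0.5 baseline) and the final weighted sum. The layer weights
+ (0.05 / 0.05 / 1.5 / 3.0) are read off the diagram; the env-reward internals (profile reward,
+ grader bias, repetition penalties) are not re-implemented here.
+
+ ```python
+ def belief_accuracy_reward(agent_belief, true_belief):
+     def similarity(belief):
+         mae = sum(abs(a - t) for a, t in zip(belief, true_belief)) / len(true_belief)
+         return 1.0 - mae
+     return similarity(agent_belief) - similarity([0.5, 0.5, 0.5])   # beat "always guess the middle"
+
+ def total_reward(format_valid, action_legal, env_reward, belief_reward):
+     return 0.05 * format_valid + 0.05 * action_legal + 1.5 * env_reward + 3.0 * belief_reward
+
+ b = belief_accuracy_reward([0.33, 0.78, 0.56], [0.30, 0.70, 0.50])
+ print(b)                                   # ~= +0.077 (the diagram rounds intermediates to +0.076)
+ print(total_reward(1.0, 0.0, 0.81, b))     # ~= +1.5, matching the diagram's +1.49 up to rounding
+ ```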
285
+
286
+ ---
287
+
288
+ ## 6. GRPO update step (8 completions β†’ 1 gradient update)
289
+
290
+ ```
291
+ ONE TRAINING STEP (one row from the dataset β†’ one gradient update)
292
+
293
+ β”Œβ”€ pick prompt ────────────────────────────────────────────────────────┐
294
+ β”‚ prompt#1247 from dataset: β”‚
295
+ β”‚ state: step 5, seed=42, history=[deep_work, sleep, ...] β”‚
296
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
297
+ β”‚
298
+ β–Ό generate 8 completions @ temp=1.5
299
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
300
+ β”‚ c1: "3 7 5 DEEP_WORK" β†’ reward +1.49 β”‚
301
+ β”‚ c2: "5 5 5 SLEEP" β†’ reward -0.21 (constant belief: -0.03 β”‚
302
+ β”‚ + sleep replay: -0.18) β”‚
303
+ β”‚ c3: "3 6 4 ADMIN_WORK" β†’ reward +1.10 β”‚
304
+ β”‚ c4: "4 7 6 LEARN" β†’ reward +1.32 β”‚
305
+ β”‚ c5: "2 8 3 MEDITATE" β†’ reward +0.45 (rep pen: -0.10 since β”‚
306
+ β”‚ meditate at step 3 too) β”‚
307
+ β”‚ c6: "3 7 7 DEEP_WORK" β†’ reward +1.55 (belief slightly better) β”‚
308
+ β”‚ c7: "5 5 5 EXERCISE" β†’ reward -0.15 β”‚
309
+ β”‚ c8: "4 6 5 FAMILY_TIME" β†’ reward +0.92 β”‚
310
+ β”‚ β”‚
311
+ β”‚ group_mean = +0.81 β”‚
312
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
313
+ β”‚
314
+ β–Ό advantages = reward - group_mean
315
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
316
+ β”‚ ADVANTAGE (this is what GRPO actually backprops on) β”‚
317
+ β”‚ c1: +0.68 ← strongly preferred β”‚
318
+ β”‚ c2: -1.02 ← strongly discouraged (constant belief) β”‚
319
+ β”‚ c3: +0.29 β”‚
320
+ β”‚ c4: +0.51 β”‚
321
+ β”‚ c5: -0.36 β”‚
322
+ β”‚ c6: +0.74 ← most preferred β”‚
323
+ β”‚ c7: -0.96 β”‚
324
+ β”‚ c8: +0.11 β”‚
325
+ β”‚ β”‚
326
+ β”‚ KEY INSIGHT: only the SPREAD matters. If all 8 had the β”‚
327
+ β”‚ same reward, advantages would all be 0 β†’ no gradient. β”‚
328
+ β”‚ This is why iter 1 mode-collapsed: format_valid +1.0 for β”‚
329
+ β”‚ every completion meant zero variance from that layer. β”‚
330
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
331
+ β”‚
332
+ β–Ό policy loss = -E[ adv Γ— log_prob(completion) ]
333
+ β”‚ + Ξ² Γ— KL(policy || base_qwen)
334
+ β”‚
335
+ β–Ό backprop (only LoRA weights, ~30M params)
336
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
337
+ β”‚ Model nudge: β”‚
338
+ β”‚ push: "3 7 5 DEEP_WORK"-like outputs UP β”‚
339
+ β”‚ push: "3 7 7 DEEP_WORK"-like outputs UP β”‚
340
+ β”‚ pull: "5 5 5 *"-like outputs DOWN β”‚
341
+ β”‚ β”‚
342
+ β”‚ KL constraint: prevents the policy from diverging too β”‚
343
+ β”‚ far from base Qwen (avoids gibberish drift). β”‚
344
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
345
+
346
+ Repeat 800-2000 times. Each step ~3-5 sec on A100.
347
+ ```
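+
+ A minimal sketch of the advantage computation this step relies on. Subtracting the group
+ mean (optionally dividing by the group std, as GRPO implementations commonly do) is all
+ that turns raw rewards into a learning signal, which is why a zero-spread group produces
+ zero gradient.
+
+ ```python
+ def group_advantages(rewards, normalize=False, eps=1e-6):
+     mean = sum(rewards) / len(rewards)
+     adv = [r - mean for r in rewards]
+     if normalize:
+         std = (sum(a * a for a in adv) / len(adv)) ** 0.5
+         adv = [a / (std + eps) for a in adv]
+     return adv
+
+ rewards = [1.49, -0.21, 1.10, 1.32, 0.45, 1.55, -0.15, 0.92]   # the 8 completions above
+ print([round(a, 2) for a in group_advantages(rewards)])
+ # [0.68, -1.02, 0.29, 0.51, -0.36, 0.74, -0.96, 0.11]
+
+ print(group_advantages([1.0] * 8))   # identical rewards -> all zeros -> no gradient (mode collapse)
+ ```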
348
+
349
+ ---
350
+
351
+ ## 7. The dataset is just starting positions (not labels)
352
+
353
+ ```
354
+ DATASET (~3000 rows, generated ONCE before training)
355
+
356
+ For 200-300 episodes:
357
+ env.reset(seed=N)
358
+ for step in range(28):
359
+ record {
360
+ prompt: [system, user_for_this_state],
361
+ seed: N, ◄─── replay metadata
362
+ step_index: <current step>,
363
+ action_history: <actions taken so far>,
364
+ profile_mode: "continuous",
365
+ }
366
+ env.step(rollout_policy(obs)) ◄─ rollout=heuristic OR random
367
+ (only matters for STATE diversity,
368
+ the agent's training generations
369
+ REPLACE these actions)
370
+
371
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
372
+ β”‚ A row from the dataset: β”‚
373
+ β”‚ { β”‚
374
+ β”‚ prompt: [...full state at step 5 of episode seed=42...] β”‚
375
+ β”‚ seed: 42, β”‚
376
+ β”‚ step_index: 5, β”‚
377
+ β”‚ action_history: ["deep_work", "sleep", "socialize", β”‚
378
+ β”‚ "meditate", "deep_work"], β”‚
379
+ β”‚ profile_mode: "continuous" β”‚
380
+ β”‚ } β”‚
381
+ β”‚ β”‚
382
+ β”‚ NOTE: NO "correct action" or "label" anywhere. β”‚
383
+ β”‚ The reward function reconstructs the env from this metadata β”‚
384
+ β”‚ and scores whatever action the LLM picks. β”‚
385
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
386
+
387
+ This is fundamentally different from supervised learning:
388
+ - Supervised: (input, target_output) β€” model learns to mimic target
389
+ - GRPO: (prompt, replay_metadata) β€” model learns to maximize reward
390
+ ```
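+
+ A sketch of that generation loop. `env`, `build_prompt`, and `rollout_policy` are assumed
+ helpers standing in for whatever the repo actually uses; the row schema is the one shown above.
+
+ ```python
+ import random
+
+ def generate_rows(env, build_prompt, rollout_policy, n_episodes=250, horizon=28):
+     rows = []
+     for _ in range(n_episodes):
+         seed = random.randrange(10_000)          # training seeds stay below the OOD range
+         obs = env.reset(seed=seed)
+         history = []
+         for step_index in range(horizon):
+             rows.append({
+                 "prompt": build_prompt(obs, history),   # [system, user] for this state
+                 "seed": seed,                           # replay metadata, not a label
+                 "step_index": step_index,
+                 "action_history": list(history),
+                 "profile_mode": "continuous",
+             })
+             action = rollout_policy(obs)                # heuristic or random; only for state diversity
+             obs = env.step(action)
+             history.append(action)
+     return rows
+ ```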
391
+
392
+ ---
393
+
394
+ ## 8. Final episode grader (only fires at step 28)
395
+
396
+ ```
397
+ _grade_episode() β€” runs at done=True
398
+
399
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
400
+ β”‚ final_score β”‚
401
+ β”‚ ∈ [0, 1] β”‚
402
+ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
403
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
404
+ β”‚ β”‚ β”‚ β”‚ β”‚
405
+ β–Ό β–Ό β–Ό β–Ό β–Ό
406
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
407
+ β”‚ crash_free β”‚ β”‚ progress β”‚ β”‚ connectionβ”‚ β”‚ adaptation β”‚ β”‚ efficiency β”‚
408
+ β”‚ Γ— 0.20 β”‚ β”‚ Γ— 0.25 β”‚ β”‚ Γ— 0.15 β”‚ β”‚ Γ— 0.30 β”‚ β”‚ Γ— 0.10 β”‚
409
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
410
+ β”‚ 1 - crashesβ”‚ β”‚ final P β”‚ β”‚ final Cn β”‚ β”‚ late-half β”‚ β”‚ avg_reward β”‚
411
+ β”‚ /total_ck β”‚ β”‚ value β”‚ β”‚ value β”‚ β”‚ mean rewardβ”‚ β”‚ normalized β”‚
412
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ - early β”‚ β”‚ to [0,1] β”‚
413
+ β”‚ e.g. 0.95 β”‚ β”‚ e.g. 0.42 β”‚ β”‚ e.g. 0.51β”‚ β”‚ e.g. +0.18 β”‚ β”‚ e.g. 0.55 β”‚
414
+ β”‚ Γ—0.20=0.19 β”‚ β”‚ Γ—0.25=0.105β”‚ β”‚ Γ—0.15=0.08β”‚ β”‚ Γ—0.30=0.054β”‚ β”‚ Γ—0.10=0.055β”‚
415
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ └──────────┘ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
416
+
417
+ Ξ£ = 0.19 + 0.105 + 0.08 + 0.054 + 0.055
418
+ = 0.484 ← final_score
419
+
420
+ Plus the sparse terminal reward introduced in iter 4 (added to step 27's per-step reward):
421
+ terminal_bonus = (0.484 - 0.5) Γ— 5 = -0.08
422
+
423
+ This means that at step 27 the agent gets its last per-step reward plus the grader bonus.
424
+ This is the only direct gradient signal pointing at the actual episode quality.
425
+ ```
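+
+ The grader reduced to code, using only what the diagram states: the five weights and the
+ (score - 0.5) x 5 terminal bonus. The component values themselves (crash-free rate, final
+ meters, adaptation, efficiency) are passed in rather than recomputed.
+
+ ```python
+ def grade_episode(crash_free, final_progress, final_connection, adaptation, efficiency):
+     score = (0.20 * crash_free
+              + 0.25 * final_progress
+              + 0.15 * final_connection
+              + 0.30 * adaptation
+              + 0.10 * efficiency)
+     terminal_bonus = (score - 0.5) * 5     # added onto step 27's per-step reward (iter 4+)
+     return score, terminal_bonus
+
+ print(grade_episode(0.95, 0.42, 0.51, 0.18, 0.55))
+ # ~= (0.48, -0.10); the diagram rounds each term first and reports 0.484 / -0.08
+ ```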
426
+
427
+ ---
428
+
429
+ ## 9. Three eval conditions (post-training)
430
+
431
+ ```
432
+ inference_eval.py runs ALL THREE
433
+
434
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
435
+ β”‚ discrete-3-profiles β”‚ continuous-in-distributionβ”‚ continuous-OOD β”‚
436
+ β”‚ (legacy comparison) β”‚ (was the agent able to β”‚ (does meta-policy β”‚
437
+ β”‚ β”‚ learn the meta-policy?) β”‚ generalize?) β”‚
438
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
439
+ β”‚ env.reset(seed=N, β”‚ env.reset(seed=N) β”‚ env.reset(seed=10000+N) β”‚
440
+ β”‚ profile="introvert_ β”‚ ← samples from training β”‚ ← samples from a region β”‚
441
+ β”‚ morning") β”‚ distribution β”‚ never seen in trainingβ”‚
442
+ β”‚ β”‚ β”‚ β”‚
443
+ β”‚ β†’ 3 hardcoded profiles β”‚ β†’ ~10 sampled profiles β”‚ β†’ ~10 sampled profiles β”‚
444
+ β”‚ (from PROFILES list) β”‚ from seeds 100..110 β”‚ from seeds 10000..10010 β”‚
445
+ β”‚ β”‚ β”‚ β”‚
446
+ β”‚ Each strategy plays each β”‚ Each strategy plays each β”‚ Each strategy plays each β”‚
447
+ β”‚ profile 5x = 15 episodes β”‚ seed 1x = 10 episodes β”‚ seed 1x = 10 episodes β”‚
448
+ β”‚ β”‚ β”‚ β”‚
449
+ β”‚ Strategies tested: β”‚ Strategies tested: β”‚ Strategies tested: β”‚
450
+ β”‚ β€’ random β”‚ β€’ random β”‚ β€’ random β”‚
451
+ β”‚ β€’ heuristic β”‚ β€’ heuristic β”‚ β€’ heuristic β”‚
452
+ β”‚ β€’ model (trained Qwen) β”‚ β€’ model (trained Qwen) β”‚ β€’ model (trained Qwen) β”‚
453
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
454
+
455
+ THE KEY METRIC: trained model's score on continuous-OOD vs heuristic baseline.
456
+
457
+ Heuristic baseline (profile-blind hand rules): score 0.580 on OOD.
458
+ Trained meta-RL agent's target: > 0.580 on OOD.
459
+
460
+ If the trained agent beats the heuristic on OOD (profiles never seen in
461
+ training), that's strong evidence it learned the SKILL of profile inference,
462
+ not just memorized training profiles.
463
+ ```
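+
+ How the three conditions translate into reset() calls, as a sketch. The seed ranges and the
+ one named discrete profile follow the table; the other two hardcoded profile names, the
+ step/reset return shapes, and the `final_score()` accessor are assumptions.
+
+ ```python
+ DISCRETE_PROFILES = ["introvert_morning"]    # plus 2 more hardcoded profiles not named in this doc
+
+ EVAL_CONDITIONS = {
+     "discrete-3":         [{"seed": n, "profile": p} for p in DISCRETE_PROFILES for n in range(5)],
+     "continuous-in-dist": [{"seed": n} for n in range(100, 110)],
+     "continuous-ood":     [{"seed": 10_000 + n} for n in range(10)],
+ }
+
+ def run_condition(env, strategy, episodes):
+     scores = []
+     for kwargs in episodes:
+         obs, done = env.reset(**kwargs), False
+         while not done:
+             obs, done = env.step(strategy(obs))
+         scores.append(env.final_score())     # assumed accessor for the grader's final_score
+     return sum(scores) / len(scores)         # compare model vs. random vs. heuristic per condition
+ ```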
464
+
465
+ ---
466
+
467
+ ## 10. Spend & timing (concrete)
468
+
469
+ ```
470
+ HARDWARE: A100 large (80GB) on HF Jobs at $2.50/hr ($0.0417/min)
471
+
472
+ FAST_MODE (200-800 steps):
473
+ dataset gen: ~2 min
474
+ model load: ~3 min
475
+ training: ~10-25 min (depends on steps)
476
+ eval: ~3 min
477
+ plot + upload: ~2 min
478
+ ────────────────────────────
479
+ total: 20-35 min ($0.80-1.50 per iter)
480
+
481
+ FULL RUN (2000 steps):
482
+ dataset gen: ~3 min
483
+ model load: ~3 min
484
+ training: ~60-90 min
485
+ eval: ~3 min
486
+ plot + upload: ~2 min
487
+ ────────────────────────────
488
+ total: 70-100 min ($3-4)
489
+
490
+ Iter 1 (200 steps): $0.50 ❌ mode collapse (single action)
491
+ Iter 2 (400 steps): $1.50 ❌ mode collapse (2-cycle)
492
+ Iter 3 (800 steps): $5 ⏳ in flight (control)
493
+ Iter 4 (800 steps): $5 ⏳ in flight (with full fixes)
494
+ Final (2000 steps): $4 ⏳ pending iter 3+4 results
495
+ ──────
496
+ ~$16 of $30 budget
497
+ ```