InosLihka Claude Opus 4.7 (1M context) committed on
Commit d6d9e31 · 1 Parent(s): e12fc69

tooling: scripts/analyze_iter.py + docs/results.md template


scripts/analyze_iter.py — one-shot iteration analyzer:
- Pulls trained model + eval JSON + log_history from HF Hub for any iteration
- Prints summary table (3 conditions x 3 strategies)
- Action distribution histogram
- Belief diversity check
- Training trajectory snapshots
- Pass/fail verdict against success criteria

docs/results.md — placeholder for the headline results page. Will be filled
in once iter 3 (or final long run) completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. docs/results.md +96 -0
  2. scripts/analyze_iter.py +178 -0
docs/results.md ADDED
@@ -0,0 +1,96 @@
# Training Results

This page summarizes the headline numbers from the final training run.
For the full iteration journey including failed attempts and post-mortems,
see [`iterations.md`](iterations.md).

## Headline result

> **TBD** — populated once iter 3 completes.

Template once we have numbers:

> "Trained agent scored **{X.XXX}** in-distribution and **{X.XXX}** OOD vs the
> heuristic baseline at **{0.587}** / **{0.580}**. Belief inference accuracy
> reached **{X.XX}** (vs neutral baseline 0.50, max 1.00)."
## Final scores by condition

| Condition | Random | Heuristic | Trained Qwen | Δ vs Heuristic |
|---|---|---|---|---|
| discrete-3-profiles (legacy) | 0.554 | 0.584 | TBD | TBD |
| **continuous-in-distribution** | 0.516 | **0.587** | TBD | TBD |
| **continuous-OOD (generalization)** | 0.508 | **0.580** | TBD | TBD |
## Adaptation (mid-episode improvement)

The grader's `adaptation_score` measures whether the agent gets **better** over
the course of the episode — the direct meta-learning signal.

| Condition | Random | Heuristic | Trained Qwen |
|---|---|---|---|
| continuous-in-distribution | -0.253 | -0.349 | TBD |
| continuous-OOD | -0.281 | -0.030 | TBD |

All baselines are negative (neither random nor heuristic adapts; both apply
the same logic from step 0 to step 27). A trained agent showing **positive**
adaptation is direct proof that meta-learning happened.
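The grader's exact formula isn't reproduced here; a minimal sketch of one common definition, assuming `adaptation_score` compares the second half of the episode against the first (the function name, split point, and input shape are assumptions, not the grader's actual code):

```python
from statistics import mean

def adaptation_score(step_scores):
    # Hypothetical reconstruction: mean per-step score of the episode's
    # second half minus its first half. Positive means the agent improved
    # mid-episode; the real grader may weight steps differently.
    half = len(step_scores) // 2
    return mean(step_scores[half:]) - mean(step_scores[:half])

# Under this sketch, a perfectly flat 28-step score sequence gives 0.0,
# and a monotonically improving one is positive:
flat = adaptation_score([0.5] * 28)
improving = adaptation_score([i / 27 for i in range(28)])
```

A non-adapting agent is not forced to 0.0 in practice, since the environment itself can drift, which is consistent with the negative baseline numbers above.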
## Belief learning trajectory

Belief MAE over the course of training (lower is better; 0.50 is a neutral
guess, 0.0 is perfect inference):

> See `plots/belief_accuracy.png` in the trained model repo.

Iter 2 (the failed run with mode collapse) still reached belief MAE ≈ 0.36 for
in-distribution profiles, showing that the belief-learning component of the
pipeline works even when the action policy doesn't. Iter 3 should improve
on this with the belief-first format change.
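A minimal sketch of how the 0.50 neutral-guess floor arises, assuming beliefs and true profiles are equal-length vectors of values in [0, 1] (the helper name is hypothetical):

```python
def belief_mae(predicted, true_profile):
    # Hypothetical helper: mean absolute error between the agent's belief
    # vector and the hidden profile, both with entries in [0, 1].
    pairs = list(zip(predicted, true_profile))
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

# Always guessing 0.5 against extreme (0/1) profile entries yields 0.50,
# which is why 0.50 is the "neutral guess" baseline and 0.0 is perfect:
neutral = belief_mae([0.5, 0.5, 0.5], [1.0, 0.0, 1.0])
perfect = belief_mae([1.0, 0.0], [1.0, 0.0])
```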
## Action diversity

- Iter 1: 99.7% one action (catastrophic collapse).
- Iter 2: 55%/45% split between MEDITATE and EXERCISE (2-cycle collapse).
- Iter 3: TBD (target: ≥ 5 unique actions per episode).
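The ≥ 5-unique-actions target can be checked per episode with a small helper, a sketch mirroring the check `analyze_iter.py` applies over all episodes at once (the function name and the extra action names are made up here):

```python
from collections import Counter

def passes_diversity_target(actions, min_unique=5):
    # True when the episode used at least `min_unique` distinct actions.
    return len(Counter(actions)) >= min_unique

# The iter-2 failure mode, a MEDITATE/EXERCISE 2-cycle, fails the target:
two_cycle = ["MEDITATE", "EXERCISE"] * 14  # 28 steps, only 2 unique actions
ok = passes_diversity_target(two_cycle)    # False
```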
## Plots

The trained model repo contains 5 plots:

- `plots/training_loss.png` — GRPO loss over training steps
- `plots/reward_curve.png` — mean total reward (with ±1 std band)
- `plots/reward_components.png` — all 4 reward layers overlaid
- `plots/belief_accuracy.png` — the meta-RL signal (rolling mean)
- `plots/baseline_vs_trained.png` — final scores + adaptation across 3 conditions

Available at https://huggingface.co/InosLihka/rhythm-env-meta-trained-{ITER}/tree/main/plots
## Cost

| Iter | Cost | Steps | Outcome |
|---|---|---|---|
| 1 | $0.50 | 200 | Mode collapse (single action) |
| 2 | $1.50 | 400 | Mode collapse (2-cycle) |
| 3 | ~$5 | 800 | TBD |
| Final long run (if iter 3 succeeds) | ~$10 | 2000 | TBD |
| **Total budget used** | **TBD** of $30 | | |
## How to reproduce

```bash
# Train (requires HF Jobs access + token)
hf jobs uv run \
  --flavor a100-large \
  --secrets HF_TOKEN \
  -e MODEL_REPO_SUFFIX=myrun \
  scripts/train_on_hf.py

# Eval the trained model locally
python training/inference_eval.py \
  --model_path outputs/rhythmenv_meta_trained \
  --output_file my_eval_results.json

# Analyze any iteration's results
python scripts/analyze_iter.py myrun
```
scripts/analyze_iter.py ADDED
@@ -0,0 +1,178 @@
"""
Generic iteration analyzer — pulls a trained model repo from HF Hub
and summarizes results.

Usage:
    python scripts/analyze_iter.py iter3
    python scripts/analyze_iter.py iter3 --model-suffix iter3
"""

import argparse
import json
import os
import statistics
import sys
from collections import Counter
from pathlib import Path


def main():
    p = argparse.ArgumentParser()
    p.add_argument("iter_name", help="Iteration name, e.g. 'iter3'")
    p.add_argument("--model-suffix", default=None,
                   help="Model repo suffix (defaults to iter_name)")
    p.add_argument("--owner", default="InosLihka")
    p.add_argument("--no-download", action="store_true",
                   help="Skip download, use local files only")
    args = p.parse_args()

    suffix = args.model_suffix or args.iter_name
    repo = f"{args.owner}/rhythm-env-meta-trained-{suffix}"
    local_dir = Path(f"./{args.iter_name}_results")
    local_dir.mkdir(exist_ok=True)

    if not args.no_download:
        for fname in ["eval_results.json", "log_history.json", "training_config.json"]:
            cmd = f"hf download {repo} {fname} --repo-type=model --local-dir {local_dir}"
            print(f">>> {cmd}")
            os.system(cmd)

    # ---- training config ----
    cfg_path = local_dir / "training_config.json"
    if cfg_path.exists():
        with open(cfg_path) as f:
            cfg = json.load(f)
        print()
        print("=== Training config ===")
        for k in ("max_steps", "num_episodes", "max_samples", "num_generations",
                  "lora_rank", "beta", "learning_rate", "hint_fraction"):
            print(f"  {k}: {cfg.get(k)}")

    # ---- eval results ----
    eval_path = local_dir / "eval_results.json"
    if not eval_path.exists():
        print(f"FATAL: {eval_path} not found")
        sys.exit(1)
    with open(eval_path) as f:
        data = json.load(f)

    conditions = sorted(set(r["condition"] for r in data))
    strategies = sorted(set(r["strategy"] for r in data))

    print()
    print("=" * 80)
    print("EVAL SUMMARY")
    print("=" * 80)
    print(f"{'Condition':<40} {'Strategy':<10} {'score':>7} {'adapt':>7} {'belief_mae':>11}")
    print("-" * 80)
    avg_table = {}
    for c in conditions:
        for s in strategies:
            rs = [r for r in data if r["condition"] == c and r["strategy"] == s]
            if not rs:
                continue
            score = statistics.mean(r["final_score"] for r in rs)
            adapt = statistics.mean(r["adaptation"] for r in rs)
            belief = [r["belief_mae"] for r in rs if r.get("belief_mae") is not None]
            belief_str = f"{statistics.mean(belief):.3f}" if belief else "-"
            print(f"{c:<40} {s:<10} {score:>7.3f} {adapt:>+7.3f} {belief_str:>11}")
            avg_table[(c, s)] = (score, adapt, belief)

    # ---- model action distribution ----
    print()
    model_runs = [r for r in data if r["strategy"] == "model"]
    all_actions = []
    for r in model_runs:
        all_actions.extend(r["actions"])
    if all_actions:
        print(f"=== Trained-model action distribution ({len(all_actions)} actions over {len(model_runs)} episodes) ===")
        counts = Counter(all_actions)
        total = len(all_actions)
        for action, cnt in counts.most_common():
            print(f"  {action:15s} {cnt:4d} ({100*cnt/total:5.1f}%)")
        print(f"Unique actions used: {len(counts)} of 10")

    # ---- belief diversity ----
    beliefs = [tuple(r["final_belief"]) for r in model_runs if r.get("final_belief")]
    if beliefs:
        unique_beliefs = len(set(beliefs))
        print(f"Unique final beliefs: {unique_beliefs} of {len(beliefs)}")

    # ---- training trajectory ----
    log_path = local_dir / "log_history.json"
    if log_path.exists():
        with open(log_path) as f:
            log = json.load(f)
        print()
        print("=== Training trajectory ===")

        def series(*keys):
            # Return (steps, values) for the first key present in the log.
            for k in keys:
                steps, vals = [], []
                for e in log:
                    if k in e:
                        steps.append(e.get("step", len(steps)))
                        vals.append(e[k])
                if vals:
                    return steps, vals
            return [], []

        max_step = max((e.get("step", 0) for e in log), default=0)
        snapshots = sorted(set([1, max_step // 4, max_step // 2, 3 * max_step // 4, max_step]))
        print(f"\n{'metric':<32} " + " ".join(f"step{s:>5}" for s in snapshots))
        print("-" * (32 + len(snapshots) * 11))
        for label, *keys in [
            ("loss", "loss", "train/loss"),
            ("reward (mean)", "reward", "rewards/mean"),
            ("reward_std", "reward_std", "rewards/std"),
            ("frac_zero_std", "frac_reward_zero_std"),
            ("format_valid mean", "rewards/format_valid/mean"),
            ("action_legal mean", "rewards/action_legal/mean"),
            ("env_reward mean", "rewards/env_reward/mean"),
            ("belief_accuracy mean", "rewards/belief_accuracy/mean"),
            ("kl", "kl"),
            ("completion_length", "completions/mean_length", "completion_length"),
        ]:
            steps, vals = series(*keys)
            if not vals:
                continue
            row = []
            for tgt in snapshots:
                if not steps:
                    row.append("-")
                    continue
                ix = min(range(len(steps)), key=lambda i: abs(steps[i] - tgt))
                v = vals[ix]
                row.append(f"{v:+.3f}" if isinstance(v, (int, float)) else str(v)[:8])
            print(f"{label:<32} " + " ".join(f"{r:>10}" for r in row))

    # ---- success criteria ----
    print()
    print("=" * 80)
    print("VERDICT")
    print("=" * 80)
    targets = {
        ("continuous-in-distribution", "model", "score"): 0.587,  # heuristic in-dist
        ("continuous-OOD (generalization)", "model", "score"): 0.580,  # heuristic OOD
    }
    for (cond, strat, metric), target in targets.items():
        if (cond, strat) in avg_table:
            score, _, _ = avg_table[(cond, strat)]
            symbol = "PASS" if score > target else "FAIL"
            print(f"  [{symbol}] {cond} score: {score:.3f} (need > {target} = heuristic)")

    if all_actions:
        n_unique = len(Counter(all_actions))
        symbol = "PASS" if n_unique >= 5 else "FAIL"
        print(f"  [{symbol}] Action diversity: {n_unique} unique actions (need >= 5)")

    # belief MAE check
    in_dist_belief = avg_table.get(("continuous-in-distribution", "model"))
    if in_dist_belief and in_dist_belief[2]:
        avg_mae = statistics.mean(in_dist_belief[2])
        symbol = "PASS" if avg_mae < 0.30 else "FAIL"
        print(f"  [{symbol}] Belief MAE in-dist: {avg_mae:.3f} (need < 0.30, lower is better)")


if __name__ == "__main__":
    main()