tooling: scripts/analyze_iter.py + docs/results.md template
scripts/analyze_iter.py — one-shot iteration analyzer:
- Pulls trained model + eval JSON + log_history from HF Hub for any iteration
- Prints summary table (3 conditions x 3 strategies)
- Action distribution histogram
- Belief diversity check
- Training trajectory snapshots
- Pass/fail verdict against success criteria
docs/results.md — placeholder for the headline results page. Will be filled
in once iter 3 (or final long run) completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/results.md +96 -0
- scripts/analyze_iter.py +178 -0
docs/results.md
ADDED (+96 lines)

# Training Results

This page summarizes the headline numbers from the final training run.
For the full iteration journey including failed attempts and post-mortems,
see [`iterations.md`](iterations.md).

## Headline result

> **TBD** — populated once iter 3 completes.

Template once we have numbers:

> "Trained agent scored **{X.XXX}** in-distribution and **{X.XXX}** OOD vs the
> heuristic baseline at **{0.587}** / **{0.580}**. Belief inference accuracy
> reached **{X.XX}** (vs neutral baseline 0.50, max 1.00)."

## Final scores by condition

| Condition | Random | Heuristic | Trained Qwen | Δ vs Heuristic |
|---|---|---|---|---|
| discrete-3-profiles (legacy) | 0.554 | 0.584 | TBD | TBD |
| **continuous-in-distribution** | 0.516 | **0.587** | TBD | TBD |
| **continuous-OOD (generalization)** | 0.508 | **0.580** | TBD | TBD |

## Adaptation (mid-episode improvement)

The grader's `adaptation_score` measures whether the agent gets BETTER over
the course of the episode — the direct meta-learning signal.

| Condition | Random | Heuristic | Trained Qwen |
|---|---|---|---|
| continuous-in-distribution | -0.253 | -0.349 | TBD |
| continuous-OOD | -0.281 | -0.030 | TBD |

All baselines are negative (neither random nor heuristic adapts; both apply
the same logic from step 0 to step 27). A trained agent showing POSITIVE
adaptation is direct proof that meta-learning happened.
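The real `adaptation_score` lives in the grader; as a hedged illustration only (the nine-step windows and the function below are assumptions, not the grader's code), a score of this general shape compares late-episode to early-episode reward:

```python
from statistics import mean

def adaptation_score(step_rewards: list[float], window: int = 9) -> float:
    """Sketch of a mid-episode improvement metric (assumed form, NOT the
    grader's actual code): mean reward over the last `window` steps minus
    the mean over the first `window` steps of the episode. Positive means
    the agent got better as the episode progressed; a policy that applies
    the same logic throughout lands at or near zero."""
    early = mean(step_rewards[:window])
    late = mean(step_rewards[-window:])
    return late - early
```

Under this reading, a fixed policy whose per-step reward drifts down over a 28-step episode naturally produces the negative baseline numbers in the table.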
## Belief learning trajectory

Belief MAE over the course of training (lower is better; 0.50 is a neutral
guess, 0.0 is perfect inference):

> See `plots/belief_accuracy.png` in the trained model repo.

Iter 2 (failed run with mode collapse) still reached belief MAE ≈ 0.36 for
in-distribution profiles, showing the belief-learning component of the
pipeline works even when the action policy doesn't. Iter 3 should improve
on this with the belief-first format change.
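Belief MAE itself is just the mean absolute error between the agent's stated belief vector and the hidden profile parameters; a minimal sketch, assuming both are parallel lists of floats in [0, 1] (the function name here is illustrative, not the pipeline's):

```python
def belief_mae(belief: list[float], profile: list[float]) -> float:
    """Mean absolute error between a stated belief vector and the true
    hidden profile, both assumed to be parallel float lists in [0, 1].
    0.0 is perfect inference; 0.50 is the neutral baseline quoted above."""
    if len(belief) != len(profile):
        raise ValueError("belief and profile must have the same length")
    return sum(abs(b - p) for b, p in zip(belief, profile)) / len(belief)
```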
## Action diversity

Iter 1: 99.7% one action (catastrophic collapse).
Iter 2: 55%/45% split between MEDITATE and EXERCISE (2-cycle collapse).
Iter 3: TBD (target: ≥ 5 unique actions per episode).

## Plots

The trained model repo contains 5 plots:

- `plots/training_loss.png` — GRPO loss over training steps
- `plots/reward_curve.png` — mean total reward (with ±1 std band)
- `plots/reward_components.png` — all 4 reward layers overlaid
- `plots/belief_accuracy.png` — the meta-RL signal (rolling mean)
- `plots/baseline_vs_trained.png` — final scores + adaptation across 3 conditions

Available at https://huggingface.co/InosLihka/rhythm-env-meta-trained-{ITER}/tree/main/plots

## Cost

| Iter | Cost | Steps | Outcome |
|---|---|---|---|
| 1 | $0.50 | 200 | Mode collapse (single action) |
| 2 | $1.50 | 400 | Mode collapse (2-cycle) |
| 3 | ~$5 | 800 | TBD |
| Final long run (if iter 3 succeeds) | ~$10 | 2000 | TBD |
| **Total budget used** | **TBD** of $30 | | |

## How to reproduce

```bash
# Train (requires HF Jobs access + token)
hf jobs uv run \
  --flavor a100-large \
  --secrets HF_TOKEN \
  -e MODEL_REPO_SUFFIX=myrun \
  scripts/train_on_hf.py

# Eval the trained model locally
python training/inference_eval.py \
  --model_path outputs/rhythmenv_meta_trained \
  --output_file my_eval_results.json

# Analyze any iteration's results
python scripts/analyze_iter.py myrun
```
scripts/analyze_iter.py
ADDED (+178 lines)

```python
"""
Generic iteration analyzer — pulls a trained model repo from HF Hub
and summarizes results.

Usage:
    python scripts/analyze_iter.py iter3
    python scripts/analyze_iter.py iter3 --model-suffix iter3
"""

import argparse
import json
import os
import statistics
import sys
from collections import Counter
from pathlib import Path


def main():
    p = argparse.ArgumentParser()
    p.add_argument("iter_name", help="Iteration name, e.g. 'iter3'")
    p.add_argument("--model-suffix", default=None,
                   help="Model repo suffix (defaults to iter_name)")
    p.add_argument("--owner", default="InosLihka")
    p.add_argument("--no-download", action="store_true",
                   help="Skip download, use local files only")
    args = p.parse_args()

    suffix = args.model_suffix or args.iter_name
    repo = f"{args.owner}/rhythm-env-meta-trained-{suffix}"
    local_dir = Path(f"./{args.iter_name}_results")
    local_dir.mkdir(exist_ok=True)

    if not args.no_download:
        for fname in ["eval_results.json", "log_history.json", "training_config.json"]:
            cmd = f"hf download {repo} {fname} --repo-type=model --local-dir {local_dir}"
            print(f">>> {cmd}")
            os.system(cmd)

    # ---- training config ----
    cfg_path = local_dir / "training_config.json"
    if cfg_path.exists():
        with open(cfg_path) as f:
            cfg = json.load(f)
        print()
        print("=== Training config ===")
        for k in ("max_steps", "num_episodes", "max_samples", "num_generations",
                  "lora_rank", "beta", "learning_rate", "hint_fraction"):
            print(f"  {k}: {cfg.get(k)}")

    # ---- eval results ----
    eval_path = local_dir / "eval_results.json"
    if not eval_path.exists():
        print(f"FATAL: {eval_path} not found")
        sys.exit(1)
    with open(eval_path) as f:
        data = json.load(f)

    conditions = sorted(set(r["condition"] for r in data))
    strategies = sorted(set(r["strategy"] for r in data))

    print()
    print("=" * 80)
    print("EVAL SUMMARY")
    print("=" * 80)
    print(f"{'Condition':<40} {'Strategy':<10} {'score':>7} {'adapt':>7} {'belief_mae':>11}")
    print("-" * 80)
    avg_table = {}
    for c in conditions:
        for s in strategies:
            rs = [r for r in data if r["condition"] == c and r["strategy"] == s]
            if not rs:
                continue
            score = statistics.mean(r["final_score"] for r in rs)
            adapt = statistics.mean(r["adaptation"] for r in rs)
            belief = [r["belief_mae"] for r in rs if r.get("belief_mae") is not None]
            belief_str = f"{statistics.mean(belief):.3f}" if belief else "-"
            print(f"{c:<40} {s:<10} {score:>7.3f} {adapt:>+7.3f} {belief_str:>11}")
            avg_table[(c, s)] = (score, adapt, belief)

    # ---- model action distribution ----
    print()
    model_runs = [r for r in data if r["strategy"] == "model"]
    all_actions = []
    for r in model_runs:
        all_actions.extend(r["actions"])
    if all_actions:
        print(f"=== Trained-model action distribution ({len(all_actions)} actions over {len(model_runs)} episodes) ===")
        counts = Counter(all_actions)
        total = len(all_actions)
        for action, cnt in counts.most_common():
            print(f"  {action:15s} {cnt:4d} ({100*cnt/total:5.1f}%)")
        print(f"Unique actions used: {len(counts)} of 10")

    # ---- belief diversity ----
    beliefs = [tuple(r["final_belief"]) for r in model_runs if r.get("final_belief")]
    if beliefs:
        unique_beliefs = len(set(beliefs))
        print(f"Unique final beliefs: {unique_beliefs} of {len(beliefs)}")

    # ---- training trajectory ----
    log_path = local_dir / "log_history.json"
    if log_path.exists():
        with open(log_path) as f:
            log = json.load(f)
        print()
        print("=== Training trajectory ===")

        def series(*keys):
            for k in keys:
                steps, vals = [], []
                for e in log:
                    if k in e:
                        steps.append(e.get("step", len(steps)))
                        vals.append(e[k])
                if vals:
                    return steps, vals
            return [], []

        max_step = max((e.get("step", 0) for e in log), default=0)
        snapshots = sorted(set([1, max_step // 4, max_step // 2, 3 * max_step // 4, max_step]))
        print(f"\n{'metric':<32} " + " ".join(f"step{s:>5}" for s in snapshots))
        print("-" * (32 + len(snapshots) * 11))
        for label, *keys in [
            ("loss", "loss", "train/loss"),
            ("reward (mean)", "reward", "rewards/mean"),
            ("reward_std", "reward_std", "rewards/std"),
            ("frac_zero_std", "frac_reward_zero_std"),
            ("format_valid mean", "rewards/format_valid/mean"),
            ("action_legal mean", "rewards/action_legal/mean"),
            ("env_reward mean", "rewards/env_reward/mean"),
            ("belief_accuracy mean", "rewards/belief_accuracy/mean"),
            ("kl", "kl"),
            ("completion_length", "completions/mean_length", "completion_length"),
        ]:
            steps, vals = series(*keys)
            if not vals:
                continue
            row = []
            for tgt in snapshots:
                if not steps:
                    row.append("-")
                    continue
                ix = min(range(len(steps)), key=lambda i: abs(steps[i] - tgt))
                v = vals[ix]
                row.append(f"{v:+.3f}" if isinstance(v, (int, float)) else str(v)[:8])
            print(f"{label:<32} " + " ".join(f"{r:>10}" for r in row))

    # ---- success criteria ----
    print()
    print("=" * 80)
    print("VERDICT")
    print("=" * 80)
    targets = {
        ("continuous-in-distribution", "model", "score"): 0.587,       # heuristic in-dist
        ("continuous-OOD (generalization)", "model", "score"): 0.580,  # heuristic OOD
    }
    for (cond, strat, metric), target in targets.items():
        if (cond, strat) in avg_table:
            score, _, _ = avg_table[(cond, strat)]
            symbol = "PASS" if score > target else "FAIL"
            print(f"  [{symbol}] {cond} score: {score:.3f} (need > {target} = heuristic)")

    if all_actions:
        n_unique = len(Counter(all_actions))
        symbol = "PASS" if n_unique >= 5 else "FAIL"
        print(f"  [{symbol}] Action diversity: {n_unique} unique actions (need >= 5)")

    # belief MAE check
    in_dist_belief = avg_table.get(("continuous-in-distribution", "model"))
    if in_dist_belief and in_dist_belief[2]:
        avg_mae = statistics.mean(in_dist_belief[2])
        symbol = "PASS" if avg_mae < 0.30 else "FAIL"
        print(f"  [{symbol}] Belief MAE in-dist: {avg_mae:.3f} (need < 0.30, lower is better)")


if __name__ == "__main__":
    main()
```
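From the field accesses in the analyzer above, each record in `eval_results.json` is expected to carry at least the keys below. This shape is reverse-engineered from the script, and every value shown is a placeholder, not a real result:

```python
# Illustrative eval_results.json record inferred from the keys
# analyze_iter.py reads; only the key names are grounded in the script,
# all values here are made up.
example_record = {
    "condition": "continuous-in-distribution",
    "strategy": "model",                  # also "random" / "heuristic" rows
    "final_score": 0.0,                   # placeholder value
    "adaptation": 0.0,                    # placeholder value
    "belief_mae": None,                   # may be None/absent for baselines
    "actions": ["MEDITATE", "EXERCISE"],  # one entry per episode step
    "final_belief": [0.5, 0.5, 0.5],      # stated belief at episode end
}

# The analyzer hard-requires these fields on every record:
required = {"condition", "strategy", "final_score", "adaptation", "actions"}
assert required <= set(example_record)
```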