tooling: scripts/analyze_iter.py + docs/results.md template
scripts/analyze_iter.py — one-shot iteration analyzer:
- Pulls trained model + eval JSON + log_history from HF Hub for any iteration
- Prints summary table (3 conditions x 3 strategies)
- Action distribution histogram
- Belief diversity check
- Training trajectory snapshots
- Pass/fail verdict against success criteria
docs/results.md — placeholder for the headline results page. Will be filled
in once iter 3 (or final long run) completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/results.md +96 -0
- scripts/analyze_iter.py +178 -0
docs/results.md
ADDED (+96 lines)

# Training Results

This page summarizes the headline numbers from the final training run.
For the full iteration journey including failed attempts and post-mortems,
see [`iterations.md`](iterations.md).

## Headline result

> **TBD** — populated once iter 3 completes.

Template once we have numbers:

> "Trained agent scored **{X.XXX}** in-distribution and **{X.XXX}** OOD vs the
> heuristic baseline at **{0.587}** / **{0.580}**. Belief inference accuracy
> reached **{X.XX}** (vs neutral baseline 0.50, max 1.00)."

## Final scores by condition

| Condition | Random | Heuristic | Trained Qwen | Δ vs Heuristic |
|---|---|---|---|---|
| discrete-3-profiles (legacy) | 0.554 | 0.584 | TBD | TBD |
| **continuous-in-distribution** | 0.516 | **0.587** | TBD | TBD |
| **continuous-OOD (generalization)** | 0.508 | **0.580** | TBD | TBD |

## Adaptation (mid-episode improvement)

The grader's `adaptation_score` measures whether the agent gets BETTER over
the course of the episode — the direct meta-learning signal.

| Condition | Random | Heuristic | Trained Qwen |
|---|---|---|---|
| continuous-in-distribution | -0.253 | -0.349 | TBD |
| continuous-OOD | -0.281 | -0.030 | TBD |

All baselines are negative (neither random nor heuristic adapts; both apply
the same logic from step 0 to step 27). A trained agent showing POSITIVE
adaptation is direct proof that meta-learning happened.
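The real `adaptation_score` lives in the grader; as a hedged illustration only (the nine-step windows and the function below are assumptions, not the grader's code), a score of this general shape compares late-episode to early-episode reward:

```python
from statistics import mean

def adaptation_score(step_rewards: list[float], window: int = 9) -> float:
    """Sketch of a mid-episode improvement metric (assumed form, NOT the
    grader's actual code): mean reward over the last `window` steps minus
    the mean over the first `window` steps of the episode. Positive means
    the agent got better as the episode progressed; a policy that applies
    the same logic throughout lands at or near zero."""
    early = mean(step_rewards[:window])
    late = mean(step_rewards[-window:])
    return late - early
```

Under this reading, a fixed policy whose per-step reward drifts down over a 28-step episode naturally produces the negative baseline numbers in the table.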
## Belief learning trajectory

Belief MAE over the course of training (lower is better; 0.50 is a neutral
guess, 0.0 is perfect inference):

> See `plots/belief_accuracy.png` in the trained model repo.

Iter 2 (failed run with mode collapse) still reached belief MAE ≈ 0.36 for
in-distribution profiles, showing the belief-learning component of the
pipeline works even when the action policy doesn't. Iter 3 should improve
on this with the belief-first format change.
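Belief MAE itself is just the mean absolute error between the agent's stated belief vector and the hidden profile parameters; a minimal sketch, assuming both are parallel lists of floats in [0, 1] (the function name here is illustrative, not the pipeline's):

```python
def belief_mae(belief: list[float], profile: list[float]) -> float:
    """Mean absolute error between a stated belief vector and the true
    hidden profile, both assumed to be parallel float lists in [0, 1].
    0.0 is perfect inference; 0.50 is the neutral baseline quoted above."""
    if len(belief) != len(profile):
        raise ValueError("belief and profile must have the same length")
    return sum(abs(b - p) for b, p in zip(belief, profile)) / len(belief)
```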
## Action diversity

Iter 1: 99.7% one action (catastrophic collapse).
Iter 2: 55%/45% split between MEDITATE and EXERCISE (2-cycle collapse).
Iter 3: TBD (target: ≥ 5 unique actions per episode).

## Plots

The trained model repo contains 5 plots:

- `plots/training_loss.png` — GRPO loss over training steps
- `plots/reward_curve.png` — mean total reward (with ±1 std band)
- `plots/reward_components.png` — all 4 reward layers overlaid
- `plots/belief_accuracy.png` — the meta-RL signal (rolling mean)
- `plots/baseline_vs_trained.png` — final scores + adaptation across 3 conditions

Available at https://huggingface.co/InosLihka/rhythm-env-meta-trained-{ITER}/tree/main/plots

## Cost

| Iter | Cost | Steps | Outcome |
|---|---|---|---|
| 1 | $0.50 | 200 | Mode collapse (single action) |
| 2 | $1.50 | 400 | Mode collapse (2-cycle) |
| 3 | ~$5 | 800 | TBD |
| Final long run (if iter 3 succeeds) | ~$10 | 2000 | TBD |
| **Total budget used** | **TBD** of $30 | | |

## How to reproduce

```bash
# Train (requires HF Jobs access + token)
hf jobs uv run \
  --flavor a100-large \
  --secrets HF_TOKEN \
  -e MODEL_REPO_SUFFIX=myrun \
  scripts/train_on_hf.py

# Eval the trained model locally
python training/inference_eval.py \
  --model_path outputs/rhythmenv_meta_trained \
  --output_file my_eval_results.json

# Analyze any iteration's results
python scripts/analyze_iter.py myrun
```
scripts/analyze_iter.py
ADDED (+178 lines)

```python
"""
Generic iteration analyzer — pulls a trained model repo from HF Hub
and summarizes results.

Usage:
    python scripts/analyze_iter.py iter3
    python scripts/analyze_iter.py iter3 --model-suffix iter3
"""

import argparse
import json
import os
import statistics
import sys
from collections import Counter
from pathlib import Path


def main():
    p = argparse.ArgumentParser()
    p.add_argument("iter_name", help="Iteration name, e.g. 'iter3'")
    p.add_argument("--model-suffix", default=None,
                   help="Model repo suffix (defaults to iter_name)")
    p.add_argument("--owner", default="InosLihka")
    p.add_argument("--no-download", action="store_true",
                   help="Skip download, use local files only")
    args = p.parse_args()

    suffix = args.model_suffix or args.iter_name
    repo = f"{args.owner}/rhythm-env-meta-trained-{suffix}"
    local_dir = Path(f"./{args.iter_name}_results")
    local_dir.mkdir(exist_ok=True)

    if not args.no_download:
        for fname in ["eval_results.json", "log_history.json", "training_config.json"]:
            cmd = f"hf download {repo} {fname} --repo-type=model --local-dir {local_dir}"
            print(f">>> {cmd}")
            os.system(cmd)

    # ---- training config ----
    cfg_path = local_dir / "training_config.json"
    if cfg_path.exists():
        with open(cfg_path) as f:
            cfg = json.load(f)
        print()
        print("=== Training config ===")
        for k in ("max_steps", "num_episodes", "max_samples", "num_generations",
                  "lora_rank", "beta", "learning_rate", "hint_fraction"):
            print(f"  {k}: {cfg.get(k)}")

    # ---- eval results ----
    eval_path = local_dir / "eval_results.json"
    if not eval_path.exists():
        print(f"FATAL: {eval_path} not found")
        sys.exit(1)
    with open(eval_path) as f:
        data = json.load(f)

    conditions = sorted(set(r["condition"] for r in data))
    strategies = sorted(set(r["strategy"] for r in data))

    print()
    print("=" * 80)
    print("EVAL SUMMARY")
    print("=" * 80)
    print(f"{'Condition':<40} {'Strategy':<10} {'score':>7} {'adapt':>7} {'belief_mae':>11}")
    print("-" * 80)
    avg_table = {}
    for c in conditions:
        for s in strategies:
            rs = [r for r in data if r["condition"] == c and r["strategy"] == s]
            if not rs:
                continue
            score = statistics.mean(r["final_score"] for r in rs)
            adapt = statistics.mean(r["adaptation"] for r in rs)
            belief = [r["belief_mae"] for r in rs if r.get("belief_mae") is not None]
            belief_str = f"{statistics.mean(belief):.3f}" if belief else "-"
            print(f"{c:<40} {s:<10} {score:>7.3f} {adapt:>+7.3f} {belief_str:>11}")
            avg_table[(c, s)] = (score, adapt, belief)

    # ---- model action distribution ----
    print()
    model_runs = [r for r in data if r["strategy"] == "model"]
    all_actions = []
    for r in model_runs:
        all_actions.extend(r["actions"])
    if all_actions:
        print(f"=== Trained-model action distribution ({len(all_actions)} actions over {len(model_runs)} episodes) ===")
        counts = Counter(all_actions)
        total = len(all_actions)
        for action, cnt in counts.most_common():
            print(f"  {action:15s} {cnt:4d} ({100*cnt/total:5.1f}%)")
        print(f"Unique actions used: {len(counts)} of 10")

    # ---- belief diversity ----
    beliefs = [tuple(r["final_belief"]) for r in model_runs if r.get("final_belief")]
    if beliefs:
        unique_beliefs = len(set(beliefs))
        print(f"Unique final beliefs: {unique_beliefs} of {len(beliefs)}")

    # ---- training trajectory ----
    log_path = local_dir / "log_history.json"
    if log_path.exists():
        with open(log_path) as f:
            log = json.load(f)
        print()
        print("=== Training trajectory ===")

        def series(*keys):
            for k in keys:
                steps, vals = [], []
                for e in log:
                    if k in e:
                        steps.append(e.get("step", len(steps)))
                        vals.append(e[k])
                if vals:
                    return steps, vals
            return [], []

        max_step = max((e.get("step", 0) for e in log), default=0)
        snapshots = sorted(set([1, max_step // 4, max_step // 2, 3 * max_step // 4, max_step]))
        print(f"\n{'metric':<32} " + " ".join(f"step{s:>5}" for s in snapshots))
        print("-" * (32 + len(snapshots) * 11))
        for label, *keys in [
            ("loss", "loss", "train/loss"),
            ("reward (mean)", "reward", "rewards/mean"),
            ("reward_std", "reward_std", "rewards/std"),
            ("frac_zero_std", "frac_reward_zero_std"),
            ("format_valid mean", "rewards/format_valid/mean"),
            ("action_legal mean", "rewards/action_legal/mean"),
            ("env_reward mean", "rewards/env_reward/mean"),
            ("belief_accuracy mean", "rewards/belief_accuracy/mean"),
            ("kl", "kl"),
            ("completion_length", "completions/mean_length", "completion_length"),
        ]:
            steps, vals = series(*keys)
            if not vals:
                continue
            row = []
            for tgt in snapshots:
                if not steps:
                    row.append("-")
                    continue
                ix = min(range(len(steps)), key=lambda i: abs(steps[i] - tgt))
                v = vals[ix]
                row.append(f"{v:+.3f}" if isinstance(v, (int, float)) else str(v)[:8])
            print(f"{label:<32} " + " ".join(f"{r:>10}" for r in row))

    # ---- success criteria ----
    print()
    print("=" * 80)
    print("VERDICT")
    print("=" * 80)
    targets = {
        ("continuous-in-distribution", "model", "score"): 0.587,       # heuristic in-dist
        ("continuous-OOD (generalization)", "model", "score"): 0.580,  # heuristic OOD
    }
    for (cond, strat, metric), target in targets.items():
        if (cond, strat) in avg_table:
            score, _, _ = avg_table[(cond, strat)]
            symbol = "PASS" if score > target else "FAIL"
            print(f"  [{symbol}] {cond} score: {score:.3f} (need > {target} = heuristic)")

    if all_actions:
        n_unique = len(Counter(all_actions))
        symbol = "PASS" if n_unique >= 5 else "FAIL"
        print(f"  [{symbol}] Action diversity: {n_unique} unique actions (need >= 5)")

    # belief MAE check
    in_dist_belief = avg_table.get(("continuous-in-distribution", "model"))
    if in_dist_belief and in_dist_belief[2]:
        avg_mae = statistics.mean(in_dist_belief[2])
        symbol = "PASS" if avg_mae < 0.30 else "FAIL"
        print(f"  [{symbol}] Belief MAE in-dist: {avg_mae:.3f} (need < 0.30, lower is better)")


if __name__ == "__main__":
    main()
```
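From the field accesses in the analyzer above, each record in `eval_results.json` is expected to carry at least the keys below. This shape is reverse-engineered from the script, and every value shown is a placeholder, not a real result:

```python
# Illustrative eval_results.json record inferred from the keys
# analyze_iter.py reads; only the key names are grounded in the script,
# all values here are made up.
example_record = {
    "condition": "continuous-in-distribution",
    "strategy": "model",                  # also "random" / "heuristic" rows
    "final_score": 0.0,                   # placeholder value
    "adaptation": 0.0,                    # placeholder value
    "belief_mae": None,                   # may be None/absent for baselines
    "actions": ["MEDITATE", "EXERCISE"],  # one entry per episode step
    "final_belief": [0.5, 0.5, 0.5],      # stated belief at episode end
}

# The analyzer hard-requires these fields on every record:
required = {"condition", "strategy", "final_score", "adaptation", "actions"}
assert required <= set(example_record)
```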