InosLihka Claude Opus 4.7 (1M context) committed on
Commit 7bb9278 · 1 Parent(s): 63216a8

docs: handoff bundle for new chat session + iter 4 partial analysis


- docs/handoff_for_next_session.md: comprehensive context dump for resuming
in a new Claude session. Covers project state, all 5 iterations, both
rounds of fixes (round 1 = 7 fixes for iter 2, round 2 = 7 fixes for
iter 4+), what's broken, what's not, pending iter 6+ candidates, key
files, hardware notes.
- docs/logdump.txt: HF Jobs UI export from iter 4 cancelled run (235 steps,
3272 lines, the only data we have for the round-2 fixes in action).
- docs/iter4_partial_analysis.txt: parsed trajectory + trends + final-step
metrics. Shows belief_accuracy FLAT at -0.10 — the unsolved core issue.
- scripts/analyze_logdump.py: parser used to produce the analysis.

Honest meta-note: this project was hit-and-trial. Each iter exposed bugs
we should have caught upfront. Future iterations should compute reward
variance across IDENTICAL completions before training, analyze belief
target distribution analytically, and run 50-step micro-smoke-trains
before committing to 200+ step runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/handoff_for_next_session.md ADDED
@@ -0,0 +1,188 @@
# Handoff for new chat session

Paste the contents of this file (or just point the new chat at it) to bring
a fresh Claude session up to speed.

---

## Project + vision

**Meta OpenEnv Hackathon submission** — `RhythmEnv`, a meta-RL environment
where an LLM agent learns the *skill of inferring a user's hidden personality*
from observation alone. The POC trains in simulation; production replaces the
synthetic meters with real wearable signals (HRV, calendar, accept/ignore taps).

**The aim**: the agent should require ~no explicit input from users, working
purely from passive sensor data plus tap responses. SensorLM (Google, 2025) is
the proven input layer for production.

## Where the code lives

- Local: `c:/Users/guptapri/Downloads/Akhil/Repos/hackathon/rhythm_env/`
- HF Space (deployed): `https://huggingface.co/spaces/InosLihka/rhythm_env`
- Git remote: `hf` → that HF Space (main branch). Local working branch: `round2`.
- HF token: `~/.cache/huggingface/token`

## Current state — iteration 5 running

| iter | Hardware | Config | Result |
|---|---|---|---|
| 1 | a100 | 200 steps, LoRA 4, num_gen 4 | Mode collapse → `EXERCISE 5 5 5` |
| 2 | a100 | 400 steps + 7 fixes (temp 1.5, weights, action_legal=0, repetition penalty, late_quality math, hint=0, seed-mix) | Mode collapse → MEDITATE-EXERCISE 2-cycle |
| 3 | n/a | 800 steps + 7 fixes | Cancelled before run (stale code) |
| 4 (a100/l40s/h200 attempts) | various | various | Capacity-cancelled or H200/Unsloth incompat |
| **4 (a10g)** | a10g-large | LoRA 16, num_gen 8, 800 steps + further 7 fixes from external bug review | **CANCELLED at step 235 by mistake**, based on stale API logs; the UI showed it was healthy. ~$2.10 wasted. |
| **5 (a10g)** | a10g-large | LoRA 8, num_gen 4, 500 steps + same 7 fixes as iter 4 | **RUNNING** — job `69eda027d70108f37acdf9a7` |

**Spend: ~$5.60 of $30 budget.**

## The two rounds of fixes applied

### Round 1 (after iter 1 collapse, applied for iter 2): the 7 hyperparameter/reward fixes
1. `temperature` 1.0 → 1.5 (force diverse rollouts)
2. `reward_weights` `[0.3, 0.3, 1.0, 1.0]` → `[0.05, 0.05, 1.5, 3.0]` (suppress saturated layers)
3. `action_legal` returns 0 (was +0.5) for valid actions — drops the constant-reward layer
4. Explicit repetition penalty in `env_reward` (-0.3 when the same action appears 3+ times in a row)
5. `_grade_episode` `late_quality` normalization fix ([-3, +3], not [-1, +1])
6. `hint_fraction` 0.15 → 0.0 (eliminates the train-eval distribution mismatch)
7. `env_reward` seed-fallback hardening (`(i*17)^0xBEEF` mix to break clusters)

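Fix 4 (the repetition penalty) is the only one of these with branching logic. A minimal sketch of the idea, assuming a plain action-history list; the names are illustrative, not the actual `env_reward` internals:

```python
def repetition_penalty(action_history: list[str], penalty: float = -0.3) -> float:
    """Round-1 fix 4, sketched: penalize the latest action when it makes
    three or more identical actions in a row; otherwise no penalty."""
    if len(action_history) >= 3 and len(set(action_history[-3:])) == 1:
        return penalty
    return 0.0
```

A rollout ending `EXERCISE, EXERCISE, EXERCISE` is taxed on every further repeat. Note that iter 2's MEDITATE-EXERCISE 2-cycle evades a 3-in-a-row check like this one, which is consistent with the collapse recurring.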
### Round 2 (after iter 2 collapse + external bug review, applied for iter 4 onward): 7 deeper fixes
1. **Anomalies surfaced in the prompt** (StepRecord + format_observation_prompt + inference.py) — previously computed but never visible to the agent
2. **Belief baseline subtraction** (`belief_accuracy`): reward = similarity − constant_baseline_similarity, so a constant `5 5 5` no longer earns +1/step of free reward
3. **Profile weight cap 0.80 → 0.45** (`sample_profile`) — forces multi-meter profiles
4. **Scaled-down shaping**: -0.10/-0.15/+0.07 (was -0.30/-0.40/+0.20)
5. **Step-0 belief reward = 0** (no information to commit on yet)
6. **Belief-action coupling reward**: ±0.15 when the action matches/contradicts the emitted belief
7. **`grader_bias` moved out of `_compute_reward` into `env_reward`** — keeps the env's per-step reward pure for the inference signal

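Fix 2 can be sketched as follows. Cosine similarity and the names here are assumptions for illustration; the real metric lives in `belief_accuracy` in `training/reward_functions.py`:

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def belief_reward(belief: list[float], target: list[float]) -> float:
    """Round-2 fix 2, sketched: score a belief only for how much it beats
    the constant mid-scale guess (`5 5 5`, i.e. 0.5 per normalized meter),
    so that guess earns ~0 instead of +1 per step."""
    baseline = [0.5] * len(target)
    return cosine_sim(belief, target) - cosine_sim(baseline, target)
```

Under this scheme `belief_reward([0.5, 0.5, 0.5], target)` is 0 for every target; only beliefs closer to the target than the constant guess score positive.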
## What iter 4 partial data (235/800 steps) tells us

Logs at `docs/logdump.txt`. Analysis at `docs/iter4_partial_analysis.txt`.

**Working:**
- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10
- No catastrophic mode collapse

**Still broken — the unsolved core:**
- `belief_accuracy/mean` flat at **-0.10** throughout all 235 steps
- Linear slope: +0.0007 per 100 steps (essentially zero)
- The agent emits beliefs SLIGHTLY WORSE than the constant baseline

**Root-cause hypothesis** (from the analysis):

The profile cap (0.80 → 0.45) compressed the belief target distribution.
With balanced profiles, sampled belief vectors land near `[0.5, 0.5, 0.5]`, so
the constant `5 5 5` baseline already has high similarity. Real learning has
a tiny ceiling. **The two fixes (profile cap + baseline subtraction) interact
negatively.**

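The hypothesis is cheap to sanity-check numerically before another paid run. A sketch under stated assumptions: the rejection sampler below is a toy stand-in for `sample_profile`, not its actual distribution:

```python
import math
import random

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def sample_capped_profile(cap: float, dims: int = 3) -> list[float]:
    """Toy stand-in for sample_profile: normalized random weights,
    rejection-sampled until no single meter exceeds the cap."""
    while True:
        w = [random.random() for _ in range(dims)]
        s = sum(w)
        w = [x / s for x in w]
        if max(w) <= cap:
            return w

def baseline_similarity(cap: float, n: int = 2000) -> float:
    """Mean similarity the constant [0.5, 0.5, 0.5] guess scores
    against n profiles sampled under the given weight cap."""
    random.seed(0)  # deterministic for comparison
    sims = [cosine_sim([0.5] * 3, sample_capped_profile(cap)) for _ in range(n)]
    return sum(sims) / n

for cap in (0.80, 0.45):
    print(f"cap={cap}: constant-baseline mean similarity = {baseline_similarity(cap):.3f}")
```

If the cap-0.45 mean similarity comes out well above the cap-0.80 value, the constant guess really is near-optimal and the belief layer has almost no headroom, consistent with the flat -0.10.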
## Pending fixes NOT yet attempted (priority order for iter 6+)

1. **HIGHEST PRIORITY: revert the profile cap to 0.80** — restores belief-target
   spread; the new `grader_bias` term independently handles the original "spam
   recovery actions" exploit. Single-line fix in `sample_profile`.

2. **Multi-step plan generation** — deferred from the external agent's analysis.
   A completion becomes a 3-action plan instead of 1 action; `env_reward` replays
   the plan cumulatively. Addresses the structural per-step-GRPO vs per-episode-
   grader mismatch (`adaptation_score` is 30% of the grade but only ~3% of
   training rows see it, via the terminal bonus). This is the highest-leverage
   missing fix.

3. **Captioned anomaly history** — describe per-meter anomalies in language
   ("vitality dropped 5% MORE than baseline") instead of `[anom V-0.05]`.
   Don't bake in conclusions ("this person is introverted") — just describe.

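The replay idea in pending fix 2 is small. A sketch assuming a gym-style `env.step(action) -> (obs, reward, done)` interface, which is hypothetical here (the real plumbing would go through `env_reward`):

```python
def replay_plan_reward(env, plan: list[str]) -> float:
    """Score a whole multi-action plan by replaying it step by step and
    summing rewards, so each GRPO group compares episode-level
    consequences instead of single-step ones."""
    total = 0.0
    for action in plan:
        _obs, reward, done = env.step(action)  # hypothetical gym-style step()
        total += reward
        if done:
            break
    return total
```

Each completion would then be parsed into a 3-action plan and scored with this replay instead of a single per-step `env_reward` call.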
100
+
101
+ The user (Akhil) explicitly called this out 2026-04-26: "I don't want to do
102
+ hit and trial." Each iter exposed a bug we should have caught upfront.
103
+
104
+ Things that would have prevented the wasted iters:
105
+ - **Compute reward variance across IDENTICAL completions** before training.
106
+ GRPO advantage = reward − group_mean; if all 4 completions get the same
107
+ reward from a layer, that layer contributes ZERO gradient. `pipeline_dryrun`
108
+ tested DIFFERENT actions per kind, missing this.
109
+ - **Analytically check belief target distribution** under continuous profiles
110
+ before tightening the cap. Would have caught the iter 4 issue.
111
+ - **Run a 50-step micro-smoke-train ($0.10)** on the smallest possible config
112
+ before committing to 200+ step runs.
113
+ - **Address structural issues (multi-step plans) before tweaking hyperparams.**
114
+
115
+ ## Key files for new session
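The first bullet is a cheap preflight function. A sketch (the per-layer reward dict is an assumed interface; the actual layers live in `training/reward_functions.py`):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's per-completion advantage: reward minus the group mean."""
    g = mean(rewards)
    return [r - g for r in rewards]

def dead_reward_layers(layer_rewards: dict[str, list[float]]) -> list[str]:
    """Given each layer's rewards for one group of IDENTICAL completions,
    flag layers with zero variance: their advantages are all zero, so they
    contribute no gradient and should be fixed before a paid run."""
    return [name for name, rs in layer_rewards.items() if pstdev(rs) == 0.0]
```

A constant +0.5 `action_legal` layer (the round-1 bug) shows up immediately: `dead_reward_layers({"action_legal": [0.5] * 4, "env_reward": [0.1, -0.3, 0.4, 0.0]})` returns `["action_legal"]`.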
116
+
117
+ | File | What it has |
118
+ |---|---|
119
+ | `docs/architecture.md` | 10 ASCII diagrams with concrete values, full pipeline |
120
+ | `docs/iterations.md` | Iter 0-3 journey doc (NEEDS update for iter 4+5) |
121
+ | `docs/logdump.txt` | Iter 4 raw UI logs through step 235 (3272 lines) |
122
+ | `docs/iter4_partial_analysis.txt` | Parsed iter 4 trajectory snapshots + trends |
123
+ | `docs/results.md` | Template for headline results (NOT yet filled) |
124
+ | `docs/blog_post.md` | Strong narrative, pre-meta-RL refactor |
125
+ | `docs/references/judging_criteria.md` | Hackathon scoring (40% innovation, 30% storytelling, 20% improvement, 10% reward pipeline) |
126
+ | `training/WhatMakesAGoodSubmission.md` | Hackathon criteria, mentions OpenEnv Rubric system (we don't use it — gap) |
127
+ | `server/rhythm_environment.py` | Env, sample_profile, _compute_reward, _grade_episode |
128
+ | `training/reward_functions.py` | 4-layer reward stack, parser, belief_accuracy |
129
+ | `training/dataset.py` | Prompt builder, episode generator |
130
+ | `training/train.py` | GRPOConfig setup, reward_weights wiring |
131
+ | `scripts/train_on_hf.py` | HF Jobs orchestrator |
132
+ | `scripts/analyze_logdump.py` | Parse iter 4 UI export |
133
+ | `scripts/analyze_iter.py` | Pull + analyze any iter's HF Hub repo |
134
+ | `eval_baselines_meta.json` | Heuristic in-dist 0.587, OOD 0.580 — bars to beat |
135
+
## Open monitor + job state at handoff

- **Active monitor**: task `b10lozxd8` watching iter 5
- **Iter 5 job**: `69eda027d70108f37acdf9a7` on a10g-large, RUNNING
- **HF Job UI**: `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
- **Iter 5 model repo (when complete)**: `https://huggingface.co/InosLihka/rhythm-env-meta-trained-iter5`

## What to do in the new session

**If iter 5 results have landed:**
1. Check `docs/logdump.txt` for the new logs OR pull from `InosLihka/rhythm-env-meta-trained-iter5`
2. Run `python scripts/analyze_iter.py iter5`
3. Decide: ship it / iter 6 with the profile-cap revert / iter 6 with multi-step plans

**If iter 5 is still running:**
1. Check status via `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
2. Trust the UI, NOT the API (its lag is severe)

**If you're starting fresh after iter 5 completed:**
1. Read this file
2. Check the iter 5 model repo
3. Decide the path forward based on the belief_accuracy outcome

## Commands you'll need

```bash
# HF Jobs status
hf jobs ps
hf jobs inspect <id>
hf jobs logs <id>   # use the UI instead — the API lags

# Submit a new training run
cd c:/Users/guptapri/Downloads/Akhil/Repos/hackathon/rhythm_env
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
  -e FAST_MODE=1 \
  -e MODEL_REPO_SUFFIX=iter6 \
  -e LORA_RANK=8 -e NUM_GENERATIONS=4 -e MAX_STEPS=500 \
  -d scripts/train_on_hf.py

# Pull and analyze a finished iter
python scripts/analyze_iter.py iter5

# Tests still pass:
python -m pytest tests/ -q
```

## Hardware notes (learned the hard way)

- **a100-large**: best perf, but capacity-limited at peak hours
- **a10g-large**: reliable, ~30% slower, well-tested with Unsloth
- **l40sx1**: also capacity-limited
- **h200**: Unsloth doesn't detect the GPU (sm_90 incompat) — DO NOT USE
- **HF Jobs API `/logs` endpoint lags severely** — always cross-check via the live UI
docs/iter4_partial_analysis.txt ADDED
@@ -0,0 +1,41 @@
Parsed 241 metric rows

metric             step~    0 step~   30 step~   60 step~  120 step~  180 step~  240
------------------------------------------------------------------------------------
loss                   -0.000     +0.007     +0.056     +0.081     +0.058     +0.036
reward                 -3.189     -3.726     -0.771     -0.125     +0.928     +0.937
reward_std             +2.168     +2.272     +1.452     +0.691     +0.099     +0.193
frac_zero_std          +0.000     +0.000     +0.000     +0.250     +0.250     +0.000
format_valid           -1.125     -1.375     +0.375     +0.344     +0.672     +0.688
action_legal           -0.656     -0.750     -0.125     -0.062     +0.000     +0.000
env_reward             -1.875     -2.213     -0.348     +0.107     +0.759     +0.812
belief_accuracy        -0.096     -0.100     -0.087     -0.100     -0.081     -0.105
kl                     -0.000     +0.171     +1.398     +2.033     +1.445     +0.903
compl_length          +32.000    +32.000    +32.000    +32.000    +32.000    +32.000
grad_norm             +36.083    +59.430     +5.959     +1.448     +1.559     +9.748

=== Linear trend (slope) over the run - units per 100 steps ===
  loss               slope/100steps=+0.0163 first20-mean=+0.005 last20-mean=+0.040 delta=+0.035 [UP]
  reward             slope/100steps=+1.7239 first20-mean=-3.400 last20-mean=+0.390 delta=+3.791 [UP]
  reward_std         slope/100steps=-0.7742 first20-mean=+1.554 last20-mean=+0.334 delta=-1.220 [DOWN]
  frac_zero_std      slope/100steps=+0.2577 first20-mean=+0.163 last20-mean=+0.388 delta=+0.225 [UP]
  format_valid       slope/100steps=+0.6736 first20-mean=-1.195 last20-mean=+0.438 delta=+1.633 [UP]
  action_legal       slope/100steps=+0.2732 first20-mean=-0.702 last20-mean=-0.056 delta=+0.645 [UP]
  env_reward         slope/100steps=+1.1163 first20-mean=-2.013 last20-mean=+0.438 delta=+2.451 [UP]
  belief_accuracy    slope/100steps=+0.0007 first20-mean=-0.095 last20-mean=-0.095 delta=+0.000 [FLAT]
  kl                 slope/100steps=+0.4085 first20-mean=+0.123 last20-mean=+0.995 delta=+0.872 [UP]
  compl_length       slope/100steps=-0.0297 first20-mean=+32.000 last20-mean=+32.000 delta=+0.000 [FLAT]
  grad_norm          slope/100steps=-474.6836 first20-mean=+2388.734 last20-mean=+11.383 delta=-2377.351 [DOWN]

=== Mode-collapse warning signs ===
Last-50 mean frac_reward_zero_std: 0.56 (1.0 = full collapse)
Traceback (most recent call last):
  File "C:\Users\guptapri\Downloads\Akhil\Repos\hackathon\rhythm_env\scripts\analyze_logdump.py", line 124, in <module>
    main()
  File "C:\Users\guptapri\Downloads\Akhil\Repos\hackathon\rhythm_env\scripts\analyze_logdump.py", line 105, in main
    print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (\u22650.3 = healthy variance)")
  File "C:\Users\guptapri\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 43: character maps to <undefined>
---
docs/logdump.txt ADDED
The diff for this file is too large to render. See raw diff
 
scripts/analyze_logdump.py ADDED
@@ -0,0 +1,124 @@
"""Parse logdump.txt (HF Jobs UI export) and produce a training trajectory analysis."""

import statistics
from pathlib import Path

LOG_PATH = Path("docs/logdump.txt")


def parse_dict_line(line: str) -> dict | None:
    """Parse one trainer metric line into a dict, or return None.

    HF Jobs prints each metrics row as a Python dict repr (single quotes),
    so json.loads won't work; eval with empty builtins handles the literal.
    """
    line = line.strip()
    if not line.startswith("{") or "'loss'" not in line:
        return None
    try:
        py_dict = eval(line, {"__builtins__": {}}, {})
        if isinstance(py_dict, dict):
            return py_dict
    except Exception:
        return None
    return None


def main():
    with open(LOG_PATH, encoding="utf-8") as f:
        lines = f.readlines()

    rows = []
    for ln in lines:
        d = parse_dict_line(ln)
        if d is not None:
            rows.append(d)

    print(f"Parsed {len(rows)} metric rows")
    if not rows:
        return

    # Snapshots at fixed percentiles of the run
    n = len(rows)
    snaps = sorted(set([0, n // 8, n // 4, n // 2, 3 * n // 4, n - 1]))

    metrics = [
        ("loss", lambda r: r.get("loss")),
        ("reward", lambda r: r.get("reward")),
        ("reward_std", lambda r: r.get("reward_std")),
        ("frac_zero_std", lambda r: r.get("frac_reward_zero_std")),
        ("format_valid", lambda r: r.get("rewards/format_valid/mean")),
        ("action_legal", lambda r: r.get("rewards/action_legal/mean")),
        ("env_reward", lambda r: r.get("rewards/env_reward/mean")),
        ("belief_accuracy", lambda r: r.get("rewards/belief_accuracy/mean")),
        ("kl", lambda r: r.get("kl")),
        ("compl_length", lambda r: r.get("completion_length")),
        ("grad_norm", lambda r: r.get("grad_norm")),
    ]

    print()
    header = f"{'metric':<18} " + " ".join(f"step~{s:>5}" for s in snaps)
    print(header)
    print("-" * len(header))
    for label, getter in metrics:
        vals = []
        for s in snaps:
            v = getter(rows[s])
            vals.append(f"{v:+.3f}" if isinstance(v, (int, float)) else "-")
        print(f"{label:<18} " + " ".join(f"{v:>10}" for v in vals))

    # Linear-fit slope per metric (eyeball the trend direction)
    print()
    print("=== Linear trend (slope) over the run - units per 100 steps ===")
    xs = list(range(n))
    for label, getter in metrics:
        ys = [getter(r) for r in rows]
        ys_clean = [y for y in ys if isinstance(y, (int, float))]
        xs_clean = [x for x, y in zip(xs, ys) if isinstance(y, (int, float))]
        if len(ys_clean) < 5:
            continue
        # Simple least-squares slope, scaled to units per 100 steps
        x_mean = statistics.mean(xs_clean)
        y_mean = statistics.mean(ys_clean)
        num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs_clean, ys_clean))
        den = sum((x - x_mean) ** 2 for x in xs_clean)
        slope = (num / den) * 100 if den > 0 else 0
        # Mean of last 20 vs mean of first 20 (more robust than the raw slope)
        first_mean = statistics.mean(ys_clean[:20]) if len(ys_clean) >= 20 else float("nan")
        last_mean = statistics.mean(ys_clean[-20:]) if len(ys_clean) >= 20 else float("nan")
        delta = last_mean - first_mean
        direction = "UP" if delta > 0.01 else ("DOWN" if delta < -0.01 else "FLAT")
        print(f"  {label:<18} slope/100steps={slope:+.4f} first20-mean={first_mean:+.3f} last20-mean={last_mean:+.3f} delta={delta:+.3f} [{direction}]")

    # Mode-collapse warning signs over the tail of the run
    print()
    print("=== Mode-collapse warning signs ===")
    last_50 = rows[-50:] if len(rows) >= 50 else rows
    avg_zero_std = statistics.mean(r.get("frac_reward_zero_std", 0) for r in last_50 if isinstance(r.get("frac_reward_zero_std"), (int, float)))
    avg_reward_std = statistics.mean(r.get("reward_std", 0) for r in last_50 if isinstance(r.get("reward_std"), (int, float)))
    print(f"Last-50 mean frac_reward_zero_std: {avg_zero_std:.2f} (1.0 = full collapse)")
    # ASCII ">=", not U+2265: the Unicode char crashes cp1252 consoles on Windows
    # (this is the UnicodeEncodeError captured in docs/iter4_partial_analysis.txt)
    print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (>=0.3 = healthy variance)")

    print()
    print(f"Final step in log: {n} (iter 4 was canceled at ~step 235)")
    last = rows[-1]
    print()
    print("=== Final-step metrics ===")
    for k in [
        "loss", "reward", "reward_std", "frac_reward_zero_std",
        "rewards/format_valid/mean", "rewards/action_legal/mean",
        "rewards/env_reward/mean", "rewards/belief_accuracy/mean",
        "kl", "completion_length", "grad_norm",
    ]:
        v = last.get(k)
        print(f"  {k:<35} {v}")


if __name__ == "__main__":
    main()