docs: handoff bundle for new chat session + iter 4 partial analysis
- docs/handoff_for_next_session.md: comprehensive context dump for resuming
in a new Claude session. Covers project state, all 5 iterations, both
rounds of fixes (round 1 = 7 fixes for iter 2, round 2 = 7 fixes for
iter 4+), what's broken, what's not, pending iter 6+ candidates, key
files, hardware notes.
- docs/logdump.txt: HF Jobs UI export from iter 4 cancelled run (235 steps,
3272 lines, the only data we have for the round-2 fixes in action).
- docs/iter4_partial_analysis.txt: parsed trajectory + trends + final-step
metrics. Shows belief_accuracy FLAT at -0.10 — the unsolved core issue.
- scripts/analyze_logdump.py: parser used to produce the analysis.
Honest meta-note: this project was hit-and-trial. Each iter exposed bugs
we should have caught upfront. Future iterations should compute reward
variance across IDENTICAL completions before training, analyze belief
target distribution analytically, and run 50-step micro-smoke-trains
before committing to 200+ step runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/handoff_for_next_session.md +188 -0
- docs/iter4_partial_analysis.txt +41 -0
- docs/logdump.txt +0 -0
- scripts/analyze_logdump.py +124 -0
docs/handoff_for_next_session.md @@ -0,0 +1,188 @@
# Handoff for new chat session

Paste the contents of this file (or just point the new chat at it) to bring
a fresh Claude session up to speed.

---

## Project + vision

**Meta OpenEnv Hackathon submission** — `RhythmEnv`, a meta-RL environment
where an LLM agent learns the *skill of inferring a user's hidden personality*
from observation alone. The POC trains in simulation; production replaces the
synthetic meters with real wearable signals (HRV, calendar, accept/ignore taps).

**The aim**: the agent should require ~no explicit input from users — it works
purely from passive sensor data + tap responses. SensorLM (Google, 2025) is
the proven input layer for production.

## Where the code lives

- Local: `c:/Users/guptapri/Downloads/Akhil/Repos/hackathon/rhythm_env/`
- HF Space (deployed): `https://huggingface.co/spaces/InosLihka/rhythm_env`
- Git remote: `hf` → that HF Space (main branch). Local working branch: `round2`.
- HF token: `~/.cache/huggingface/token`

## Current state — iteration 5 running

| iter | Hardware | Config | Result |
|---|---|---|---|
| 1 | a100 | 200 steps, LoRA 4, num_gen 4 | Mode collapse → `EXERCISE 5 5 5` |
| 2 | a100 | 400 steps + 7 fixes (temp 1.5, weights, action_legal=0, repetition penalty, late_quality math, hint=0, seed-mix) | Mode collapse → MEDITATE-EXERCISE 2-cycle |
| 3 | n/a | 800 steps + 7 fixes | Cancelled before run (stale code) |
| 4 (a100/l40s/h200 attempts) | various | various | Capacity-cancelled or H200/Unsloth incompat |
| **4 (a10g)** | a10g-large | LoRA 16, num_gen 8, 800 steps + further 7 fixes from external bug review | **CANCELLED at step 235 by mistake** based on stale API logs. The UI showed it was healthy. ~$2.10 wasted. |
| **5 (a10g)** | a10g-large | LoRA 8, num_gen 4, 500 steps + same 7 fixes as iter 4 | **RUNNING** — job `69eda027d70108f37acdf9a7` |

**Spend: ~$5.60 of the $30 budget.**

## The two rounds of fixes applied

### Round 1 (after iter 1 collapse, applied for iter 2): the 7 hyperparameter/reward fixes

1. `temperature` 1.0 → 1.5 (force diverse rollouts)
2. `reward_weights` `[0.3, 0.3, 1.0, 1.0]` → `[0.05, 0.05, 1.5, 3.0]` (suppress saturated layers)
3. `action_legal` returns 0 (was +0.5) for a valid action — drop the constant-reward layer
4. Explicit repetition penalty in `env_reward` (-0.3 if the same action appears 3+ times in a row)
5. `_grade_episode` `late_quality` normalization fix (range is [-3, +3], not [-1, +1])
6. `hint_fraction` 0.15 → 0.0 (eliminate the train-eval distribution mismatch)
7. `env_reward` seed-fallback hardening (`(i*17)^0xBEEF` mix to break clusters)

### Round 2 (after iter 2 collapse + external bug review, applied for iter 4 onward): 7 deeper fixes

1. **Anomalies surfaced in the prompt** (StepRecord + format_observation_prompt + inference.py) — previously computed but never visible to the agent
2. **Belief baseline subtraction** (`belief_accuracy`): reward = similarity − constant_baseline_similarity, so a constant `5 5 5` no longer earns +1/step of free reward
3. **Profile weight cap 0.80 → 0.45** (`sample_profile`) — forces multi-meter profiles
4. **Scaled-down shaping**: -0.10/-0.15/+0.07 (was -0.30/-0.40/+0.20)
5. **Step-0 belief reward = 0** (no information to commit on yet)
6. **Belief-action coupling reward**: ±0.15 if the action matches/contradicts the emitted belief
7. **`grader_bias` moved out of `_compute_reward` into `env_reward`** — keeps the env's per-step reward pure for the inference signal

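Fix 2 in sketch form — a minimal illustration of baseline subtraction, assuming a cosine-similarity belief score and a mid-scale constant baseline (the names and the exact similarity measure are illustrative; the real `belief_accuracy` in `training/reward_functions.py` may differ):

```python
import math

def cosine(a, b):
    """Cosine similarity between two belief vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def belief_reward(emitted, target, baseline=(0.5, 0.5, 0.5)):
    """Similarity minus what a constant mid-scale guess would score,
    so emitting the constant no longer earns free positive reward."""
    return cosine(emitted, target) - cosine(baseline, target)

print(belief_reward((0.5, 0.5, 0.5), (0.9, 0.2, 0.4)))      # constant guess -> 0.0
print(belief_reward((0.9, 0.2, 0.4), (0.9, 0.2, 0.4)) > 0)  # informative guess beats baseline
```

Under this scheme a lazy constant belief is reward-neutral; only beliefs that actually track the hidden profile score above zero.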
## What iter 4 partial data (235/800 steps) tells us

Logs at `docs/logdump.txt`. Analysis at `docs/iter4_partial_analysis.txt`.

**Working:**
- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10
- No catastrophic mode collapse

**Still broken — the unsolved core:**
- `belief_accuracy/mean` flat at **-0.10** throughout 235 steps
- Linear slope: +0.0007 per 100 steps (essentially zero)
- The agent emits beliefs SLIGHTLY WORSE than the constant baseline

**Root-cause hypothesis** (from the analysis):

The profile cap (0.80 → 0.45) compressed the belief target distribution.
With balanced profiles, sampled belief vectors land near `[0.5, 0.5, 0.5]`, so
the constant `5 5 5` baseline already has high similarity. Real learning has
a tiny ceiling. **Two fixes (profile cap + baseline subtraction) interact
negatively.**

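The hypothesis is cheap to sanity-check numerically. The sketch below assumes a hypothetical sampler (uniform weights, max clipped at the cap, renormalized) — the real `sample_profile` in `server/rhythm_environment.py` differs in detail, but the direction of the effect is the point:

```python
import math
import random

def sample_profile(cap, k=3, rng=random):
    # Hypothetical stand-in for the real sampler: draw uniform weights,
    # clip each at `cap`, renormalize to a profile vector.
    w = [min(rng.random(), cap) for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mean_baseline_similarity(cap, n=10_000, seed=0):
    """Average similarity of the constant mid-scale guess to sampled targets."""
    rng = random.Random(seed)
    const = [1 / 3] * 3  # the constant "5 5 5" guess, normalized
    return sum(cosine(const, sample_profile(cap, rng=rng)) for _ in range(n)) / n

print(f"cap=0.80: {mean_baseline_similarity(0.80):.3f}")
print(f"cap=0.45: {mean_baseline_similarity(0.45):.3f}")
```

The tighter cap pushes the constant baseline's mean similarity up, so after baseline subtraction the reachable reward above zero shrinks — running a check like this before iter 4 would have flagged the interaction analytically.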
## Pending fixes NOT yet attempted (priority order for iter 6+)

1. **HIGHEST PRIORITY: revert the profile cap to 0.80** — restores belief target
   spread; the new `grader_bias` term handles the original "spam recovery
   actions" exploit independently. Single-line fix in `sample_profile`.

2. **Multi-step plan generation** — deferred from the external agent's analysis.
   A completion becomes a 3-action plan instead of 1 action; `env_reward` replays
   the plan cumulatively. Addresses the structural per-step-GRPO vs
   per-episode-grader mismatch (`adaptation_score` is 30% of the grade but only
   ~3% of training rows see it via the terminal bonus). This is the
   highest-leverage missing fix.

3. **Captioned anomaly history** — describe per-meter anomalies in language
   ("vitality dropped 5% MORE than baseline") instead of `[anom V-0.05]`.
   Don't bake in conclusions ("this person is introverted") — just describe.

## What I would do differently next time (the methodology lesson)

The user (Akhil) explicitly called this out 2026-04-26: "I don't want to do
hit and trial." Each iter exposed a bug we should have caught upfront.

Things that would have prevented the wasted iters:
- **Compute reward variance across IDENTICAL completions** before training.
  GRPO advantage = reward − group_mean; if all 4 completions get the same
  reward from a layer, that layer contributes ZERO gradient. `pipeline_dryrun`
  tested DIFFERENT actions per kind, missing this.
- **Analytically check the belief target distribution** under continuous profiles
  before tightening the cap. This would have caught the iter 4 issue.
- **Run a 50-step micro-smoke-train (~$0.10)** on the smallest possible config
  before committing to 200+ step runs.
- **Address structural issues (multi-step plans) before tweaking hyperparams.**

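The first check above is a few lines to automate. A sketch with toy reward layers (hypothetical stand-ins, not the project's real 4-layer stack in `training/reward_functions.py`): any layer whose reward has zero variance within a GRPO group contributes zero advantage, hence zero gradient.

```python
import statistics

def dead_layers(reward_fns, completions):
    """Names of reward layers whose reward is constant across one GRPO group.
    Advantage = reward - group_mean, so a zero-variance layer contributes
    zero gradient regardless of its weight."""
    dead = []
    for name, fn in reward_fns.items():
        rewards = [fn(c) for c in completions]
        if statistics.pstdev(rewards) == 0.0:
            dead.append(name)
    return dead

# Toy layers — hypothetical, for illustration only.
fns = {
    "constant_bonus": lambda c: 0.5,             # saturated: same reward for everything
    "length_penalty": lambda c: -0.01 * len(c),  # varies with the completion
}
group = ["ACT EXERCISE 5 5 5", "ACT MEDITATE 4 5 6",
         "ACT EXERCISE 5 5 5", "ACT REST 3 3 3"]
print(dead_layers(fns, group))  # ['constant_bonus']
```

Running this on one sampled group per reward layer before launching a paid run would have exposed the saturated `action_legal` layer without any GPU time.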
## Key files for new session

| File | What it has |
|---|---|
| `docs/architecture.md` | 10 ASCII diagrams with concrete values, full pipeline |
| `docs/iterations.md` | Iter 0-3 journey doc (NEEDS update for iters 4 and 5) |
| `docs/logdump.txt` | Iter 4 raw UI logs through step 235 (3272 lines) |
| `docs/iter4_partial_analysis.txt` | Parsed iter 4 trajectory snapshots + trends |
| `docs/results.md` | Template for headline results (NOT yet filled) |
| `docs/blog_post.md` | Strong narrative, pre-meta-RL refactor |
| `docs/references/judging_criteria.md` | Hackathon scoring (40% innovation, 30% storytelling, 20% improvement, 10% reward pipeline) |
| `training/WhatMakesAGoodSubmission.md` | Hackathon criteria; mentions the OpenEnv Rubric system (we don't use it — gap) |
| `server/rhythm_environment.py` | Env, sample_profile, _compute_reward, _grade_episode |
| `training/reward_functions.py` | 4-layer reward stack, parser, belief_accuracy |
| `training/dataset.py` | Prompt builder, episode generator |
| `training/train.py` | GRPOConfig setup, reward_weights wiring |
| `scripts/train_on_hf.py` | HF Jobs orchestrator |
| `scripts/analyze_logdump.py` | Parse the iter 4 UI export |
| `scripts/analyze_iter.py` | Pull + analyze any iter's HF Hub repo |
| `eval_baselines_meta.json` | Heuristic in-dist 0.587, OOD 0.580 — bars to beat |

## Open monitor + job state at handoff

- **Active monitor**: task `b10lozxd8` watching iter 5
- **Iter 5 job**: `69eda027d70108f37acdf9a7` on a10g-large, RUNNING
- **HF Job UI**: `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
- **Iter 5 model repo (when complete)**: `https://huggingface.co/InosLihka/rhythm-env-meta-trained-iter5`

## What to do in new session

**If iter 5 results have landed:**
1. Check `docs/logdump.txt` for the new logs OR pull from `InosLihka/rhythm-env-meta-trained-iter5`
2. Run `python scripts/analyze_iter.py iter5`
3. Decide: ship it / iter 6 with the profile-cap revert / iter 6 with multi-step plans

**If iter 5 is still running:**
1. Check status via `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
2. Trust the UI, NOT the API (the lag is severe)

**If you're starting fresh after iter 5 completed:**
1. Read this file
2. Check the iter 5 model repo
3. Decide the path forward based on the belief_accuracy outcome

## Commands you'll need

```bash
# HF Jobs status
hf jobs ps
hf jobs inspect <id>
hf jobs logs <id>   # use the UI instead — the API lags

# Submit a new training run
cd c:/Users/guptapri/Downloads/Akhil/Repos/hackathon/rhythm_env
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
  -e FAST_MODE=1 \
  -e MODEL_REPO_SUFFIX=iter6 \
  -e LORA_RANK=8 -e NUM_GENERATIONS=4 -e MAX_STEPS=500 \
  -d scripts/train_on_hf.py

# Pull and analyze a finished iter
python scripts/analyze_iter.py iter5

# Tests still pass:
python -m pytest tests/ -q
```

## Hardware notes (learned the hard way)

- **a100-large**: best perf but capacity-limited at peak hours
- **a10g-large**: reliable, ~30% slower, well-tested with Unsloth
- **l40sx1**: also capacity-limited
- **h200**: Unsloth doesn't detect the GPU (sm_90 incompat) — DO NOT USE
- **HF Jobs API `/logs` endpoint lags severely** — always cross-check via the live UI
docs/iter4_partial_analysis.txt @@ -0,0 +1,41 @@
Parsed 241 metric rows

metric             step~    0 step~   30 step~   60 step~  120 step~  180 step~  240
-------------------------------------------------------------------------------------
loss                   -0.000     +0.007     +0.056     +0.081     +0.058     +0.036
reward                 -3.189     -3.726     -0.771     -0.125     +0.928     +0.937
reward_std             +2.168     +2.272     +1.452     +0.691     +0.099     +0.193
frac_zero_std          +0.000     +0.000     +0.000     +0.250     +0.250     +0.000
format_valid           -1.125     -1.375     +0.375     +0.344     +0.672     +0.688
action_legal           -0.656     -0.750     -0.125     -0.062     +0.000     +0.000
env_reward             -1.875     -2.213     -0.348     +0.107     +0.759     +0.812
belief_accuracy        -0.096     -0.100     -0.087     -0.100     -0.081     -0.105
kl                     -0.000     +0.171     +1.398     +2.033     +1.445     +0.903
compl_length          +32.000    +32.000    +32.000    +32.000    +32.000    +32.000
grad_norm             +36.083    +59.430     +5.959     +1.448     +1.559     +9.748

=== Linear trend (slope) over the run — units per 100 steps ===
 loss               slope/100steps=+0.0163   first20-mean=+0.005    last20-mean=+0.040    delta=+0.035    [UP]
 reward             slope/100steps=+1.7239   first20-mean=-3.400    last20-mean=+0.390    delta=+3.791    [UP]
 reward_std         slope/100steps=-0.7742   first20-mean=+1.554    last20-mean=+0.334    delta=-1.220    [DOWN]
 frac_zero_std      slope/100steps=+0.2577   first20-mean=+0.163    last20-mean=+0.388    delta=+0.225    [UP]
 format_valid       slope/100steps=+0.6736   first20-mean=-1.195    last20-mean=+0.438    delta=+1.633    [UP]
 action_legal       slope/100steps=+0.2732   first20-mean=-0.702    last20-mean=-0.056    delta=+0.645    [UP]
 env_reward         slope/100steps=+1.1163   first20-mean=-2.013    last20-mean=+0.438    delta=+2.451    [UP]
 belief_accuracy    slope/100steps=+0.0007   first20-mean=-0.095    last20-mean=-0.095    delta=+0.000    [FLAT]
 kl                 slope/100steps=+0.4085   first20-mean=+0.123    last20-mean=+0.995    delta=+0.872    [UP]
 compl_length       slope/100steps=-0.0297   first20-mean=+32.000   last20-mean=+32.000   delta=+0.000    [FLAT]
 grad_norm          slope/100steps=-474.6836 first20-mean=+2388.734 last20-mean=+11.383   delta=-2377.351 [DOWN]

=== Mode-collapse warning signs ===
Last-50 mean frac_reward_zero_std: 0.56 (1.0 = full collapse)
Traceback (most recent call last):
  File "C:\Users\guptapri\Downloads\Akhil\Repos\hackathon\rhythm_env\scripts\analyze_logdump.py", line 124, in <module>
    main()
  File "C:\Users\guptapri\Downloads\Akhil\Repos\hackathon\rhythm_env\scripts\analyze_logdump.py", line 105, in main
    print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (\u22650.3 = healthy variance)")
  File "C:\Users\guptapri\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 43: character maps to <undefined>
---
The diff for docs/logdump.txt is too large to render; see the raw file.
scripts/analyze_logdump.py @@ -0,0 +1,124 @@
"""Parse logdump.txt (HF Jobs UI export) and produce a training trajectory analysis."""

import ast
import statistics
from pathlib import Path

LOG_PATH = Path("docs/logdump.txt")


def parse_dict_line(line: str) -> dict | None:
    """Each metric line is a Python dict repr printed by the trainer."""
    line = line.strip()
    if not line.startswith("{") or "'loss'" not in line:
        return None
    try:
        # ast.literal_eval safely parses the single-quoted Python dict repr
        d = ast.literal_eval(line)
    except (ValueError, SyntaxError):
        return None
    return d if isinstance(d, dict) else None


def main():
    with open(LOG_PATH, encoding="utf-8") as f:
        lines = f.readlines()

    rows = [d for d in map(parse_dict_line, lines) if d is not None]

    print(f"Parsed {len(rows)} metric rows")
    if not rows:
        return

    # Snapshots at fixed percentiles of the run
    n = len(rows)
    snaps = sorted(set([0, n // 8, n // 4, n // 2, 3 * n // 4, n - 1]))

    metrics = [
        ("loss", lambda r: r.get("loss")),
        ("reward", lambda r: r.get("reward")),
        ("reward_std", lambda r: r.get("reward_std")),
        ("frac_zero_std", lambda r: r.get("frac_reward_zero_std")),
        ("format_valid", lambda r: r.get("rewards/format_valid/mean")),
        ("action_legal", lambda r: r.get("rewards/action_legal/mean")),
        ("env_reward", lambda r: r.get("rewards/env_reward/mean")),
        ("belief_accuracy", lambda r: r.get("rewards/belief_accuracy/mean")),
        ("kl", lambda r: r.get("kl")),
        ("compl_length", lambda r: r.get("completion_length")),
        ("grad_norm", lambda r: r.get("grad_norm")),
    ]

    print()
    header = f"{'metric':<18} " + " ".join(f"step~{s:>5}" for s in snaps)
    print(header)
    print("-" * len(header))
    for label, getter in metrics:
        vals = []
        for s in snaps:
            v = getter(rows[s])
            vals.append(f"{v:+.3f}" if isinstance(v, (int, float)) else "-")
        print(f"{label:<18} " + " ".join(f"{v:>10}" for v in vals))

    # Trends: least-squares slope per metric, plus first-20 vs last-20 means
    print()
    print("=== Linear trend (slope) over the run - units per 100 steps ===")
    xs = list(range(n))
    for label, getter in metrics:
        ys = [getter(r) for r in rows]
        pairs = [(x, y) for x, y in zip(xs, ys) if isinstance(y, (int, float))]
        if len(pairs) < 5:
            continue
        xs_clean = [x for x, _ in pairs]
        ys_clean = [y for _, y in pairs]
        # Simple least-squares slope, rescaled to units per 100 steps
        x_mean = statistics.mean(xs_clean)
        y_mean = statistics.mean(ys_clean)
        num = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
        den = sum((x - x_mean) ** 2 for x in xs_clean)
        slope = (num / den) * 100 if den > 0 else 0.0
        # Mean of last 20 vs mean of first 20 (more robust signal)
        first_mean = statistics.mean(ys_clean[:20]) if len(ys_clean) >= 20 else float("nan")
        last_mean = statistics.mean(ys_clean[-20:]) if len(ys_clean) >= 20 else float("nan")
        delta = last_mean - first_mean
        direction = "UP" if delta > 0.01 else ("DOWN" if delta < -0.01 else "FLAT")
        print(f" {label:<18} slope/100steps={slope:+.4f} first20-mean={first_mean:+.3f} "
              f"last20-mean={last_mean:+.3f} delta={delta:+.3f} [{direction}]")

    # Mode-collapse warning signs over the last 50 rows
    print()
    print("=== Mode-collapse warning signs ===")
    last_50 = rows[-50:]
    avg_zero_std = statistics.mean(
        r["frac_reward_zero_std"] for r in last_50
        if isinstance(r.get("frac_reward_zero_std"), (int, float))
    )
    avg_reward_std = statistics.mean(
        r["reward_std"] for r in last_50 if isinstance(r.get("reward_std"), (int, float))
    )
    print(f"Last-50 mean frac_reward_zero_std: {avg_zero_std:.2f} (1.0 = full collapse)")
    # ASCII ">=" here: the original non-ASCII ">=" glyph crashed on cp1252 consoles
    print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (>=0.3 = healthy variance)")

    print()
    print(f"Final step in log: {n} (iter 4 was cancelled at ~step 235)")
    last = rows[-1]
    print()
    print("=== Final-step metrics ===")
    for k in [
        "loss", "reward", "reward_std", "frac_reward_zero_std",
        "rewards/format_valid/mean", "rewards/action_legal/mean",
        "rewards/env_reward/mean", "rewards/belief_accuracy/mean",
        "kl", "completion_length", "grad_norm",
    ]:
        print(f"  {k:<35} {last.get(k)}")


if __name__ == "__main__":
    main()