InosLihka Claude Opus 4.7 (1M context) committed on
Commit 7bb9278 · 1 Parent(s): 63216a8

docs: handoff bundle for new chat session + iter 4 partial analysis


- docs/handoff_for_next_session.md: comprehensive context dump for resuming
in a new Claude session. Covers project state, all 5 iterations, both
rounds of fixes (round 1 = 7 fixes for iter 2, round 2 = 7 fixes for
iter 4+), what's broken, what's not, pending iter 6+ candidates, key
files, hardware notes.
- docs/logdump.txt: HF Jobs UI export from iter 4 cancelled run (235 steps,
3272 lines, the only data we have for the round-2 fixes in action).
- docs/iter4_partial_analysis.txt: parsed trajectory + trends + final-step
metrics. Shows belief_accuracy FLAT at -0.10 — the unsolved core issue.
- scripts/analyze_logdump.py: parser used to produce the analysis.

Honest meta-note: this project was hit-and-trial. Each iter exposed bugs
we should have caught upfront. Future iterations should compute reward
variance across IDENTICAL completions before training, analyze belief
target distribution analytically, and run 50-step micro-smoke-trains
before committing to 200+ step runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/handoff_for_next_session.md ADDED
@@ -0,0 +1,188 @@
# Handoff for new chat session

Paste the contents of this file (or just point the new chat at it) to bring
a fresh Claude session up to speed.

---

## Project + vision

**Meta OpenEnv Hackathon submission** — `RhythmEnv`, a meta-RL environment
where an LLM agent learns the *skill of inferring a user's hidden personality*
from observation alone. The POC trains in simulation; production replaces the
synthetic meters with real wearable signals (HRV, calendar, accept/ignore taps).

**The aim**: the agent should require ~no explicit input from users, working
purely from passive sensor data plus tap responses. SensorLM (Google, 2025) is
the proven input layer for production.

## Where the code lives

- Local: `c:/Users/guptapri/Downloads/Akhil/Repos/hackathon/rhythm_env/`
- HF Space (deployed): `https://huggingface.co/spaces/InosLihka/rhythm_env`
- Git remote: `hf` → that HF Space (main branch). Local working branch: `round2`.
- HF token: `~/.cache/huggingface/token`

## Current state — iteration 5 running

| iter | Hardware | Config | Result |
|---|---|---|---|
| 1 | a100 | 200 steps, LoRA 4, num_gen 4 | Mode collapse → `EXERCISE 5 5 5` |
| 2 | a100 | 400 steps + 7 fixes (temp 1.5, weights, action_legal=0, repetition penalty, late_quality math, hint=0, seed-mix) | Mode collapse → MEDITATE-EXERCISE 2-cycle |
| 3 | n/a | 800 steps + 7 fixes | Cancelled before run (stale code) |
| 4 (a100/l40s/h200 attempts) | various | various | Capacity-cancelled or H200/Unsloth incompat |
| **4 (a10g)** | a10g-large | LoRA 16, num_gen 8, 800 steps + further 7 fixes from external bug review | **CANCELLED at step 235 by mistake**, based on stale API logs; the UI showed it was healthy. ~$2.10 wasted. |
| **5 (a10g)** | a10g-large | LoRA 8, num_gen 4, 500 steps + same 7 fixes as iter 4 | **RUNNING** — job `69eda027d70108f37acdf9a7` |

**Spend: ~$5.60 of $30 budget.**

## The two rounds of fixes applied

### Round 1 (after iter 1 collapse, applied for iter 2): the 7 hyperparameter/reward fixes
1. `temperature` 1.0 → 1.5 (force diverse rollouts)
2. `reward_weights` `[0.3, 0.3, 1.0, 1.0]` → `[0.05, 0.05, 1.5, 3.0]` (suppress saturated layers)
3. `action_legal` returns 0 (was +0.5) for valid actions — drops the constant-reward layer
4. Explicit repetition penalty in `env_reward` (-0.3 when the same action appears 3+ times in a row)
5. `_grade_episode` `late_quality` normalization fix ([-3, +3], not [-1, +1])
6. `hint_fraction` 0.15 → 0.0 (eliminates the train-eval distribution mismatch)
7. `env_reward` seed-fallback hardening (`(i*17)^0xBEEF` mix to break clusters)

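Fix 4 (the repetition penalty) is the only one of these with branching logic. A minimal sketch of the idea, assuming a plain action-history list; the names are illustrative, not the actual `env_reward` internals:

```python
def repetition_penalty(action_history: list[str], penalty: float = -0.3) -> float:
    """Round-1 fix 4, sketched: penalize the latest action when it makes
    three or more identical actions in a row; otherwise no penalty."""
    if len(action_history) >= 3 and len(set(action_history[-3:])) == 1:
        return penalty
    return 0.0
```

A rollout ending `EXERCISE, EXERCISE, EXERCISE` is taxed on every further repeat. Note that iter 2's MEDITATE-EXERCISE 2-cycle evades a 3-in-a-row check like this one, which is consistent with the collapse recurring.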
### Round 2 (after iter 2 collapse + external bug review, applied for iter 4 onward): 7 deeper fixes
1. **Anomalies surfaced in the prompt** (StepRecord + format_observation_prompt + inference.py) — previously computed but never visible to the agent
2. **Belief baseline subtraction** (`belief_accuracy`): reward = similarity − constant_baseline_similarity, so a constant `5 5 5` no longer earns +1/step of free reward
3. **Profile weight cap 0.80 → 0.45** (`sample_profile`) — forces multi-meter profiles
4. **Scaled-down shaping**: -0.10/-0.15/+0.07 (was -0.30/-0.40/+0.20)
5. **Step-0 belief reward = 0** (no information to commit on yet)
6. **Belief-action coupling reward**: ±0.15 when the action matches/contradicts the emitted belief
7. **`grader_bias` moved out of `_compute_reward` into `env_reward`** — keeps the env's per-step reward pure for the inference signal

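Fix 2 can be sketched as follows. Cosine similarity and the names here are assumptions for illustration; the real metric lives in `belief_accuracy` in `training/reward_functions.py`:

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def belief_reward(belief: list[float], target: list[float]) -> float:
    """Round-2 fix 2, sketched: score a belief only for how much it beats
    the constant mid-scale guess (`5 5 5`, i.e. 0.5 per normalized meter),
    so that guess earns ~0 instead of +1 per step."""
    baseline = [0.5] * len(target)
    return cosine_sim(belief, target) - cosine_sim(baseline, target)
```

Under this scheme `belief_reward([0.5, 0.5, 0.5], target)` is 0 for every target; only beliefs closer to the target than the constant guess score positive.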
## What iter 4 partial data (235/800 steps) tells us

Logs at `docs/logdump.txt`. Analysis at `docs/iter4_partial_analysis.txt`.

**Working:**
- Total reward: -3.4 → +0.39 (climbing)
- format_valid: -1.20 → +0.44 (slow but climbing)
- env_reward: -2.01 → +0.44 (climbing)
- grad_norm normalized to ~10
- No catastrophic mode collapse

**Still broken — the unsolved core:**
- `belief_accuracy/mean` flat at **-0.10** throughout all 235 steps
- Linear slope: +0.0007 per 100 steps (essentially zero)
- The agent emits beliefs SLIGHTLY WORSE than the constant baseline

**Root-cause hypothesis** (from the analysis):

The profile cap (0.80 → 0.45) compressed the belief target distribution.
With balanced profiles, sampled belief vectors land near `[0.5, 0.5, 0.5]`, so
the constant `5 5 5` baseline already has high similarity. Real learning has
a tiny ceiling. **The two fixes (profile cap + baseline subtraction) interact
negatively.**

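The hypothesis is cheap to sanity-check numerically before another paid run. A sketch under stated assumptions: the rejection sampler below is a toy stand-in for `sample_profile`, not its actual distribution:

```python
import math
import random

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def sample_capped_profile(cap: float, dims: int = 3) -> list[float]:
    """Toy stand-in for sample_profile: normalized random weights,
    rejection-sampled until no single meter exceeds the cap."""
    while True:
        w = [random.random() for _ in range(dims)]
        s = sum(w)
        w = [x / s for x in w]
        if max(w) <= cap:
            return w

def baseline_similarity(cap: float, n: int = 2000) -> float:
    """Mean similarity the constant [0.5, 0.5, 0.5] guess scores
    against n profiles sampled under the given weight cap."""
    random.seed(0)  # deterministic for comparison
    sims = [cosine_sim([0.5] * 3, sample_capped_profile(cap)) for _ in range(n)]
    return sum(sims) / n

for cap in (0.80, 0.45):
    print(f"cap={cap}: constant-baseline mean similarity = {baseline_similarity(cap):.3f}")
```

If the cap-0.45 mean similarity comes out well above the cap-0.80 value, the constant guess really is near-optimal and the belief layer has almost no headroom, consistent with the flat -0.10.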
## Pending fixes NOT yet attempted (priority order for iter 6+)

1. **HIGHEST PRIORITY: revert the profile cap to 0.80** — restores belief-target
   spread; the new `grader_bias` term independently handles the original "spam
   recovery actions" exploit. Single-line fix in `sample_profile`.

2. **Multi-step plan generation** — deferred from the external agent's analysis.
   A completion becomes a 3-action plan instead of 1 action; `env_reward` replays
   the plan cumulatively. Addresses the structural per-step-GRPO vs per-episode-
   grader mismatch (`adaptation_score` is 30% of the grade but only ~3% of
   training rows see it, via the terminal bonus). This is the highest-leverage
   missing fix.

3. **Captioned anomaly history** — describe per-meter anomalies in language
   ("vitality dropped 5% MORE than baseline") instead of `[anom V-0.05]`.
   Don't bake in conclusions ("this person is introverted") — just describe.

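The replay idea in pending fix 2 is small. A sketch assuming a gym-style `env.step(action) -> (obs, reward, done)` interface, which is hypothetical here (the real plumbing would go through `env_reward`):

```python
def replay_plan_reward(env, plan: list[str]) -> float:
    """Score a whole multi-action plan by replaying it step by step and
    summing rewards, so each GRPO group compares episode-level
    consequences instead of single-step ones."""
    total = 0.0
    for action in plan:
        _obs, reward, done = env.step(action)  # hypothetical gym-style step()
        total += reward
        if done:
            break
    return total
```

Each completion would then be parsed into a 3-action plan and scored with this replay instead of a single per-step `env_reward` call.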
100
+
101
+ The user (Akhil) explicitly called this out 2026-04-26: "I don't want to do
102
+ hit and trial." Each iter exposed a bug we should have caught upfront.
103
+
104
+ Things that would have prevented the wasted iters:
105
+ - **Compute reward variance across IDENTICAL completions** before training.
106
+ GRPO advantage = reward − group_mean; if all 4 completions get the same
107
+ reward from a layer, that layer contributes ZERO gradient. `pipeline_dryrun`
108
+ tested DIFFERENT actions per kind, missing this.
109
+ - **Analytically check belief target distribution** under continuous profiles
110
+ before tightening the cap. Would have caught the iter 4 issue.
111
+ - **Run a 50-step micro-smoke-train ($0.10)** on the smallest possible config
112
+ before committing to 200+ step runs.
113
+ - **Address structural issues (multi-step plans) before tweaking hyperparams.**
114
+
115
+ ## Key files for new session
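The first bullet is a cheap preflight function. A sketch (the per-layer reward dict is an assumed interface; the actual layers live in `training/reward_functions.py`):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's per-completion advantage: reward minus the group mean."""
    g = mean(rewards)
    return [r - g for r in rewards]

def dead_reward_layers(layer_rewards: dict[str, list[float]]) -> list[str]:
    """Given each layer's rewards for one group of IDENTICAL completions,
    flag layers with zero variance: their advantages are all zero, so they
    contribute no gradient and should be fixed before a paid run."""
    return [name for name, rs in layer_rewards.items() if pstdev(rs) == 0.0]
```

A constant +0.5 `action_legal` layer (the round-1 bug) shows up immediately: `dead_reward_layers({"action_legal": [0.5] * 4, "env_reward": [0.1, -0.3, 0.4, 0.0]})` returns `["action_legal"]`.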
116
+
117
+ | File | What it has |
118
+ |---|---|
119
+ | `docs/architecture.md` | 10 ASCII diagrams with concrete values, full pipeline |
120
+ | `docs/iterations.md` | Iter 0-3 journey doc (NEEDS update for iter 4+5) |
121
+ | `docs/logdump.txt` | Iter 4 raw UI logs through step 235 (3272 lines) |
122
+ | `docs/iter4_partial_analysis.txt` | Parsed iter 4 trajectory snapshots + trends |
123
+ | `docs/results.md` | Template for headline results (NOT yet filled) |
124
+ | `docs/blog_post.md` | Strong narrative, pre-meta-RL refactor |
125
+ | `docs/references/judging_criteria.md` | Hackathon scoring (40% innovation, 30% storytelling, 20% improvement, 10% reward pipeline) |
126
+ | `training/WhatMakesAGoodSubmission.md` | Hackathon criteria, mentions OpenEnv Rubric system (we don't use it — gap) |
127
+ | `server/rhythm_environment.py` | Env, sample_profile, _compute_reward, _grade_episode |
128
+ | `training/reward_functions.py` | 4-layer reward stack, parser, belief_accuracy |
129
+ | `training/dataset.py` | Prompt builder, episode generator |
130
+ | `training/train.py` | GRPOConfig setup, reward_weights wiring |
131
+ | `scripts/train_on_hf.py` | HF Jobs orchestrator |
132
+ | `scripts/analyze_logdump.py` | Parse iter 4 UI export |
133
+ | `scripts/analyze_iter.py` | Pull + analyze any iter's HF Hub repo |
134
+ | `eval_baselines_meta.json` | Heuristic in-dist 0.587, OOD 0.580 — bars to beat |
135
+
## Open monitor + job state at handoff

- **Active monitor**: task `b10lozxd8` watching iter 5
- **Iter 5 job**: `69eda027d70108f37acdf9a7` on a10g-large, RUNNING
- **HF Job UI**: `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
- **Iter 5 model repo (when complete)**: `https://huggingface.co/InosLihka/rhythm-env-meta-trained-iter5`

## What to do in the new session

**If iter 5 results have landed:**
1. Check `docs/logdump.txt` for the new logs OR pull from `InosLihka/rhythm-env-meta-trained-iter5`
2. Run `python scripts/analyze_iter.py iter5`
3. Decide: ship it / iter 6 with the profile-cap revert / iter 6 with multi-step plans

**If iter 5 is still running:**
1. Check status via `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
2. Trust the UI, NOT the API (its lag is severe)

**If you're starting fresh after iter 5 completed:**
1. Read this file
2. Check the iter 5 model repo
3. Decide the path forward based on the belief_accuracy outcome

## Commands you'll need

```bash
# HF Jobs status
hf jobs ps
hf jobs inspect <id>
hf jobs logs <id>   # use the UI instead — the API lags

# Submit a new training run
cd c:/Users/guptapri/Downloads/Akhil/Repos/hackathon/rhythm_env
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \
  -e FAST_MODE=1 \
  -e MODEL_REPO_SUFFIX=iter6 \
  -e LORA_RANK=8 -e NUM_GENERATIONS=4 -e MAX_STEPS=500 \
  -d scripts/train_on_hf.py

# Pull and analyze a finished iter
python scripts/analyze_iter.py iter5

# Tests still pass:
python -m pytest tests/ -q
```

## Hardware notes (learned the hard way)

- **a100-large**: best perf, but capacity-limited at peak hours
- **a10g-large**: reliable, ~30% slower, well-tested with Unsloth
- **l40sx1**: also capacity-limited
- **h200**: Unsloth doesn't detect the GPU (sm_90 incompat) — DO NOT USE
- **HF Jobs API `/logs` endpoint lags severely** — always cross-check via the live UI
docs/iter4_partial_analysis.txt ADDED
@@ -0,0 +1,41 @@
Parsed 241 metric rows

metric             step~    0 step~   30 step~   60 step~  120 step~  180 step~  240
------------------------------------------------------------------------------------
loss                   -0.000     +0.007     +0.056     +0.081     +0.058     +0.036
reward                 -3.189     -3.726     -0.771     -0.125     +0.928     +0.937
reward_std             +2.168     +2.272     +1.452     +0.691     +0.099     +0.193
frac_zero_std          +0.000     +0.000     +0.000     +0.250     +0.250     +0.000
format_valid           -1.125     -1.375     +0.375     +0.344     +0.672     +0.688
action_legal           -0.656     -0.750     -0.125     -0.062     +0.000     +0.000
env_reward             -1.875     -2.213     -0.348     +0.107     +0.759     +0.812
belief_accuracy        -0.096     -0.100     -0.087     -0.100     -0.081     -0.105
kl                     -0.000     +0.171     +1.398     +2.033     +1.445     +0.903
compl_length          +32.000    +32.000    +32.000    +32.000    +32.000    +32.000
grad_norm             +36.083    +59.430     +5.959     +1.448     +1.559     +9.748

=== Linear trend (slope) over the run - units per 100 steps ===
  loss               slope/100steps=+0.0163 first20-mean=+0.005 last20-mean=+0.040 delta=+0.035 [UP]
  reward             slope/100steps=+1.7239 first20-mean=-3.400 last20-mean=+0.390 delta=+3.791 [UP]
  reward_std         slope/100steps=-0.7742 first20-mean=+1.554 last20-mean=+0.334 delta=-1.220 [DOWN]
  frac_zero_std      slope/100steps=+0.2577 first20-mean=+0.163 last20-mean=+0.388 delta=+0.225 [UP]
  format_valid       slope/100steps=+0.6736 first20-mean=-1.195 last20-mean=+0.438 delta=+1.633 [UP]
  action_legal       slope/100steps=+0.2732 first20-mean=-0.702 last20-mean=-0.056 delta=+0.645 [UP]
  env_reward         slope/100steps=+1.1163 first20-mean=-2.013 last20-mean=+0.438 delta=+2.451 [UP]
  belief_accuracy    slope/100steps=+0.0007 first20-mean=-0.095 last20-mean=-0.095 delta=+0.000 [FLAT]
  kl                 slope/100steps=+0.4085 first20-mean=+0.123 last20-mean=+0.995 delta=+0.872 [UP]
  compl_length       slope/100steps=-0.0297 first20-mean=+32.000 last20-mean=+32.000 delta=+0.000 [FLAT]
  grad_norm          slope/100steps=-474.6836 first20-mean=+2388.734 last20-mean=+11.383 delta=-2377.351 [DOWN]

=== Mode-collapse warning signs ===
Last-50 mean frac_reward_zero_std: 0.56 (1.0 = full collapse)
Traceback (most recent call last):
  File "C:\Users\guptapri\Downloads\Akhil\Repos\hackathon\rhythm_env\scripts\analyze_logdump.py", line 124, in <module>
    main()
  File "C:\Users\guptapri\Downloads\Akhil\Repos\hackathon\rhythm_env\scripts\analyze_logdump.py", line 105, in main
    print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (\u22650.3 = healthy variance)")
  File "C:\Users\guptapri\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 43: character maps to <undefined>
---
docs/logdump.txt ADDED
The diff for this file is too large to render. See raw diff
 
scripts/analyze_logdump.py ADDED
@@ -0,0 +1,124 @@
"""Parse logdump.txt (HF Jobs UI export) and produce a training trajectory analysis."""

import statistics
from pathlib import Path

LOG_PATH = Path("docs/logdump.txt")


def parse_dict_line(line: str) -> dict | None:
    """Parse one trainer metric line into a dict, or return None.

    HF Jobs prints each metrics row as a Python dict repr (single quotes),
    so json.loads won't work; eval with empty builtins handles the literal.
    """
    line = line.strip()
    if not line.startswith("{") or "'loss'" not in line:
        return None
    try:
        py_dict = eval(line, {"__builtins__": {}}, {})
        if isinstance(py_dict, dict):
            return py_dict
    except Exception:
        return None
    return None


def main():
    with open(LOG_PATH, encoding="utf-8") as f:
        lines = f.readlines()

    rows = []
    for ln in lines:
        d = parse_dict_line(ln)
        if d is not None:
            rows.append(d)

    print(f"Parsed {len(rows)} metric rows")
    if not rows:
        return

    # Snapshots at fixed percentiles of the run
    n = len(rows)
    snaps = sorted(set([0, n // 8, n // 4, n // 2, 3 * n // 4, n - 1]))

    metrics = [
        ("loss", lambda r: r.get("loss")),
        ("reward", lambda r: r.get("reward")),
        ("reward_std", lambda r: r.get("reward_std")),
        ("frac_zero_std", lambda r: r.get("frac_reward_zero_std")),
        ("format_valid", lambda r: r.get("rewards/format_valid/mean")),
        ("action_legal", lambda r: r.get("rewards/action_legal/mean")),
        ("env_reward", lambda r: r.get("rewards/env_reward/mean")),
        ("belief_accuracy", lambda r: r.get("rewards/belief_accuracy/mean")),
        ("kl", lambda r: r.get("kl")),
        ("compl_length", lambda r: r.get("completion_length")),
        ("grad_norm", lambda r: r.get("grad_norm")),
    ]

    print()
    header = f"{'metric':<18} " + " ".join(f"step~{s:>5}" for s in snaps)
    print(header)
    print("-" * len(header))
    for label, getter in metrics:
        vals = []
        for s in snaps:
            v = getter(rows[s])
            vals.append(f"{v:+.3f}" if isinstance(v, (int, float)) else "-")
        print(f"{label:<18} " + " ".join(f"{v:>10}" for v in vals))

    # Linear-fit slope per metric (eyeball the trend direction)
    print()
    print("=== Linear trend (slope) over the run - units per 100 steps ===")
    xs = list(range(n))
    for label, getter in metrics:
        ys = [getter(r) for r in rows]
        ys_clean = [y for y in ys if isinstance(y, (int, float))]
        xs_clean = [x for x, y in zip(xs, ys) if isinstance(y, (int, float))]
        if len(ys_clean) < 5:
            continue
        # Simple least-squares slope, scaled to units per 100 steps
        x_mean = statistics.mean(xs_clean)
        y_mean = statistics.mean(ys_clean)
        num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs_clean, ys_clean))
        den = sum((x - x_mean) ** 2 for x in xs_clean)
        slope = (num / den) * 100 if den > 0 else 0
        # Mean of last 20 vs mean of first 20 (more robust than the raw slope)
        first_mean = statistics.mean(ys_clean[:20]) if len(ys_clean) >= 20 else float("nan")
        last_mean = statistics.mean(ys_clean[-20:]) if len(ys_clean) >= 20 else float("nan")
        delta = last_mean - first_mean
        direction = "UP" if delta > 0.01 else ("DOWN" if delta < -0.01 else "FLAT")
        print(f"  {label:<18} slope/100steps={slope:+.4f} first20-mean={first_mean:+.3f} last20-mean={last_mean:+.3f} delta={delta:+.3f} [{direction}]")

    # Mode-collapse warning signs over the tail of the run
    print()
    print("=== Mode-collapse warning signs ===")
    last_50 = rows[-50:] if len(rows) >= 50 else rows
    avg_zero_std = statistics.mean(r.get("frac_reward_zero_std", 0) for r in last_50 if isinstance(r.get("frac_reward_zero_std"), (int, float)))
    avg_reward_std = statistics.mean(r.get("reward_std", 0) for r in last_50 if isinstance(r.get("reward_std"), (int, float)))
    print(f"Last-50 mean frac_reward_zero_std: {avg_zero_std:.2f} (1.0 = full collapse)")
    # ASCII ">=", not U+2265: the Unicode char crashes cp1252 consoles on Windows
    # (this is the UnicodeEncodeError captured in docs/iter4_partial_analysis.txt)
    print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (>=0.3 = healthy variance)")

    print()
    print(f"Final step in log: {n} (iter 4 was canceled at ~step 235)")
    last = rows[-1]
    print()
    print("=== Final-step metrics ===")
    for k in [
        "loss", "reward", "reward_std", "frac_reward_zero_std",
        "rewards/format_valid/mean", "rewards/action_legal/mean",
        "rewards/env_reward/mean", "rewards/belief_accuracy/mean",
        "kl", "completion_length", "grad_norm",
    ]:
        v = last.get(k)
        print(f"  {k:<35} {v}")


if __name__ == "__main__":
    main()