Spaces:

InosLihka
/

rhythm_env

Sleeping

InosLihka commited on 9 days ago

Commit

666b4ce

1 Parent(s): f0ca22d

Add SFT v3 + GRPO refine results to README + results.md

GRPO refine on top of SFT v3 (200 steps, lr 1e-5, beta 0.1) lifted:
- OOD final_score: 0.536 -> 0.559 (+0.023, +4% relative)
- discrete-3 final_score: 0.507 -> 0.520 (+0.013)
- in-dist final_score: no change (within noise)

belief_MAE essentially unchanged across all conditions; the SFT-prime
distillation already extracted near-maximum inference quality from the
teacher.

Refined model uploaded to InosLihka/rhythm-env-meta-trained-sft-grpo-v1
with eval_results_v2.json containing full per-episode breakdown.

New artifact:
plots/sft_grpo_comparison.png (34KB) — heuristic / SFT v3 / SFT+GRPO
side-by-side bar chart across all 3 conditions, embedded in README

Helper scripts (added to repo for reproducibility):
scripts/verify_rubric_equivalence.py — replays eval episodes locally
under the new Rubric-based grader, verifies numerical equivalence
against saved eval JSON. Confirmed 105/105 episodes match within
float-precision tolerance after the Rubric refactor in f0ca22d.
scripts/plot_v3_results.py — generates the v3 baseline-vs-trained bar
chart from eval_results_v2.json.

Files changed (4) hide show

README.md +9 -5
docs/results.md +7 -6
plots/sft_grpo_comparison.png +0 -0
scripts/verify_rubric_equivalence.py +111 -0

README.md CHANGED Viewed

@@ -30,18 +30,22 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
 A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:
-| Condition | Random | Heuristic | **Distilled Qwen 3B** | Δ vs Heuristic | belief_MAE |
 |---|---|---|---|---|---|
-| **continuous in-distribution** | 0.393 | 0.463 | **0.574** | **+0.111** | 0.213 |
-| **continuous OOD (generalization)** | 0.393 | 0.455 | **0.536** | **+0.081** | 0.265 |
-| discrete-3-profiles (legacy) | 0.426 | 0.455 | **0.507** | +0.052 | 0.415 |
 The student's **belief_MAE of 0.213 in-distribution matches the gpt-5.4 teacher (0.196)** — the inference skill transferred nearly perfectly via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, proving generalization (not memorization).
-For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode reeval. Full numbers in [docs/results.md](docs/results.md). Eval JSON: [eval_results_v2.json](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json).
 ![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)
 ## Training evidence
 **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.

 A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:
+| Condition | Random | Heuristic | **Distilled Qwen 3B** | **+ GRPO refine** | belief_MAE |
 |---|---|---|---|---|---|
+| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.213 |
+| **continuous OOD (generalization)** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
+| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |
 The student's **belief_MAE of 0.213 in-distribution matches the gpt-5.4 teacher (0.196)** — the inference skill transferred nearly perfectly via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, proving generalization (not memorization).
+A subsequent GRPO refine on top of the SFT'd student lifted **OOD generalization by another +0.023 (4% relative)** and discrete-3 by +0.013, with no in-dist regression. The GRPO-refined model is at [`InosLihka/rhythm-env-meta-trained-sft-grpo-v1`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1).
+For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode reeval. Full numbers in [docs/results.md](docs/results.md). Eval JSONs: [SFT v3](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json) · [SFT v3 + GRPO](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json).
 ![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)
+![SFT v3 vs SFT+GRPO comparison](plots/sft_grpo_comparison.png)
 ## Training evidence
 **SFT v3 loss curve** — distillation training on 5,040 (state, response) pairs from a gpt-5.4 teacher. Loss drops from 2.77 → 0.083 over 525 steps and stays converged. No overfitting.

docs/results.md CHANGED Viewed

@@ -62,14 +62,15 @@ that try get credit).
 ### Distilled Qwen 3B student — full eval across all 3 conditions
 10 episodes per condition for continuous, 5 episodes per discrete profile
-(15 total). Source: `eval_results_v2.json` on the
-[trained-model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
-| Condition | Random | Heuristic | **Distilled Qwen 3B** | Δ vs Heuristic | belief_MAE |
 |---|---|---|---|---|---|
-| **continuous in-distribution** | 0.393 | 0.463 | **0.574** | **+0.111** | **0.213** |
-| **continuous OOD** | 0.393 | 0.455 | **0.536** | **+0.081** | 0.265 |
-| discrete-3-profiles (legacy) | 0.426 | 0.455 | **0.507** | +0.052 | 0.415 |
 **Interpretation:**
 - The student wins on **all three** conditions, with the largest margin

 ### Distilled Qwen 3B student — full eval across all 3 conditions
 10 episodes per condition for continuous, 5 episodes per discrete profile
+(15 total). Sources:
+- SFT v3 numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json)
+- SFT v3 + GRPO refine numbers: [`eval_results_v2.json`](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-grpo-v1/blob/main/eval_results_v2.json)
+| Condition | Random | Heuristic | **SFT v3** | **+ GRPO refine** | belief_MAE (GRPO) |
 |---|---|---|---|---|---|
+| **continuous in-distribution** | 0.393 | 0.463 | **0.574** *(+0.111)* | 0.573 | 0.216 |
+| **continuous OOD** | 0.393 | 0.455 | 0.536 *(+0.081)* | **0.559** *(+0.104)* | 0.263 |
+| discrete-3-profiles (legacy) | 0.426 | 0.455 | 0.507 *(+0.052)* | **0.520** *(+0.065)* | 0.430 |
 **Interpretation:**
 - The student wins on **all three** conditions, with the largest margin

plots/sft_grpo_comparison.png ADDED Viewed

scripts/verify_rubric_equivalence.py ADDED Viewed

	@@ -0,0 +1,111 @@

+"""
+Numerical equivalence check for the Rubric refactor.
+Replays each episode from a saved eval JSON (which was scored by the
+OLD `_grade_episode`) under the LOCAL code (which has the NEW Rubric-
+based grader). If `final_score` matches for every episode within float
+precision, the refactor is functionally identical to the original.
+Usage:
+    python scripts/verify_rubric_equivalence.py outputs/sft-v3/eval_results_v2.json
+"""
+import argparse
+import json
+import os
+import random
+import sys
+from pathlib import Path
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from models import ActionType, RhythmAction
+from server.rhythm_environment import MAX_STEPS, RhythmEnvironment
+def replay_episode(seed: int, actions: list[str], final_belief: list[float] | None,
+                   profile: str | None) -> float:
+    """Replay one episode locally under the new grader."""
+    env = RhythmEnvironment()
+    if profile:
+        obs = env.reset(seed=seed, profile=profile)
+    else:
+        obs = env.reset(seed=seed)
+    for i, action_name in enumerate(actions):
+        if obs.done:
+            break
+        # Set the final belief on the LAST step (matches inference_eval.py
+        # behavior: record_belief was called every step but only the latest
+        # matters for the grader)
+        is_last = (i == len(actions) - 1) or (i == MAX_STEPS - 1)
+        if is_last and final_belief is not None:
+            env.record_belief(final_belief)
+        rhythm_action = RhythmAction(action_type=ActionType(action_name))
+        obs = env.step(rhythm_action)
+    return obs.reward_breakdown.get("final_score", 0.0)
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("eval_json", help="Path to eval_results JSON to verify against")
+    ap.add_argument("--tolerance", type=float, default=1e-4,
+                    help="Allowed |new - old| difference (default 1e-4)")
+    args = ap.parse_args()
+    with open(args.eval_json) as f:
+        rows = json.load(f)
+    print(f"Loaded {len(rows)} episodes from {args.eval_json}")
+    print()
+    matches = 0
+    mismatches = []
+    for row in rows:
+        seed = row["seed"]
+        actions = row.get("actions", [])
+        final_belief = row.get("final_belief")  # null for heuristic/random
+        profile = row.get("profile_name") if row.get("profile_mode") == "discrete" else None
+        # Some rows have profile_mode='discrete' with explicit profile
+        # name (the 3 reference profiles). Pass it via kwarg.
+        if profile and not profile.startswith("sampled_"):
+            replay_profile = profile
+        else:
+            replay_profile = None
+        old_score = row["final_score"]
+        new_score = replay_episode(seed, actions, final_belief, replay_profile)
+        diff = abs(new_score - old_score)
+        if diff <= args.tolerance:
+            matches += 1
+        else:
+            mismatches.append({
+                "seed": seed,
+                "strategy": row["strategy"],
+                "condition": row["condition"],
+                "old": old_score,
+                "new": new_score,
+                "diff": diff,
+            })
+    print(f"=" * 60)
+    print(f"RESULT: {matches}/{len(rows)} episodes match within ±{args.tolerance}")
+    print(f"=" * 60)
+    if mismatches:
+        print()
+        print(f"MISMATCHES ({len(mismatches)}):")
+        for m in mismatches[:10]:
+            print(f"  seed={m['seed']:>5}  {m['strategy']:>10}  "
+                  f"{m['condition']:<35}  old={m['old']:.4f}  "
+                  f"new={m['new']:.4f}  diff={m['diff']:.4f}")
+        if len(mismatches) > 10:
+            print(f"  ... and {len(mismatches) - 10} more")
+        sys.exit(1)
+    else:
+        print()
+        print("REFACTOR IS NUMERICALLY EQUIVALENT to the old grader.")
+if __name__ == "__main__":
+    main()