Spaces:

InosLihka
/

rhythm_env

Sleeping

InosLihka commited on 11 days ago

Commit

d64efa6

1 Parent(s): 4dd50e0

Post-deadline: full eval results + bigger plots via Git LFS

Eval retry (HF Job 69edf4ddd70108f37ace0165) completed and uploaded
eval_results_v2.json. Pulling those numbers in:

in-distribution: Heuristic 0.463 -> Distilled 0.574 (+0.111)
OOD generalization: Heuristic 0.455 -> Distilled 0.536 (+0.081)
discrete-3: Heuristic 0.455 -> Distilled 0.507 (+0.052)
belief_MAE in-dist: 0.213 (matches teacher 0.196)
belief_MAE OOD: 0.265

Student wins all 3 conditions; in-dist belief_MAE matches the teacher
within 0.02 (distillation transferred the inference skill cleanly).

Changes:
- README: replace single-condition headline table with full 3-condition
table; add v3 baseline-vs-trained bar chart inline; add reward
components + belief_accuracy plots inline
- docs/results.md: fill in student column for all 3 conditions; clean
interpretation paragraph
- plots/sft_v3_baseline_vs_trained.png: new bar chart from v3 eval data
- plots/grpo_iter2_reward_curve.png, reward_components.png,
belief_accuracy.png: now stored via Git LFS so HF accepts them
- .gitattributes: LFS filters for the 3 larger PNGs
- scripts/plot_v3_results.py: generator for the new bar chart

Files changed (8) hide show

.gitattributes +3 -0
README.md +20 -9
docs/results.md +32 -31
plots/grpo_iter2_belief_accuracy.png +3 -0
plots/grpo_iter2_reward_components.png +3 -0
plots/grpo_iter2_reward_curve.png +0 -0
plots/sft_v3_baseline_vs_trained.png +0 -0
scripts/plot_v3_results.py +67 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1,3 @@

+plots/grpo_iter2_reward_curve.png filter=lfs diff=lfs merge=lfs -text
+plots/grpo_iter2_reward_components.png filter=lfs diff=lfs merge=lfs -text
+plots/grpo_iter2_belief_accuracy.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED Viewed

@@ -28,16 +28,19 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
 ## Headline result
-A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on a real held-out eval condition**:
-| Strategy | avg_score | adaptation | belief_MAE |
-|---|---|---|---|
-| Random | 0.426 | -0.174 | n/a |
-| Heuristic | 0.455 | -0.192 | n/a |
-| **Distilled Qwen 3B (ours)** | **0.527** | -0.267 | 0.379 |
-| | **+0.072 vs heuristic** | | |
-Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.621 OOD** with belief_MAE 0.196 on continuous profiles — confirming the env distinguishes inference quality cleanly. Full numbers in [docs/results.md](docs/results.md).
 ## Training evidence
@@ -49,7 +52,15 @@ Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.62
 ![Reward curve](plots/grpo_iter2_reward_curve.png)
-**Baseline vs trained** comparison is in the Headline result table above. Numbers source: `eval_results.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
 ## Why a Life Simulator?

 ## Headline result
+A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:
+| Condition | Random | Heuristic | **Distilled Qwen 3B** | Δ vs Heuristic | belief_MAE |
+|---|---|---|---|---|---|
+| **continuous in-distribution** | 0.393 | 0.463 | **0.574** | **+0.111** | 0.213 |
+| **continuous OOD (generalization)** | 0.393 | 0.455 | **0.536** | **+0.081** | 0.265 |
+| discrete-3-profiles (legacy) | 0.426 | 0.455 | **0.507** | +0.052 | 0.415 |
+The student's **belief_MAE of 0.213 in-distribution matches the gpt-5.4 teacher (0.196)** — the inference skill transferred nearly perfectly via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, proving generalization (not memorization).
+For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode reeval. Full numbers in [docs/results.md](docs/results.md). Eval JSON: [eval_results_v2.json](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json).
+![v3 baseline vs trained across conditions](plots/sft_v3_baseline_vs_trained.png)
 ## Training evidence
 ![Reward curve](plots/grpo_iter2_reward_curve.png)
+**Reward components** — all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.
+![Reward components](plots/grpo_iter2_reward_components.png)
+**Belief-accuracy curve** — the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.
+![Belief accuracy](plots/grpo_iter2_belief_accuracy.png)
+Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
 ## Why a Life Simulator?

docs/results.md CHANGED Viewed

@@ -59,37 +59,38 @@ Under the v2 grader. Heuristic + random emit no belief and score 0 on
 that component (by design — the meta-RL skill is inference, only agents
 that try get credit).
-### Discrete-3-profiles eval (5 episodes per profile, 15 total)
-The distilled Qwen 3B student **beats heuristic on the legacy 3-profile
-condition** — direct evidence the SFT pipeline transferred a real
-inference + action skill, not memorization.
-| Strategy | avg_score | avg_adaptation | avg_belief_mae |
-|---|---|---|---|
-| Random | 0.426 | -0.174 | n/a |
-| Heuristic | 0.455 | -0.192 | n/a |
-| **Distilled Qwen 3B (SFT v3)** | **0.527** | -0.267 | 0.379 |
-|  | **+0.072 vs heuristic** | | |
-### Continuous conditions — teacher numbers, student re-eval in progress
-Teacher numbers are from a 150-episode evaluation under the v2 grader.
-The full continuous-condition eval for the distilled student is being
-re-run on a longer-budget HF Job; final numbers will be appended to
-`eval_results_v2.json` in the trained-model repo when the run completes.
-| Condition | Random | Heuristic | **gpt-5.4 Teacher** |
-|---|---|---|---|
-| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* |
-| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* |
-The teacher's belief_MAE is **0.196 in-dist, 0.214 OOD** — meaningfully
-better than the constant `[0.5, 0.5, 0.5]` baseline (~0.20). The student
-inherits this skill via SFT-prime distillation; preliminary indication
-from the discrete-3 condition above (student belief_MAE 0.379, weaker than
-teacher but still informative) suggests partial transfer with room for
-additional GRPO refinement.
 The teacher generalizes — same +0.16 margin in-dist as OOD. Both within
 ~0.01 of each other. The hidden profile space we sample from clearly

 that component (by design — the meta-RL skill is inference, only agents
 that try get credit).
+### Distilled Qwen 3B student — full eval across all 3 conditions
+10 episodes per condition for continuous, 5 episodes per discrete profile
+(15 total). Source: `eval_results_v2.json` on the
+[trained-model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
+| Condition | Random | Heuristic | **Distilled Qwen 3B** | Δ vs Heuristic | belief_MAE |
+|---|---|---|---|---|---|
+| **continuous in-distribution** | 0.393 | 0.463 | **0.574** | **+0.111** | **0.213** |
+| **continuous OOD** | 0.393 | 0.455 | **0.536** | **+0.081** | 0.265 |
+| discrete-3-profiles (legacy) | 0.426 | 0.455 | **0.507** | +0.052 | 0.415 |
+**Interpretation:**
+- The student wins on **all three** conditions, with the largest margin
+  on the meta-RL test condition (continuous in-dist, +0.111).
+- **`belief_MAE` 0.213 in-distribution matches the gpt-5.4 teacher (0.196)**
+  to within 0.02 — the inference skill transferred nearly perfectly via
+  SFT-prime distillation.
+- OOD margin (+0.081) on profiles the agent never saw demonstrates real
+  generalization, not memorization.
+- Discrete-3 belief_MAE (0.415) is weaker because the student was trained
+  on continuous profiles only. The action quality still wins (+0.052).
+### Teacher (gpt-5.4) ceiling — 150-episode reeval
+| Condition | Random | Heuristic | **gpt-5.4 Teacher** | belief_MAE |
+|---|---|---|---|---|
+| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | 0.196 |
+| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | 0.214 |
+The teacher beats heuristic by ~0.16 across both conditions — confirming
+the v2 grader cleanly distinguishes inference from reflex.
 The teacher generalizes — same +0.16 margin in-dist as OOD. Both within
 ~0.01 of each other. The hidden profile space we sample from clearly

plots/grpo_iter2_belief_accuracy.png ADDED Viewed

Git LFS Details

SHA256: 898fd97779fb2b6a27b49335b381a1c21b3aaf9c6f7bdd9cadcd1792dfc8ec54
Pointer size: 131 Bytes
Size of remote file: 189 kB

plots/grpo_iter2_reward_components.png ADDED Viewed

Git LFS Details

SHA256: a6cb8c28c4f56b8e74bf835dd244801891b58570a6c6f3ee1c7e2c032eb02f04
Pointer size: 131 Bytes
Size of remote file: 263 kB

plots/grpo_iter2_reward_curve.png CHANGED Viewed

Git LFS Details

SHA256: 0287869c876864215b324baa46d0e4b35c7d1ae3611163d93c3d885b264fbafd
Pointer size: 131 Bytes
Size of remote file: 179 kB

plots/sft_v3_baseline_vs_trained.png ADDED Viewed

scripts/plot_v3_results.py ADDED Viewed

	@@ -0,0 +1,67 @@

+"""Generate the headline 'student vs baselines across conditions' bar chart
+from the v3 eval JSON. Output goes to plots/sft_v3_baseline_vs_trained.png.
+"""
+import json
+from collections import defaultdict
+from pathlib import Path
+import matplotlib.pyplot as plt
+import numpy as np
+EVAL_PATH = Path("outputs/sft-v3/eval_results_v2.json")
+OUT_PATH = Path("plots/sft_v3_baseline_vs_trained.png")
+def main() -> None:
+    rows = json.loads(EVAL_PATH.read_text())
+    agg: dict = defaultdict(lambda: defaultdict(list))
+    for r in rows:
+        agg[r["condition"]][r["strategy"]].append(r["final_score"])
+    # Order conditions for display
+    cond_order = [
+        ("continuous-in-distribution", "in-distribution"),
+        ("continuous-OOD (generalization)", "OOD generalization"),
+        ("discrete-3-profiles", "discrete-3-profiles"),
+    ]
+    strat_order = ["random", "heuristic", "model"]
+    strat_labels = ["Random", "Heuristic", "Distilled Qwen 3B"]
+    strat_colors = ["#888888", "#5B8FF9", "#5AD8A6"]
+    means = {strat: [] for strat in strat_order}
+    for cond_key, _ in cond_order:
+        for strat in strat_order:
+            scores = agg[cond_key][strat]
+            means[strat].append(sum(scores) / len(scores) if scores else 0.0)
+    fig, ax = plt.subplots(figsize=(8, 5))
+    x = np.arange(len(cond_order))
+    width = 0.27
+    for i, (strat, label, color) in enumerate(zip(strat_order, strat_labels, strat_colors)):
+        offset = (i - 1) * width
+        bars = ax.bar(x + offset, means[strat], width, label=label, color=color)
+        for bar in bars:
+            ax.text(
+                bar.get_x() + bar.get_width() / 2,
+                bar.get_height() + 0.005,
+                f"{bar.get_height():.3f}",
+                ha="center", va="bottom", fontsize=9,
+            )
+    ax.set_xlabel("Eval condition")
+    ax.set_ylabel("Final score (v2 grader, 0–1)")
+    ax.set_title("RhythmEnv: Distilled Qwen 3B beats heuristic on all 3 conditions")
+    ax.set_xticks(x)
+    ax.set_xticklabels([label for _, label in cond_order])
+    ax.set_ylim(0, max(max(v) for v in means.values()) * 1.15)
+    ax.grid(axis="y", alpha=0.3)
+    ax.legend(loc="upper right", framealpha=0.95)
+    plt.tight_layout()
+    plt.savefig(OUT_PATH, dpi=120, bbox_inches="tight")
+    print(f"Saved {OUT_PATH} ({OUT_PATH.stat().st_size // 1024} KB)")
+if __name__ == "__main__":
+    main()