Post-deadline: full eval results + bigger plots via Git LFS
Eval retry (HF Job 69edf4ddd70108f37ace0165) completed and uploaded
eval_results_v2.json. Pulling those numbers in:
- in-distribution: Heuristic 0.463 -> Distilled 0.574 (+0.111)
- OOD generalization: Heuristic 0.455 -> Distilled 0.536 (+0.081)
- discrete-3: Heuristic 0.455 -> Distilled 0.507 (+0.052)
- belief_MAE in-dist: 0.213 (teacher: 0.196)
- belief_MAE OOD: 0.265
Student wins all 3 conditions; in-dist belief_MAE matches the teacher
within 0.02 (distillation transferred the inference skill cleanly).
Changes:
- README: replace single-condition headline table with full 3-condition
table; add v3 baseline-vs-trained bar chart inline; add reward
components + belief_accuracy plots inline
- docs/results.md: fill in student column for all 3 conditions; clean
interpretation paragraph
- plots/sft_v3_baseline_vs_trained.png: new bar chart from v3 eval data
- plots/grpo_iter2_reward_curve.png, reward_components.png,
belief_accuracy.png: now stored via Git LFS so HF accepts them
- .gitattributes: LFS filters for the 3 larger PNGs
- scripts/plot_v3_results.py: generator for the new bar chart
- .gitattributes +3 -0
- README.md +20 -9
- docs/results.md +32 -31
- plots/grpo_iter2_belief_accuracy.png +3 -0
- plots/grpo_iter2_reward_components.png +3 -0
- plots/grpo_iter2_reward_curve.png +0 -0
- plots/sft_v3_baseline_vs_trained.png +0 -0
- scripts/plot_v3_results.py +67 -0
.gitattributes:

@@ -0,0 +1,3 @@
+plots/grpo_iter2_reward_curve.png filter=lfs diff=lfs merge=lfs -text
+plots/grpo_iter2_reward_components.png filter=lfs diff=lfs merge=lfs -text
+plots/grpo_iter2_belief_accuracy.png filter=lfs diff=lfs merge=lfs -text
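The `.gitattributes` hunk adds the three LFS filter lines that `git lfs track` normally writes. A minimal sketch that reproduces those exact lines by hand in a throwaway directory (illustrative only; in a real clone you would run `git lfs install` once and then `git lfs track "plots/<name>.png"` per file):

```shell
# Reproduce the .gitattributes lines that `git lfs track` writes.
# Illustrative only: writes to a temp dir, not the actual repo.
demo=$(mktemp -d)
for f in grpo_iter2_reward_curve grpo_iter2_reward_components grpo_iter2_belief_accuracy; do
  echo "plots/${f}.png filter=lfs diff=lfs merge=lfs -text" >> "$demo/.gitattributes"
done
cat "$demo/.gitattributes"
```

After `git lfs track`, committing `.gitattributes` alongside the PNGs is what makes the host store pointers instead of rejecting the large binaries.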
README.md:

@@ -28,16 +28,19 @@ This is **meta-reinforcement learning** for personalization: the agent isn't tra
 
 ## Headline result
 
-A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on
-|---|---|---|---|
+A small (Qwen 2.5-3B + 4-bit + LoRA) student, distilled from a gpt-5.4 teacher via Algorithm Distillation, **beats the heuristic baseline on all three eval conditions**:
+
+| Condition | Random | Heuristic | **Distilled Qwen 3B** | Δ vs Heuristic | belief_MAE |
+|---|---|---|---|---|---|
+| **continuous in-distribution** | 0.393 | 0.463 | **0.574** | **+0.111** | 0.213 |
+| **continuous OOD (generalization)** | 0.393 | 0.455 | **0.536** | **+0.081** | 0.265 |
+| discrete-3-profiles (legacy) | 0.426 | 0.455 | **0.507** | +0.052 | 0.415 |
+
+The student's **belief_MAE of 0.213 in-distribution matches the gpt-5.4 teacher (0.196)** — the inference skill transferred nearly perfectly via SFT-prime. On OOD profiles the agent never saw, it still beats heuristic by +0.081, proving generalization (not memorization).
+
+For reference, the gpt-5.4 teacher (upper bound) hits 0.611 in-dist / 0.621 OOD on a 150-episode reeval. Full numbers in [docs/results.md](docs/results.md). Eval JSON: [eval_results_v2.json](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3/blob/main/eval_results_v2.json).
+
+![headline chart]()
 
 ## Training evidence
 
@@ -49,7 +52,15 @@ Plus the gpt-5.4 teacher (the upper-bound reference) hits **0.611 in-dist / 0.62
 
 ![reward curve]()
 
-**
+**Reward components** — all 4 reward layers overlaid (format_valid, action_legal, env_reward, belief_accuracy). Lets you read what each layer is contributing as training progresses.
+
+![reward components]()
+
+**Belief-accuracy curve** — the meta-RL signal. Rolling mean of how close the agent's emitted belief vector is to the true profile vector, per step.
+
+![belief accuracy]()
+
+Numbers source: `eval_results_v2.json` in the [trained model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
 
 ## Why a Life Simulator?
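The Δ-vs-Heuristic column in the new headline table can be sanity-checked in a few lines (means copied from the table above; this only re-derives the deltas, it does not re-run the eval):

```python
# Per-condition (heuristic, distilled-student) means from the README table.
rows = {
    "continuous in-distribution": (0.463, 0.574),
    "continuous OOD (generalization)": (0.455, 0.536),
    "discrete-3-profiles (legacy)": (0.455, 0.507),
}

# Delta vs heuristic, rounded to 3 decimals like the table.
deltas = {name: round(student - heuristic, 3) for name, (heuristic, student) in rows.items()}
print(deltas)  # in-dist 0.111, OOD 0.081, discrete-3 0.052
```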
docs/results.md:

@@ -59,37 +59,38 @@ Under the v2 grader. Heuristic + random emit no belief and score 0 on
 that component (by design — the meta-RL skill is inference, only agents
 that try get credit).
 
-###
-|---|---|---|---|
+### Distilled Qwen 3B student — full eval across all 3 conditions
+
+10 episodes per condition for continuous, 5 episodes per discrete profile
+(15 total). Source: `eval_results_v2.json` on the
+[trained-model repo](https://huggingface.co/InosLihka/rhythm-env-meta-trained-sft-v3).
+
+| Condition | Random | Heuristic | **Distilled Qwen 3B** | Δ vs Heuristic | belief_MAE |
+|---|---|---|---|---|---|
+| **continuous in-distribution** | 0.393 | 0.463 | **0.574** | **+0.111** | **0.213** |
+| **continuous OOD** | 0.393 | 0.455 | **0.536** | **+0.081** | 0.265 |
+| discrete-3-profiles (legacy) | 0.426 | 0.455 | **0.507** | +0.052 | 0.415 |
+
+**Interpretation:**
+- The student wins on **all three** conditions, with the largest margin
+  on the meta-RL test condition (continuous in-dist, +0.111).
+- **`belief_MAE` 0.213 in-distribution matches the gpt-5.4 teacher (0.196)**
+  to within 0.02 — the inference skill transferred nearly perfectly via
+  SFT-prime distillation.
+- OOD margin (+0.081) on profiles the agent never saw demonstrates real
+  generalization, not memorization.
+- Discrete-3 belief_MAE (0.415) is weaker because the student was trained
+  on continuous profiles only. The action quality still wins (+0.052).
+
+### Teacher (gpt-5.4) ceiling — 150-episode reeval
+
+| Condition | Random | Heuristic | **gpt-5.4 Teacher** | belief_MAE |
+|---|---|---|---|---|
+| in-distribution (100 eps, seeds 0-99) | 0.402 | 0.449 | **0.611** *(100/100 wins)* | 0.196 |
+| OOD (50 eps, seeds 10000-10049) | 0.397 | 0.454 | **0.621** *(50/50 wins)* | 0.214 |
+
+The teacher beats heuristic by ~0.16 across both conditions — confirming
+the v2 grader cleanly distinguishes inference from reflex.
 
 The teacher generalizes — same +0.16 margin in-dist as OOD. Both within
 ~0.01 of each other. The hidden profile space we sample from clearly
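The `belief_MAE` column is a mean absolute error between the agent's emitted belief vector and the hidden ground-truth profile vector. A minimal sketch, assuming a plain per-dimension MAE (the example vectors below are illustrative, not eval data):

```python
def belief_mae(belief: list[float], profile: list[float]) -> float:
    """Mean absolute error between an emitted belief vector and the
    ground-truth hidden profile vector (lower is better)."""
    assert len(belief) == len(profile)
    return sum(abs(b - p) for b, p in zip(belief, profile)) / len(profile)

# Off by 0.1 and 0.2 on two of three dims -> MAE of 0.1.
print(round(belief_mae([0.5, 0.3, 0.8], [0.4, 0.5, 0.8]), 3))
```

Under this definition, baselines that emit no belief at all simply have no MAE to report, which is why the heuristic and random columns leave it blank.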
scripts/plot_v3_results.py:

@@ -0,0 +1,67 @@
+"""Generate the headline 'student vs baselines across conditions' bar chart
+from the v3 eval JSON. Output goes to plots/sft_v3_baseline_vs_trained.png.
+"""
+
+import json
+from collections import defaultdict
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import numpy as np
+
+EVAL_PATH = Path("outputs/sft-v3/eval_results_v2.json")
+OUT_PATH = Path("plots/sft_v3_baseline_vs_trained.png")
+
+
+def main() -> None:
+    rows = json.loads(EVAL_PATH.read_text())
+    agg: dict = defaultdict(lambda: defaultdict(list))
+    for r in rows:
+        agg[r["condition"]][r["strategy"]].append(r["final_score"])
+
+    # Order conditions for display
+    cond_order = [
+        ("continuous-in-distribution", "in-distribution"),
+        ("continuous-OOD (generalization)", "OOD generalization"),
+        ("discrete-3-profiles", "discrete-3-profiles"),
+    ]
+    strat_order = ["random", "heuristic", "model"]
+    strat_labels = ["Random", "Heuristic", "Distilled Qwen 3B"]
+    strat_colors = ["#888888", "#5B8FF9", "#5AD8A6"]
+
+    means = {strat: [] for strat in strat_order}
+    for cond_key, _ in cond_order:
+        for strat in strat_order:
+            scores = agg[cond_key][strat]
+            means[strat].append(sum(scores) / len(scores) if scores else 0.0)
+
+    fig, ax = plt.subplots(figsize=(8, 5))
+    x = np.arange(len(cond_order))
+    width = 0.27
+
+    for i, (strat, label, color) in enumerate(zip(strat_order, strat_labels, strat_colors)):
+        offset = (i - 1) * width
+        bars = ax.bar(x + offset, means[strat], width, label=label, color=color)
+        for bar in bars:
+            ax.text(
+                bar.get_x() + bar.get_width() / 2,
+                bar.get_height() + 0.005,
+                f"{bar.get_height():.3f}",
+                ha="center", va="bottom", fontsize=9,
+            )
+
+    ax.set_xlabel("Eval condition")
+    ax.set_ylabel("Final score (v2 grader, 0–1)")
+    ax.set_title("RhythmEnv: Distilled Qwen 3B beats heuristic on all 3 conditions")
+    ax.set_xticks(x)
+    ax.set_xticklabels([label for _, label in cond_order])
+    ax.set_ylim(0, max(max(v) for v in means.values()) * 1.15)
+    ax.grid(axis="y", alpha=0.3)
+    ax.legend(loc="upper right", framealpha=0.95)
+    plt.tight_layout()
+    plt.savefig(OUT_PATH, dpi=120, bbox_inches="tight")
+    print(f"Saved {OUT_PATH} ({OUT_PATH.stat().st_size // 1024} KB)")
+
+
+if __name__ == "__main__":
+    main()
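The aggregation step in `scripts/plot_v3_results.py` groups eval rows by condition and strategy before averaging. A toy run of that grouping logic with made-up rows (the `condition`/`strategy`/`final_score` keys match what the script reads from `eval_results_v2.json`; the scores here are invented):

```python
import json
from collections import defaultdict

# Made-up eval rows in the shape the script expects; values are not real.
rows = json.loads("""[
  {"condition": "continuous-in-distribution", "strategy": "model", "final_score": 0.6},
  {"condition": "continuous-in-distribution", "strategy": "model", "final_score": 0.5},
  {"condition": "continuous-in-distribution", "strategy": "heuristic", "final_score": 0.4}
]""")

# Same two-level grouping as main(): condition -> strategy -> [scores].
agg = defaultdict(lambda: defaultdict(list))
for r in rows:
    agg[r["condition"]][r["strategy"]].append(r["final_score"])

# Per-strategy mean for one condition, i.e. one group of bars in the chart.
means = {strat: sum(s) / len(s) for strat, s in agg["continuous-in-distribution"].items()}
print(means)
```

The nested `defaultdict` means missing condition/strategy pairs simply yield empty lists, which the script then plots as 0.0 rather than crashing.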