InosLihka committed on
Commit
dc5658d
·
1 Parent(s): efe2271

Acknowledge OpenEnv Rubric system conformance gap


We use a custom multi-component grader in _grade_episode that's
functionally equivalent to openenv.core.rubrics.WeightedSum (composable
weighted scoring, same independent components, same explicit weights)
but doesn't use the literal Rubric class. The Rubric API is per-step
(action, observation), while our grader reads aggregated episode-end
state — a clean refactor would use TrajectoryRubric.

Decision: don't refactor in the last 30 min before deadline. Risk of
breaking the live HF Space (which IS the submission) outweighs the
literal-conformance benefit. Documented as a v2 cleanup task in the
code docstring + iterations.md.
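For illustration, the functional equivalence claimed above can be sketched without importing OpenEnv at all. Everything below is a stand-in: `EpisodeState`, the component functions, and the crash_free weight (0.70) are hypothetical; only the efficiency (0.10) and belief_accuracy (0.20) weights come from the grader's docstring, and the real grader has six components, not three.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the episode-end state _grade_episode reads
# (per-step rewards buffer, crash count, terminal belief error).
@dataclass
class EpisodeState:
    step_rewards: List[float]
    crash_count: int
    final_belief_error: float  # distance of last-emitted belief from true profile

# Independent components, each mapping episode-end state to [0, 1].
def crash_free(s: EpisodeState) -> float:
    return 1.0 if s.crash_count == 0 else 0.0

def efficiency(s: EpisodeState) -> float:
    # Bounded normalized average reward, per the docstring.
    if not s.step_rewards:
        return 0.0
    avg = sum(s.step_rewards) / len(s.step_rewards)
    return max(0.0, min(1.0, avg))

def belief_accuracy(s: EpisodeState) -> float:
    # Agent that never emits a belief has maximal error and scores 0.
    return max(0.0, 1.0 - s.final_belief_error)

# WeightedSum-style composition: explicit weights over independent components.
def weighted_sum(components: Dict[str, Callable[[EpisodeState], float]],
                 weights: Dict[str, float]) -> Callable[[EpisodeState], float]:
    def grade(s: EpisodeState) -> float:
        return sum(weights[name] * fn(s) for name, fn in components.items())
    return grade

grade_episode = weighted_sum(
    {"efficiency": efficiency, "belief_accuracy": belief_accuracy,
     "crash_free": crash_free},
    # 0.10 / 0.20 are from the docstring; 0.70 is a placeholder weight.
    {"efficiency": 0.10, "belief_accuracy": 0.20, "crash_free": 0.70},
)
```

The point of the sketch: the composition layer (`weighted_sum`) is already isomorphic to a `WeightedSum` rubric; what differs is only that the components consume aggregated episode-end state rather than per-(action, observation) inputs.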

Files changed (2)
  1. docs/iterations.md +14 -0
  2. server/rhythm_environment.py +13 -0
docs/iterations.md CHANGED
@@ -407,3 +407,17 @@ Five rounds of GRPO patches couldn't beat heuristic because the grader
  didn't measure inference. Reading the model's reasoning surfaced the
  mismatch. Fixing the grader and switching to Algorithm Distillation got
  us a real result. The journey is the writeup.
+
+ ## Acknowledged gap: OpenEnv Rubric system
+
+ We don't literally use `openenv.core.rubrics.Rubric` / `WeightedSum`. Our
+ `_grade_episode` in `server/rhythm_environment.py` is functionally
+ equivalent (composable weighted multi-component scorer) but it reads
+ episode-end aggregated state (`_step_rewards`, `_crash_count`,
+ `_final_belief`) while the Rubric API expects per-(action, observation)
+ inputs. A clean refactor would use `TrajectoryRubric` for cumulative
+ components and per-step `Rubric` for crash_free / belief_accuracy.
+
+ Why not refactored: prioritized debugging mode collapse → bug fixes →
+ distillation pivot → eval bugs over the cosmetic conformance work.
+ Honest about it; v2 cleanup task.
server/rhythm_environment.py CHANGED
@@ -753,6 +753,19 @@ class RhythmEnvironment(Environment):
  0.10 — efficiency: bounded normalized average reward
  0.20 — belief_accuracy: how close last-emitted belief is to true profile
 
+ DESIGN NOTE — Acknowledged conformance gap with OpenEnv:
+ This grader is functionally equivalent to a `WeightedSum` Rubric
+ (from `openenv.core.rubrics`) over 6 child Rubrics — same
+ composability, same independent components, same explicit weights.
+ We did not refactor to literally use the Rubric class because the
+ grader reads aggregated episode-end state (per-step rewards buffer,
+ crash_count, terminal belief) while OpenEnv's `Rubric.forward`
+ expects per-(action, observation) inputs. A clean refactor would
+ use `TrajectoryRubric` for the cumulative components and the
+ per-step `Rubric` for crash_free / belief_accuracy. Tracked as
+ a v2 cleanup task; not blocking on the meta-RL skill we're
+ evaluating.
+
  belief_accuracy is the explicit meta-RL inference signal: an agent
  that doesn't emit a belief scores 0 here, and an agent that emits
  a belief close to the hidden profile vector scores up to 1. Without
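For reference, the v2 refactor sketched in the design note could take roughly this shape. The `Rubric` / `TrajectoryRubric` base classes below are minimal stand-ins, not OpenEnv's actual API; the observation keys (`reward`, `crashed`), the worst-step aggregation for per-step rubrics, and the weights are all assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple

Step = Tuple[Any, Any]  # one (action, observation) pair

class Rubric(ABC):
    """Stand-in for a per-step rubric: scores a single (action, observation)."""
    @abstractmethod
    def forward(self, action: Any, observation: Any) -> float: ...

class TrajectoryRubric(ABC):
    """Stand-in for a trajectory-level rubric: scores the whole episode."""
    @abstractmethod
    def forward(self, trajectory: List[Step]) -> float: ...

class CrashFree(Rubric):
    # Per-step: 1.0 unless this step's observation reports a crash.
    def forward(self, action: Any, observation: Any) -> float:
        return 0.0 if observation.get("crashed", False) else 1.0

class Efficiency(TrajectoryRubric):
    # Cumulative: bounded normalized average of per-step rewards.
    def forward(self, trajectory: List[Step]) -> float:
        rewards = [obs.get("reward", 0.0) for _, obs in trajectory]
        if not rewards:
            return 0.0
        return max(0.0, min(1.0, sum(rewards) / len(rewards)))

def grade(trajectory: List[Step],
          per_step: Dict[str, Rubric],
          traj_level: Dict[str, TrajectoryRubric],
          weights: Dict[str, float]) -> float:
    scores: Dict[str, float] = {}
    # Per-step rubrics are scored at every step; worst step is an assumed
    # aggregation rule (appropriate for crash_free, where one crash fails it).
    for name, rubric in per_step.items():
        scores[name] = min(rubric.forward(a, o) for a, o in trajectory)
    # Trajectory rubrics see the whole episode once.
    for name, rubric in traj_level.items():
        scores[name] = rubric.forward(trajectory)
    return sum(weights[n] * s for n, s in scores.items())
```

This split lets the episode-end buffers (`_step_rewards`, `_crash_count`) disappear: the trajectory itself carries that information, so each component becomes a pure function of either one step or the whole trajectory.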