Acknowledge OpenEnv Rubric system conformance gap
We use a custom multi-component grader in _grade_episode that's
functionally equivalent to openenv.core.rubrics.WeightedSum (composable
weighted scoring, same independent components, same explicit weights)
but doesn't use the literal Rubric class. The Rubric API is per-step
(action, observation), while our grader reads aggregated episode-end
state; a clean refactor would use TrajectoryRubric.
Decision: don't refactor in the last 30 min before deadline. Risk of
breaking the live HF Space (which IS the submission) outweighs the
literal-conformance benefit. Documented as a v2 cleanup task in the
code docstring + iterations.md.
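
To make the "functionally equivalent to a weighted sum" claim concrete, here is a minimal sketch of an episode-end weighted grader. It is not the code in `server/rhythm_environment.py` and it does not use the OpenEnv classes. `EpisodeSummary`, the component formulas, and the `crash_free` weight of 0.70 are assumptions for illustration; only the 0.10 (efficiency) and 0.20 (belief_accuracy) weights come from the diff below, and the real grader has six components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class EpisodeSummary:
    """Hypothetical snapshot of the episode-end state a grader would read."""
    step_rewards: List[float]
    crash_count: int
    final_belief: Optional[List[float]]  # None if the agent never emitted one
    true_profile: List[float]


def _clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def efficiency(ep: EpisodeSummary) -> float:
    # Assumed formula: average step reward, bounded to [0, 1].
    if not ep.step_rewards:
        return 0.0
    return _clamp01(sum(ep.step_rewards) / len(ep.step_rewards))


def crash_free(ep: EpisodeSummary) -> float:
    # Assumed formula: all-or-nothing on the crash counter.
    return 1.0 if ep.crash_count == 0 else 0.0


def belief_accuracy(ep: EpisodeSummary) -> float:
    # No belief emitted scores 0; a belief close to the true profile scores up to 1.
    # The distance measure (mean absolute error) is an assumption.
    if ep.final_belief is None or len(ep.final_belief) != len(ep.true_profile):
        return 0.0
    if not ep.true_profile:
        return 0.0
    err = sum(abs(b - t) for b, t in zip(ep.final_belief, ep.true_profile))
    return _clamp01(1.0 - err / len(ep.true_profile))


COMPONENTS: Dict[str, Callable[[EpisodeSummary], float]] = {
    "efficiency": efficiency,
    "belief_accuracy": belief_accuracy,
    "crash_free": crash_free,
}
# 0.10 and 0.20 appear in the diff; 0.70 is a placeholder for the components not shown.
WEIGHTS: Dict[str, float] = {"efficiency": 0.10, "belief_accuracy": 0.20, "crash_free": 0.70}


def grade_episode(ep: EpisodeSummary,
                  components: Dict[str, Callable[[EpisodeSummary], float]] = COMPONENTS,
                  weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of independent component scores, each in [0, 1]."""
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * fn(ep) for name, fn in components.items())


with_belief = EpisodeSummary([0.5, 0.5], 0, [0.2, 0.8], [0.2, 0.8])
without_belief = EpisodeSummary([0.5, 0.5], 0, None, [0.2, 0.8])
score_with_belief = grade_episode(with_belief)
score_without_belief = grade_episode(without_belief)
```

Because each component is an independent function of the same summary object, swapping one out or changing a weight does not touch the others, which is the composability property the commit message refers to.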
- docs/iterations.md +14 -0
- server/rhythm_environment.py +13 -0
docs/iterations.md
CHANGED
@@ -407,3 +407,17 @@ Five rounds of GRPO patches couldn't beat heuristic because the grader
 didn't measure inference. Reading the model's reasoning surfaced the
 mismatch. Fixing the grader and switching to Algorithm Distillation got
 us a real result. The journey is the writeup.
+
+## Acknowledged gap: OpenEnv Rubric system
+
+We don't literally use `openenv.core.rubrics.Rubric` / `WeightedSum`. Our
+`_grade_episode` in `server/rhythm_environment.py` is functionally
+equivalent (composable weighted multi-component scorer) but it reads
+episode-end aggregated state (`_step_rewards`, `_crash_count`,
+`_final_belief`) while the Rubric API expects per-(action, observation)
+inputs. A clean refactor would use `TrajectoryRubric` for cumulative
+components and per-step `Rubric` for crash_free / belief_accuracy.
+
+Why not refactored: prioritized debugging mode collapse → bug fixes →
+distillation pivot → eval bugs over the cosmetic conformance work.
+Honest about it; v2 cleanup task.
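
The mismatch described above (a per-step `(action, observation)` interface versus a grader that reads episode-end state) can be bridged by buffering steps and scoring once the episode finishes. The sketch below illustrates that idea only. `TrajectoryScorer` is a hypothetical stand-in, not OpenEnv's `TrajectoryRubric`, whose actual API is not shown in this commit; the dict-shaped observation with `reward` and `done` keys is likewise an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[Any, Dict[str, Any]]  # (action, observation); the shape is assumed


class TrajectoryScorer:
    """Hypothetical adapter: accepts per-step calls, scores at episode end."""

    def __init__(self, score_fn: Callable[[List[Step]], float]) -> None:
        self._score_fn = score_fn
        self._steps: List[Step] = []

    def forward(self, action: Any, observation: Dict[str, Any]) -> float:
        # Buffer every step; emit a score only when the episode is done.
        self._steps.append((action, observation))
        if observation.get("done", False):
            return self._score_fn(self._steps)
        return 0.0

    def reset(self) -> None:
        self._steps.clear()


def mean_reward_score(steps: List[Step]) -> float:
    # Cumulative component (assumed formula): average reward bounded to [0, 1].
    rewards = [obs.get("reward", 0.0) for _, obs in steps]
    if not rewards:
        return 0.0
    return max(0.0, min(1.0, sum(rewards) / len(rewards)))


scorer = TrajectoryScorer(mean_reward_score)
episode = [
    ("act_a", {"reward": 0.2, "done": False}),
    ("act_b", {"reward": 0.4, "done": False}),
    ("act_c", {"reward": 0.6, "done": True}),
]
per_step_outputs = [scorer.forward(action, obs) for action, obs in episode]
```

Intermediate calls return 0.0 and only the final call carries the episode score, so a weighted combination over such scorers would see the same values an episode-end grader computes.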
server/rhythm_environment.py
CHANGED
@@ -753,6 +753,19 @@ class RhythmEnvironment(Environment):
 0.10 - efficiency: bounded normalized average reward
 0.20 - belief_accuracy: how close last-emitted belief is to true profile
 
+DESIGN NOTE - Acknowledged conformance gap with OpenEnv:
+This grader is functionally equivalent to a `WeightedSum` Rubric
+(from `openenv.core.rubrics`) over 6 child Rubrics: same
+composability, same independent components, same explicit weights.
+We did not refactor to use the Rubric class literal because the
+grader reads aggregated episode-end state (per-step rewards buffer,
+crash_count, terminal belief) while OpenEnv's `Rubric.forward`
+expects per-(action, observation) inputs. A clean refactor would
+use `TrajectoryRubric` for the cumulative components and the
+per-step `Rubric` for crash_free / belief_accuracy. Tracked as
+a v2 cleanup task; not blocking on the meta-RL skill we're
+evaluating.
+
 belief_accuracy is the explicit meta-RL inference signal: an agent
 that doesn't emit a belief scores 0 here, and an agent that emits
 a belief close to the hidden profile vector scores up to 1. Without