InosLihka committed on
Commit
dc5658d
·
1 Parent(s): efe2271

Acknowledge OpenEnv Rubric system conformance gap


We use a custom multi-component grader in _grade_episode that's
functionally equivalent to openenv.core.rubrics.WeightedSum (composable
weighted scoring, same independent components, same explicit weights)
but doesn't use the literal Rubric class. The Rubric API is per-step
(action, observation), while our grader reads aggregated episode-end
state — a clean refactor would use TrajectoryRubric.

Decision: don't refactor in the last 30 min before deadline. Risk of
breaking the live HF Space (which IS the submission) outweighs the
literal-conformance benefit. Documented as a v2 cleanup task in the
code docstring + iterations.md.
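For illustration, the functional equivalence claimed above can be sketched without importing OpenEnv at all. Everything below is a stand-in: `EpisodeState`, the component functions, and the crash_free weight (0.70) are hypothetical; only the efficiency (0.10) and belief_accuracy (0.20) weights come from the grader's docstring, and the real grader has six components, not three.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the episode-end state _grade_episode reads
# (per-step rewards buffer, crash count, terminal belief error).
@dataclass
class EpisodeState:
    step_rewards: List[float]
    crash_count: int
    final_belief_error: float  # distance of last-emitted belief from true profile

# Independent components, each mapping episode-end state to [0, 1].
def crash_free(s: EpisodeState) -> float:
    return 1.0 if s.crash_count == 0 else 0.0

def efficiency(s: EpisodeState) -> float:
    # Bounded normalized average reward, per the docstring.
    if not s.step_rewards:
        return 0.0
    avg = sum(s.step_rewards) / len(s.step_rewards)
    return max(0.0, min(1.0, avg))

def belief_accuracy(s: EpisodeState) -> float:
    # Agent that never emits a belief has maximal error and scores 0.
    return max(0.0, 1.0 - s.final_belief_error)

# WeightedSum-style composition: explicit weights over independent components.
def weighted_sum(components: Dict[str, Callable[[EpisodeState], float]],
                 weights: Dict[str, float]) -> Callable[[EpisodeState], float]:
    def grade(s: EpisodeState) -> float:
        return sum(weights[name] * fn(s) for name, fn in components.items())
    return grade

grade_episode = weighted_sum(
    {"efficiency": efficiency, "belief_accuracy": belief_accuracy,
     "crash_free": crash_free},
    # 0.10 / 0.20 are from the docstring; 0.70 is a placeholder weight.
    {"efficiency": 0.10, "belief_accuracy": 0.20, "crash_free": 0.70},
)
```

The point of the sketch: the composition layer (`weighted_sum`) is already isomorphic to a `WeightedSum` rubric; what differs is only that the components consume aggregated episode-end state rather than per-(action, observation) inputs.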

Files changed (2)
  1. docs/iterations.md +14 -0
  2. server/rhythm_environment.py +13 -0
docs/iterations.md CHANGED
@@ -407,3 +407,17 @@ Five rounds of GRPO patches couldn't beat heuristic because the grader
  didn't measure inference. Reading the model's reasoning surfaced the
  mismatch. Fixing the grader and switching to Algorithm Distillation got
  us a real result. The journey is the writeup.
+
+ ## Acknowledged gap: OpenEnv Rubric system
+
+ We don't literally use `openenv.core.rubrics.Rubric` / `WeightedSum`. Our
+ `_grade_episode` in `server/rhythm_environment.py` is functionally
+ equivalent (composable weighted multi-component scorer) but it reads
+ episode-end aggregated state (`_step_rewards`, `_crash_count`,
+ `_final_belief`) while the Rubric API expects per-(action, observation)
+ inputs. A clean refactor would use `TrajectoryRubric` for cumulative
+ components and per-step `Rubric` for crash_free / belief_accuracy.
+
+ Why not refactored: prioritized debugging mode collapse → bug fixes →
+ distillation pivot → eval bugs over the cosmetic conformance work.
+ Honest about it; v2 cleanup task.
server/rhythm_environment.py CHANGED
@@ -753,6 +753,19 @@ class RhythmEnvironment(Environment):
  0.10 — efficiency: bounded normalized average reward
  0.20 — belief_accuracy: how close last-emitted belief is to true profile
 
+ DESIGN NOTE — Acknowledged conformance gap with OpenEnv:
+ This grader is functionally equivalent to a `WeightedSum` Rubric
+ (from `openenv.core.rubrics`) over 6 child Rubrics — same
+ composability, same independent components, same explicit weights.
+ We did not refactor to literally use the Rubric class because the
+ grader reads aggregated episode-end state (per-step rewards buffer,
+ crash_count, terminal belief) while OpenEnv's `Rubric.forward`
+ expects per-(action, observation) inputs. A clean refactor would
+ use `TrajectoryRubric` for the cumulative components and the
+ per-step `Rubric` for crash_free / belief_accuracy. Tracked as
+ a v2 cleanup task; not blocking on the meta-RL skill we're
+ evaluating.
+
  belief_accuracy is the explicit meta-RL inference signal: an agent
  that doesn't emit a belief scores 0 here, and an agent that emits
  a belief close to the hidden profile vector scores up to 1. Without
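For reference, the v2 refactor sketched in the design note could take roughly this shape. The `Rubric` / `TrajectoryRubric` base classes below are minimal stand-ins, not OpenEnv's actual API; the observation keys (`reward`, `crashed`), the worst-step aggregation for per-step rubrics, and the weights are all assumptions for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple

Step = Tuple[Any, Any]  # one (action, observation) pair

class Rubric(ABC):
    """Stand-in for a per-step rubric: scores a single (action, observation)."""
    @abstractmethod
    def forward(self, action: Any, observation: Any) -> float: ...

class TrajectoryRubric(ABC):
    """Stand-in for a trajectory-level rubric: scores the whole episode."""
    @abstractmethod
    def forward(self, trajectory: List[Step]) -> float: ...

class CrashFree(Rubric):
    # Per-step: 1.0 unless this step's observation reports a crash.
    def forward(self, action: Any, observation: Any) -> float:
        return 0.0 if observation.get("crashed", False) else 1.0

class Efficiency(TrajectoryRubric):
    # Cumulative: bounded normalized average of per-step rewards.
    def forward(self, trajectory: List[Step]) -> float:
        rewards = [obs.get("reward", 0.0) for _, obs in trajectory]
        if not rewards:
            return 0.0
        return max(0.0, min(1.0, sum(rewards) / len(rewards)))

def grade(trajectory: List[Step],
          per_step: Dict[str, Rubric],
          traj_level: Dict[str, TrajectoryRubric],
          weights: Dict[str, float]) -> float:
    scores: Dict[str, float] = {}
    # Per-step rubrics are scored at every step; worst step is an assumed
    # aggregation rule (appropriate for crash_free, where one crash fails it).
    for name, rubric in per_step.items():
        scores[name] = min(rubric.forward(a, o) for a, o in trajectory)
    # Trajectory rubrics see the whole episode once.
    for name, rubric in traj_level.items():
        scores[name] = rubric.forward(trajectory)
    return sum(weights[n] * s for n, s in scores.items())
```

This split lets the episode-end buffers (`_step_rewards`, `_crash_count`) disappear: the trajectory itself carries that information, so each component becomes a pure function of either one step or the whole trajectory.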