Space: samrat-rm/WhyDidItFail
samrat-rm committed 13 days ago

Commit 89b370c

1 Parent(s): f58e721

fix: implementing strict prompt conditions for scores/reward to be in 0.0–1.0 range

Files changed (1)

openenv.yaml CHANGED

@@ -37,6 +37,7 @@ tasks:
         - 0.30: Wrong label but description matches a related concept
         - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
         - If in doubt, return 0.15. Only return one of the four values listed above.
+        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
   - id: task_medium
     difficulty: medium
@@ -69,6 +70,7 @@ tasks:
         - 0.30: Wrong label but description matches a related concept
         - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
         - If in doubt, return 0.15. Only return one of the four values listed above.
+        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
   - id: task_hard
     difficulty: hard
@@ -103,4 +105,5 @@ tasks:
         - 0.50: Correct failure mode but fix is vague, wrong, or missing
         - 0.20: Wrong failure mode but fix is incidentally relevant
         - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response
-        - If in doubt, return 0.15. Only return one of the five values listed above.
+        - If in doubt, return 0.15. Only return one of the five values listed above.
+        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
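
The prompt rules in this diff could also be enforced on the consumer side, in case the grader model ignores them. The following is a minimal sketch, not part of this commit: `validate_score` and `VISIBLE_HARD_TIERS` are hypothetical names, and the tier tuple lists only the `task_hard` values visible in the hunk above (the rubric's higher tiers fall outside the diff context and would need to be added).

```python
import math

# Hypothetical: only the task_hard tiers visible in the diff context.
# The full rubric has five values; the remaining ones are not shown here.
VISIBLE_HARD_TIERS = (0.50, 0.20, 0.15)


def validate_score(raw, allowed, fallback=0.15):
    """Return the matching rubric tier, or the fallback for anything else.

    Mirrors the prompt rules: only listed values are valid, and anything
    off-list (including 0, 0.0, 1, and 1.0) falls back to 0.15.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Empty or non-numeric grader output.
        return fallback
    for tier in allowed:
        if math.isclose(value, tier, abs_tol=1e-9):
            return tier
    return fallback


if __name__ == "__main__":
    for candidate in ("0.5", 1.0, 0, None, "n/a"):
        print(candidate, "->", validate_score(candidate, VISIBLE_HARD_TIERS))
```

Falling back to 0.15 instead of raising keeps the behavior consistent with the rubric's own "If in doubt, return 0.15" rule.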