fix: enforce strict prompt conditions so scores/rewards stay strictly between 0.0 and 1.0
openenv.yaml CHANGED (+4 -1)
@@ -37,6 +37,7 @@ tasks:
 - 0.30: Wrong label but description matches a related concept
 - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
 - If in doubt, return 0.15. Only return one of the four values listed above.
+- CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
 
 - id: task_medium
 difficulty: medium
@@ -69,6 +70,7 @@ tasks:
 - 0.30: Wrong label but description matches a related concept
 - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
 - If in doubt, return 0.15. Only return one of the four values listed above.
+- CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
 
 - id: task_hard
 difficulty: hard
@@ -103,4 +105,5 @@ tasks:
 - 0.50: Correct failure mode but fix is vague, wrong, or missing
 - 0.20: Wrong failure mode but fix is incidentally relevant
 - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response
-- If in doubt, return 0.15. Only return one of the five values listed above.
+- If in doubt, return 0.15. Only return one of the five values listed above.
+- CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
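
The prompt rule in this change could also be backed by a check on the consumer side. What follows is a minimal Python sketch and is not part of this commit: `normalize_score`, `FALLBACK_SCORE`, and the `allowed` argument are hypothetical names, and the example tuple holds only the task_hard values visible in this diff (the full rubric lists five). It assumes the grader's reply arrives as a string or number and falls back to 0.15 when the value is missing, unparseable, outside the open interval (0, 1), or not one of the listed values.

```python
import math

# The rubric's "if in doubt" value, taken from the prompt text in the diff.
FALLBACK_SCORE = 0.15


def normalize_score(raw, allowed, fallback=FALLBACK_SCORE):
    """Return the matching rubric value for `raw`, or `fallback` if it is invalid.

    `allowed` is the set of scores the rubric permits for one task
    (hypothetical parameter; supply the full list from openenv.yaml).
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Empty, None, or non-numeric reply.
        return fallback

    # Reject NaN/inf and the boundary values 0 and 1 that the prompt forbids.
    if not math.isfinite(value) or value <= 0.0 or value >= 1.0:
        return fallback

    # Accept only values that match a rubric entry (tolerating float noise).
    for candidate in allowed:
        if math.isclose(value, candidate, abs_tol=1e-9):
            return candidate
    return fallback


# Only the task_hard values visible in this diff; the real rubric has more.
VISIBLE_HARD_SCORES = (0.50, 0.20, 0.15)

if __name__ == "__main__":
    for reply in ("0.5", "1.0", 0, "n/a", None):
        print(repr(reply), "->", normalize_score(reply, VISIBLE_HARD_SCORES))
```

Keeping the allowed values as an argument lets each task pass its own rubric, so the check does not need to know how many values a task defines.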