Commit · 124a8a4
1 Parent(s): 2b6a536
modified eval prompt
eval/evaluate.py +3 -3
eval/evaluate.py
CHANGED
@@ -7,7 +7,7 @@ from models import Correctness, JudgementResult
 
 # from: https://github.com/centerforaisafety/hle/blob/7b6be5aad6f9b43af3857de7867f3b52f6e4acb3/hle_eval/run_judge_results.py#L16-L33
 GRADER_TEMPLATE = """
-Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.
+Judge whether the following [response] to [question] is correct or not based on if the [response] includes the precise and unambiguous [correct_answer] below.
 
 [question]: {question}
 
@@ -19,9 +19,9 @@ extracted_final_answer: The final exact answer extracted from the [response]. Pu
 
 [correct_answer]: {correct_answer}
 
-reasoning: Explain why the
+reasoning: Explain why the [correct_answer] is included or not included in the extracted_final_answer, focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.
 
-correct: Answer 'yes' if
+correct: Answer 'yes' if [correct_answer] is included in the extracted_final_answer given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
 
 
 confidence: The extracted confidence score between 0|%| and 100|%| from [response]. Put 100 if there is no confidence score available.
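For illustration only, here is a minimal sketch of how a template like this might be filled in for one evaluation example. It is not code from eval/evaluate.py: `TEMPLATE_SKETCH` is an abbreviated stand-in for `GRADER_TEMPLATE`, the `{response}` placeholder is assumed (it falls in lines the diff does not show), and `build_grader_prompt` is a hypothetical helper.

```python
# Hypothetical sketch: TEMPLATE_SKETCH and build_grader_prompt are illustrative
# names, and the {response} placeholder is an assumption. Only {question} and
# {correct_answer} are visible in the diff.

# Abbreviated stand-in for GRADER_TEMPLATE; the real template has more fields.
TEMPLATE_SKETCH = """
Judge whether the following [response] to [question] is correct or not based on if the [response] includes the precise and unambiguous [correct_answer] below.

[question]: {question}

[response]: {response}

[correct_answer]: {correct_answer}
"""


def build_grader_prompt(question: str, response: str, correct_answer: str) -> str:
    """Fill the template placeholders with one evaluation example."""
    return TEMPLATE_SKETCH.format(
        question=question,
        response=response,
        correct_answer=correct_answer,
    )


# The response wraps the expected answer in extra text.
prompt = build_grader_prompt(
    question="What is the capital of France?",
    response="The capital is Paris, which is also its largest city.",
    correct_answer="Paris",
)
print(prompt)
```

As the new wording reads, a response like the one above, which contains the expected answer alongside extra text, is meant to be judged on whether it includes [correct_answer]. How the judge model actually behaves still depends on the model.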