akseljoonas HF Staff committed on
Commit 124a8a4 · 1 Parent(s): 2b6a536

modified eval prompt

Files changed (1)
  1. eval/evaluate.py +3 -3
eval/evaluate.py CHANGED
@@ -7,7 +7,7 @@ from models import Correctness, JudgementResult
 
 # from: https://github.com/centerforaisafety/hle/blob/7b6be5aad6f9b43af3857de7867f3b52f6e4acb3/hle_eval/run_judge_results.py#L16-L33
 GRADER_TEMPLATE = """
-Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.
+Judge whether the following [response] to [question] is correct or not based on if the [response] includes the precise and unambiguous [correct_answer] below.
 
 [question]: {question}
 
@@ -19,9 +19,9 @@ extracted_final_answer: The final exact answer extracted from the [response]. Pu
 
 [correct_answer]: {correct_answer}
 
-reasoning: Explain why the extracted_final_answer is correct or incorrect based on [correct_answer], focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.
+reasoning: Explain why the [correct_answer] is included or not included in the extracted_final_answer, focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.
 
-correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
+correct: Answer 'yes' if [correct_answer] is included in the extracted_final_answer given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
 
 
 confidence: The extracted confidence score between 0|%| and 100|%| from [response]. Put 100 if there is no confidence score available.
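
For orientation, here is a minimal sketch of how a template like this might be filled in before it is sent to the judge. It assumes the placeholders are substituted with `str.format`; the call site is not part of this diff, so `SHORT_TEMPLATE` and `build_grader_prompt` are hypothetical names, and the template is a shortened stand-in that uses only the two placeholders visible above.

```python
# Hypothetical, shortened stand-in for GRADER_TEMPLATE. Only the placeholders
# visible in the diff ({question}, {correct_answer}) are used here.
SHORT_TEMPLATE = """
Judge whether the following [response] to [question] is correct or not based on if the [response] includes the precise and unambiguous [correct_answer] below.

[question]: {question}

[correct_answer]: {correct_answer}
"""


def build_grader_prompt(question: str, correct_answer: str) -> str:
    """Fill the template with str.format (an assumption; the real call site is not shown)."""
    return SHORT_TEMPLATE.format(question=question, correct_answer=correct_answer)


prompt = build_grader_prompt("What is 2 + 2?", "4")
print(prompt)
```

The bracketed labels such as `[question]` are plain text and pass through `str.format` unchanged; only the brace-delimited fields are substituted.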