Harley-ml committed · Commit f5246b9 · verified · 1 Parent(s): 1a2f7a2

Upload raw_lmharness_eval_output.txt

Files changed (1)
  1. raw_lmharness_eval_output.txt +208 -0
raw_lmharness_eval_output.txt ADDED
@@ -0,0 +1,208 @@
+ ### 0-Shot
+
+ | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
+ |---------------------------------------|------:|----------------|-----:|-----------|---|------:|---|-----:|
+ |anli_r1 | 1|none | 0|acc |↑ | 0.3690|± |0.0153|
+ |anli_r2 | 1|none | 0|acc |↑ | 0.3310|± |0.0149|
+ |anli_r3 | 1|none | 0|acc |↑ | 0.3233|± |0.0135|
+ |arc_challenge | 1|none | 0|acc |↑ | 0.1809|± |0.0112|
+ | | |none | 0|acc_norm |↑ | 0.2210|± |0.0121|
+ |arc_easy | 1|none | 0|acc |↑ | 0.3283|± |0.0096|
+ | | |none | 0|acc_norm |↑ | 0.3194|± |0.0096|
+ |gsm8k | 3|flexible-extract| 5|exact_match|↑ | 0.0114|± |0.0029|
+ | | |strict-match | 5|exact_match|↑ | 0.0015|± |0.0011|
+ |hellaswag | 1|none | 0|acc |↑ | 0.2649|± |0.0044|
+ | | |none | 0|acc_norm |↑ | 0.2677|± |0.0044|
+ |mmlu | 2|none | |acc |↑ | 0.2300|± |0.0035|
+ | - humanities | 2|none | |acc |↑ | 0.2429|± |0.0063|
+ | - formal_logic | 1|none | 0|acc |↑ | 0.2381|± |0.0381|
+ | - high_school_european_history | 1|none | 0|acc |↑ | 0.2424|± |0.0335|
+ | - high_school_us_history | 1|none | 0|acc |↑ | 0.2402|± |0.0300|
+ | - high_school_world_history | 1|none | 0|acc |↑ | 0.2743|± |0.0290|
+ | - international_law | 1|none | 0|acc |↑ | 0.2314|± |0.0385|
+ | - jurisprudence | 1|none | 0|acc |↑ | 0.2685|± |0.0428|
+ | - logical_fallacies | 1|none | 0|acc |↑ | 0.2270|± |0.0329|
+ | - moral_disputes | 1|none | 0|acc |↑ | 0.2514|± |0.0234|
+ | - moral_scenarios | 1|none | 0|acc |↑ | 0.2380|± |0.0142|
+ | - philosophy | 1|none | 0|acc |↑ | 0.1961|± |0.0226|
+ | - prehistory | 1|none | 0|acc |↑ | 0.2191|± |0.0230|
+ | - professional_law | 1|none | 0|acc |↑ | 0.2464|± |0.0110|
+ | - world_religions | 1|none | 0|acc |↑ | 0.3216|± |0.0358|
+ | - other | 2|none | |acc |↑ | 0.2350|± |0.0076|
+ | - business_ethics | 1|none | 0|acc |↑ | 0.3000|± |0.0461|
+ | - clinical_knowledge | 1|none | 0|acc |↑ | 0.2113|± |0.0251|
+ | - college_medicine | 1|none | 0|acc |↑ | 0.2081|± |0.0310|
+ | - global_facts | 1|none | 0|acc |↑ | 0.1700|± |0.0378|
+ | - human_aging | 1|none | 0|acc |↑ | 0.2870|± |0.0304|
+ | - management | 1|none | 0|acc |↑ | 0.1748|± |0.0376|
+ | - marketing | 1|none | 0|acc |↑ | 0.2821|± |0.0295|
+ | - medical_genetics | 1|none | 0|acc |↑ | 0.3000|± |0.0461|
+ | - miscellaneous | 1|none | 0|acc |↑ | 0.2452|± |0.0154|
+ | - nutrition | 1|none | 0|acc |↑ | 0.1993|± |0.0229|
+ | - professional_accounting | 1|none | 0|acc |↑ | 0.2234|± |0.0248|
+ | - professional_medicine | 1|none | 0|acc |↑ | 0.1801|± |0.0233|
+ | - virology | 1|none | 0|acc |↑ | 0.2892|± |0.0353|
+ | - social sciences | 2|none | |acc |↑ | 0.2168|± |0.0074|
+ | - econometrics | 1|none | 0|acc |↑ | 0.2368|± |0.0400|
+ | - high_school_geography | 1|none | 0|acc |↑ | 0.1768|± |0.0272|
+ | - high_school_government_and_politics| 1|none | 0|acc |↑ | 0.1969|± |0.0287|
+ | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.2103|± |0.0207|
+ | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.2059|± |0.0263|
+ | - high_school_psychology | 1|none | 0|acc |↑ | 0.1890|± |0.0168|
+ | - human_sexuality | 1|none | 0|acc |↑ | 0.2519|± |0.0381|
+ | - professional_psychology | 1|none | 0|acc |↑ | 0.2467|± |0.0174|
+ | - public_relations | 1|none | 0|acc |↑ | 0.2182|± |0.0396|
+ | - security_studies | 1|none | 0|acc |↑ | 0.1918|± |0.0252|
+ | - sociology | 1|none | 0|acc |↑ | 0.2438|± |0.0304|
+ | - us_foreign_policy | 1|none | 0|acc |↑ | 0.2900|± |0.0456|
+ | - stem | 2|none | |acc |↑ | 0.2185|± |0.0073|
+ | - abstract_algebra | 1|none | 0|acc |↑ | 0.1900|± |0.0394|
+ | - anatomy | 1|none | 0|acc |↑ | 0.2296|± |0.0363|
+ | - astronomy | 1|none | 0|acc |↑ | 0.1711|± |0.0306|
+ | - college_biology | 1|none | 0|acc |↑ | 0.2569|± |0.0365|
+ | - college_chemistry | 1|none | 0|acc |↑ | 0.1800|± |0.0386|
+ | - college_computer_science | 1|none | 0|acc |↑ | 0.2400|± |0.0429|
+ | - college_mathematics | 1|none | 0|acc |↑ | 0.2100|± |0.0409|
+ | - college_physics | 1|none | 0|acc |↑ | 0.2255|± |0.0416|
+ | - computer_security | 1|none | 0|acc |↑ | 0.3000|± |0.0461|
+ | - conceptual_physics | 1|none | 0|acc |↑ | 0.2681|± |0.0290|
+ | - electrical_engineering | 1|none | 0|acc |↑ | 0.2414|± |0.0357|
+ | - elementary_mathematics | 1|none | 0|acc |↑ | 0.2090|± |0.0209|
+ | - high_school_biology | 1|none | 0|acc |↑ | 0.1871|± |0.0222|
+ | - high_school_chemistry | 1|none | 0|acc |↑ | 0.1921|± |0.0277|
+ | - high_school_computer_science | 1|none | 0|acc |↑ | 0.2700|± |0.0446|
+ | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2222|± |0.0253|
+ | - high_school_physics | 1|none | 0|acc |↑ | 0.1987|± |0.0326|
+ | - high_school_statistics | 1|none | 0|acc |↑ | 0.1528|± |0.0245|
+ | - machine_learning | 1|none | 0|acc |↑ | 0.3214|± |0.0443|
+ |piqa | 1|none | 0|acc |↑ | 0.5544|± |0.0116|
+ | | |none | 0|acc_norm |↑ | 0.5571|± |0.0116|
+ |swag | 1|none | 0|acc |↑ | 0.3024|± |0.0032|
+ | | |none | 0|acc_norm |↑ | 0.3297|± |0.0033|
+ |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.2399|± |0.0149|
+ | | |none | 0|bleu_diff |↑ |-1.2697|± |0.3469|
+ | | |none | 0|bleu_max |↑ |10.7605|± |0.5507|
+ | | |none | 0|rouge1_acc |↑ | 0.2864|± |0.0158|
+ | | |none | 0|rouge1_diff|↑ |-2.4981|± |0.4567|
+ | | |none | 0|rouge1_max |↑ |22.1008|± |0.8090|
+ | | |none | 0|rouge2_acc |↑ | 0.0979|± |0.0104|
+ | | |none | 0|rouge2_diff|↑ |-1.7592|± |0.4683|
+ | | |none | 0|rouge2_max |↑ |11.8332|± |0.7556|
+ | | |none | 0|rougeL_acc |↑ | 0.2815|± |0.0157|
+ | | |none | 0|rougeL_diff|↑ |-2.2800|± |0.4424|
+ | | |none | 0|rougeL_max |↑ |20.7733|± |0.7985|
+ |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2705|± |0.0156|
+ |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4591|± |0.0159|
+
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.2300|± |0.0035|
+ | - humanities | 2|none | |acc |↑ |0.2429|± |0.0063|
+ | - other | 2|none | |acc |↑ |0.2350|± |0.0076|
+ | - social sciences| 2|none | |acc |↑ |0.2168|± |0.0074|
+ | - stem | 2|none | |acc |↑ |0.2185|± |0.0073|
+
+ ### 5-Shot
+
+ fewshot: 5, batch_size: 16
+ | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
+ |---------------------------------------|------:|----------------|-----:|-----------|---|------:|---|-----:|
+ |anli_r1 | 1|none | 5|acc |↑ | 0.3500|± |0.0151|
+ |anli_r2 | 1|none | 5|acc |↑ | 0.3340|± |0.0149|
+ |anli_r3 | 1|none | 5|acc |↑ | 0.3250|± |0.0135|
+ |arc_challenge | 1|none | 5|acc |↑ | 0.1843|± |0.0113|
+ | | |none | 5|acc_norm |↑ | 0.2184|± |0.0121|
+ |arc_easy | 1|none | 5|acc |↑ | 0.3380|± |0.0097|
+ | | |none | 5|acc_norm |↑ | 0.3215|± |0.0096|
+ |gsm8k | 3|flexible-extract| 5|exact_match|↑ | 0.0114|± |0.0029|
+ | | |strict-match | 5|exact_match|↑ | 0.0015|± |0.0011|
+ |hellaswag | 1|none | 5|acc |↑ | 0.2644|± |0.0044|
+ | | |none | 5|acc_norm |↑ | 0.2657|± |0.0044|
+ |mmlu | 2|none | |acc |↑ | 0.2413|± |0.0036|
+ | - humanities | 2|none | |acc |↑ | 0.2446|± |0.0063|
+ | - formal_logic | 1|none | 5|acc |↑ | 0.1825|± |0.0346|
+ | - high_school_european_history | 1|none | 5|acc |↑ | 0.2364|± |0.0332|
+ | - high_school_us_history | 1|none | 5|acc |↑ | 0.2108|± |0.0286|
+ | - high_school_world_history | 1|none | 5|acc |↑ | 0.2447|± |0.0280|
+ | - international_law | 1|none | 5|acc |↑ | 0.2727|± |0.0407|
+ | - jurisprudence | 1|none | 5|acc |↑ | 0.2593|± |0.0424|
+ | - logical_fallacies | 1|none | 5|acc |↑ | 0.2699|± |0.0349|
+ | - moral_disputes | 1|none | 5|acc |↑ | 0.2688|± |0.0239|
+ | - moral_scenarios | 1|none | 5|acc |↑ | 0.2380|± |0.0142|
+ | - philosophy | 1|none | 5|acc |↑ | 0.2283|± |0.0238|
+ | - prehistory | 1|none | 5|acc |↑ | 0.2809|± |0.0250|
+ | - professional_law | 1|none | 5|acc |↑ | 0.2464|± |0.0110|
+ | - world_religions | 1|none | 5|acc |↑ | 0.2164|± |0.0316|
+ | - other | 2|none | |acc |↑ | 0.2288|± |0.0075|
+ | - business_ethics | 1|none | 5|acc |↑ | 0.2400|± |0.0429|
+ | - clinical_knowledge | 1|none | 5|acc |↑ | 0.2189|± |0.0254|
+ | - college_medicine | 1|none | 5|acc |↑ | 0.2139|± |0.0313|
+ | - global_facts | 1|none | 5|acc |↑ | 0.1800|± |0.0386|
+ | - human_aging | 1|none | 5|acc |↑ | 0.2018|± |0.0269|
+ | - management | 1|none | 5|acc |↑ | 0.1748|± |0.0376|
+ | - marketing | 1|none | 5|acc |↑ | 0.1880|± |0.0256|
+ | - medical_genetics | 1|none | 5|acc |↑ | 0.3000|± |0.0461|
+ | - miscellaneous | 1|none | 5|acc |↑ | 0.2490|± |0.0155|
+ | - nutrition | 1|none | 5|acc |↑ | 0.2451|± |0.0246|
+ | - professional_accounting | 1|none | 5|acc |↑ | 0.2624|± |0.0262|
+ | - professional_medicine | 1|none | 5|acc |↑ | 0.1801|± |0.0233|
+ | - virology | 1|none | 5|acc |↑ | 0.2651|± |0.0344|
+ | - social sciences | 2|none | |acc |↑ | 0.2317|± |0.0076|
+ | - econometrics | 1|none | 5|acc |↑ | 0.2632|± |0.0414|
+ | - high_school_geography | 1|none | 5|acc |↑ | 0.2475|± |0.0307|
+ | - high_school_government_and_politics| 1|none | 5|acc |↑ | 0.2383|± |0.0307|
+ | - high_school_macroeconomics | 1|none | 5|acc |↑ | 0.2103|± |0.0207|
+ | - high_school_microeconomics | 1|none | 5|acc |↑ | 0.2689|± |0.0288|
+ | - high_school_psychology | 1|none | 5|acc |↑ | 0.1982|± |0.0171|
+ | - human_sexuality | 1|none | 5|acc |↑ | 0.2443|± |0.0377|
+ | - professional_psychology | 1|none | 5|acc |↑ | 0.2484|± |0.0175|
+ | - public_relations | 1|none | 5|acc |↑ | 0.2091|± |0.0390|
+ | - security_studies | 1|none | 5|acc |↑ | 0.2245|± |0.0267|
+ | - sociology | 1|none | 5|acc |↑ | 0.2388|± |0.0301|
+ | - us_foreign_policy | 1|none | 5|acc |↑ | 0.2400|± |0.0429|
+ | - stem | 2|none | |acc |↑ | 0.2578|± |0.0078|
+ | - abstract_algebra | 1|none | 5|acc |↑ | 0.2200|± |0.0416|
+ | - anatomy | 1|none | 5|acc |↑ | 0.3852|± |0.0420|
+ | - astronomy | 1|none | 5|acc |↑ | 0.1776|± |0.0311|
+ | - college_biology | 1|none | 5|acc |↑ | 0.2361|± |0.0355|
+ | - college_chemistry | 1|none | 5|acc |↑ | 0.2000|± |0.0402|
+ | - college_computer_science | 1|none | 5|acc |↑ | 0.1700|± |0.0378|
+ | - college_mathematics | 1|none | 5|acc |↑ | 0.2300|± |0.0423|
+ | - college_physics | 1|none | 5|acc |↑ | 0.2157|± |0.0409|
+ | - computer_security | 1|none | 5|acc |↑ | 0.2100|± |0.0409|
+ | - conceptual_physics | 1|none | 5|acc |↑ | 0.2723|± |0.0291|
+ | - electrical_engineering | 1|none | 5|acc |↑ | 0.2483|± |0.0360|
+ | - elementary_mathematics | 1|none | 5|acc |↑ | 0.2566|± |0.0225|
+ | - high_school_biology | 1|none | 5|acc |↑ | 0.3097|± |0.0263|
+ | - high_school_chemistry | 1|none | 5|acc |↑ | 0.2906|± |0.0319|
+ | - high_school_computer_science | 1|none | 5|acc |↑ | 0.1900|± |0.0394|
+ | - high_school_mathematics | 1|none | 5|acc |↑ | 0.2481|± |0.0263|
+ | - high_school_physics | 1|none | 5|acc |↑ | 0.2583|± |0.0357|
+ | - high_school_statistics | 1|none | 5|acc |↑ | 0.3333|± |0.0321|
+ | - machine_learning | 1|none | 5|acc |↑ | 0.2321|± |0.0401|
+ |piqa | 1|none | 5|acc |↑ | 0.5560|± |0.0116|
+ | | |none | 5|acc_norm |↑ | 0.5533|± |0.0116|
+ |swag | 1|none | 5|acc |↑ | 0.2963|± |0.0032|
+ | | |none | 5|acc_norm |↑ | 0.3201|± |0.0033|
+ |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.2399|± |0.0149|
+ | | |none | 0|bleu_diff |↑ |-1.2697|± |0.3469|
+ | | |none | 0|bleu_max |↑ |10.7605|± |0.5507|
+ | | |none | 0|rouge1_acc |↑ | 0.2864|± |0.0158|
+ | | |none | 0|rouge1_diff|↑ |-2.4981|± |0.4567|
+ | | |none | 0|rouge1_max |↑ |22.1008|± |0.8090|
+ | | |none | 0|rouge2_acc |↑ | 0.0979|± |0.0104|
+ | | |none | 0|rouge2_diff|↑ |-1.7592|± |0.4683|
+ | | |none | 0|rouge2_max |↑ |11.8332|± |0.7556|
+ | | |none | 0|rougeL_acc |↑ | 0.2815|± |0.0157|
+ | | |none | 0|rougeL_diff|↑ |-2.2800|± |0.4424|
+ | | |none | 0|rougeL_max |↑ |20.7733|± |0.7985|
+ |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2705|± |0.0156|
+ |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4591|± |0.0159|
+
+ | Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
+ |------------------|------:|------|------|------|---|-----:|---|-----:|
+ |mmlu | 2|none | |acc |↑ |0.2413|± |0.0036|
+ | - humanities | 2|none | |acc |↑ |0.2446|± |0.0063|
+ | - other | 2|none | |acc |↑ |0.2288|± |0.0075|
+ | - social sciences| 2|none | |acc |↑ |0.2317|± |0.0076|
+ | - stem | 2|none | |acc |↑ |0.2578|± |0.0078|