Upload raw_lmharness_eval_output.txt
### 0-Shot

| Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
|---------------------------------------|------:|----------------|-----:|-----------|---|------:|---|-----:|
|anli_r1 | 1|none | 0|acc |↑ | 0.3690|± |0.0153|
|anli_r2 | 1|none | 0|acc |↑ | 0.3310|± |0.0149|
|anli_r3 | 1|none | 0|acc |↑ | 0.3233|± |0.0135|
|arc_challenge | 1|none | 0|acc |↑ | 0.1809|± |0.0112|
| | |none | 0|acc_norm |↑ | 0.2210|± |0.0121|
|arc_easy | 1|none | 0|acc |↑ | 0.3283|± |0.0096|
| | |none | 0|acc_norm |↑ | 0.3194|± |0.0096|
|gsm8k | 3|flexible-extract| 5|exact_match|↑ | 0.0114|± |0.0029|
| | |strict-match | 5|exact_match|↑ | 0.0015|± |0.0011|
|hellaswag | 1|none | 0|acc |↑ | 0.2649|± |0.0044|
| | |none | 0|acc_norm |↑ | 0.2677|± |0.0044|
|mmlu | 2|none | |acc |↑ | 0.2300|± |0.0035|
| - humanities | 2|none | |acc |↑ | 0.2429|± |0.0063|
| - formal_logic | 1|none | 0|acc |↑ | 0.2381|± |0.0381|
| - high_school_european_history | 1|none | 0|acc |↑ | 0.2424|± |0.0335|
| - high_school_us_history | 1|none | 0|acc |↑ | 0.2402|± |0.0300|
| - high_school_world_history | 1|none | 0|acc |↑ | 0.2743|± |0.0290|
| - international_law | 1|none | 0|acc |↑ | 0.2314|± |0.0385|
| - jurisprudence | 1|none | 0|acc |↑ | 0.2685|± |0.0428|
| - logical_fallacies | 1|none | 0|acc |↑ | 0.2270|± |0.0329|
| - moral_disputes | 1|none | 0|acc |↑ | 0.2514|± |0.0234|
| - moral_scenarios | 1|none | 0|acc |↑ | 0.2380|± |0.0142|
| - philosophy | 1|none | 0|acc |↑ | 0.1961|± |0.0226|
| - prehistory | 1|none | 0|acc |↑ | 0.2191|± |0.0230|
| - professional_law | 1|none | 0|acc |↑ | 0.2464|± |0.0110|
| - world_religions | 1|none | 0|acc |↑ | 0.3216|± |0.0358|
| - other | 2|none | |acc |↑ | 0.2350|± |0.0076|
| - business_ethics | 1|none | 0|acc |↑ | 0.3000|± |0.0461|
| - clinical_knowledge | 1|none | 0|acc |↑ | 0.2113|± |0.0251|
| - college_medicine | 1|none | 0|acc |↑ | 0.2081|± |0.0310|
| - global_facts | 1|none | 0|acc |↑ | 0.1700|± |0.0378|
| - human_aging | 1|none | 0|acc |↑ | 0.2870|± |0.0304|
| - management | 1|none | 0|acc |↑ | 0.1748|± |0.0376|
| - marketing | 1|none | 0|acc |↑ | 0.2821|± |0.0295|
| - medical_genetics | 1|none | 0|acc |↑ | 0.3000|± |0.0461|
| - miscellaneous | 1|none | 0|acc |↑ | 0.2452|± |0.0154|
| - nutrition | 1|none | 0|acc |↑ | 0.1993|± |0.0229|
| - professional_accounting | 1|none | 0|acc |↑ | 0.2234|± |0.0248|
| - professional_medicine | 1|none | 0|acc |↑ | 0.1801|± |0.0233|
| - virology | 1|none | 0|acc |↑ | 0.2892|± |0.0353|
| - social sciences | 2|none | |acc |↑ | 0.2168|± |0.0074|
| - econometrics | 1|none | 0|acc |↑ | 0.2368|± |0.0400|
| - high_school_geography | 1|none | 0|acc |↑ | 0.1768|± |0.0272|
| - high_school_government_and_politics| 1|none | 0|acc |↑ | 0.1969|± |0.0287|
| - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.2103|± |0.0207|
| - high_school_microeconomics | 1|none | 0|acc |↑ | 0.2059|± |0.0263|
| - high_school_psychology | 1|none | 0|acc |↑ | 0.1890|± |0.0168|
| - human_sexuality | 1|none | 0|acc |↑ | 0.2519|± |0.0381|
| - professional_psychology | 1|none | 0|acc |↑ | 0.2467|± |0.0174|
| - public_relations | 1|none | 0|acc |↑ | 0.2182|± |0.0396|
| - security_studies | 1|none | 0|acc |↑ | 0.1918|± |0.0252|
| - sociology | 1|none | 0|acc |↑ | 0.2438|± |0.0304|
| - us_foreign_policy | 1|none | 0|acc |↑ | 0.2900|± |0.0456|
| - stem | 2|none | |acc |↑ | 0.2185|± |0.0073|
| - abstract_algebra | 1|none | 0|acc |↑ | 0.1900|± |0.0394|
| - anatomy | 1|none | 0|acc |↑ | 0.2296|± |0.0363|
| - astronomy | 1|none | 0|acc |↑ | 0.1711|± |0.0306|
| - college_biology | 1|none | 0|acc |↑ | 0.2569|± |0.0365|
| - college_chemistry | 1|none | 0|acc |↑ | 0.1800|± |0.0386|
| - college_computer_science | 1|none | 0|acc |↑ | 0.2400|± |0.0429|
| - college_mathematics | 1|none | 0|acc |↑ | 0.2100|± |0.0409|
| - college_physics | 1|none | 0|acc |↑ | 0.2255|± |0.0416|
| - computer_security | 1|none | 0|acc |↑ | 0.3000|± |0.0461|
| - conceptual_physics | 1|none | 0|acc |↑ | 0.2681|± |0.0290|
| - electrical_engineering | 1|none | 0|acc |↑ | 0.2414|± |0.0357|
| - elementary_mathematics | 1|none | 0|acc |↑ | 0.2090|± |0.0209|
| - high_school_biology | 1|none | 0|acc |↑ | 0.1871|± |0.0222|
| - high_school_chemistry | 1|none | 0|acc |↑ | 0.1921|± |0.0277|
| - high_school_computer_science | 1|none | 0|acc |↑ | 0.2700|± |0.0446|
| - high_school_mathematics | 1|none | 0|acc |↑ | 0.2222|± |0.0253|
| - high_school_physics | 1|none | 0|acc |↑ | 0.1987|± |0.0326|
| - high_school_statistics | 1|none | 0|acc |↑ | 0.1528|± |0.0245|
| - machine_learning | 1|none | 0|acc |↑ | 0.3214|± |0.0443|
|piqa | 1|none | 0|acc |↑ | 0.5544|± |0.0116|
| | |none | 0|acc_norm |↑ | 0.5571|± |0.0116|
|swag | 1|none | 0|acc |↑ | 0.3024|± |0.0032|
| | |none | 0|acc_norm |↑ | 0.3297|± |0.0033|
|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.2399|± |0.0149|
| | |none | 0|bleu_diff |↑ |-1.2697|± |0.3469|
| | |none | 0|bleu_max |↑ |10.7605|± |0.5507|
| | |none | 0|rouge1_acc |↑ | 0.2864|± |0.0158|
| | |none | 0|rouge1_diff|↑ |-2.4981|± |0.4567|
| | |none | 0|rouge1_max |↑ |22.1008|± |0.8090|
| | |none | 0|rouge2_acc |↑ | 0.0979|± |0.0104|
| | |none | 0|rouge2_diff|↑ |-1.7592|± |0.4683|
| | |none | 0|rouge2_max |↑ |11.8332|± |0.7556|
| | |none | 0|rougeL_acc |↑ | 0.2815|± |0.0157|
| | |none | 0|rougeL_diff|↑ |-2.2800|± |0.4424|
| | |none | 0|rougeL_max |↑ |20.7733|± |0.7985|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2705|± |0.0156|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4591|± |0.0159|

| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.2300|± |0.0035|
| - humanities | 2|none | |acc |↑ |0.2429|± |0.0063|
| - other | 2|none | |acc |↑ |0.2350|± |0.0076|
| - social sciences| 2|none | |acc |↑ |0.2168|± |0.0074|
| - stem | 2|none | |acc |↑ |0.2185|± |0.0073|

### 5-Shot

fewshot: 5, batch_size: 16

| Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
|---------------------------------------|------:|----------------|-----:|-----------|---|------:|---|-----:|
|anli_r1 | 1|none | 5|acc |↑ | 0.3500|± |0.0151|
|anli_r2 | 1|none | 5|acc |↑ | 0.3340|± |0.0149|
|anli_r3 | 1|none | 5|acc |↑ | 0.3250|± |0.0135|
|arc_challenge | 1|none | 5|acc |↑ | 0.1843|± |0.0113|
| | |none | 5|acc_norm |↑ | 0.2184|± |0.0121|
|arc_easy | 1|none | 5|acc |↑ | 0.3380|± |0.0097|
| | |none | 5|acc_norm |↑ | 0.3215|± |0.0096|
|gsm8k | 3|flexible-extract| 5|exact_match|↑ | 0.0114|± |0.0029|
| | |strict-match | 5|exact_match|↑ | 0.0015|± |0.0011|
|hellaswag | 1|none | 5|acc |↑ | 0.2644|± |0.0044|
| | |none | 5|acc_norm |↑ | 0.2657|± |0.0044|
|mmlu | 2|none | |acc |↑ | 0.2413|± |0.0036|
| - humanities | 2|none | |acc |↑ | 0.2446|± |0.0063|
| - formal_logic | 1|none | 5|acc |↑ | 0.1825|± |0.0346|
| - high_school_european_history | 1|none | 5|acc |↑ | 0.2364|± |0.0332|
| - high_school_us_history | 1|none | 5|acc |↑ | 0.2108|± |0.0286|
| - high_school_world_history | 1|none | 5|acc |↑ | 0.2447|± |0.0280|
| - international_law | 1|none | 5|acc |↑ | 0.2727|± |0.0407|
| - jurisprudence | 1|none | 5|acc |↑ | 0.2593|± |0.0424|
| - logical_fallacies | 1|none | 5|acc |↑ | 0.2699|± |0.0349|
| - moral_disputes | 1|none | 5|acc |↑ | 0.2688|± |0.0239|
| - moral_scenarios | 1|none | 5|acc |↑ | 0.2380|± |0.0142|
| - philosophy | 1|none | 5|acc |↑ | 0.2283|± |0.0238|
| - prehistory | 1|none | 5|acc |↑ | 0.2809|± |0.0250|
| - professional_law | 1|none | 5|acc |↑ | 0.2464|± |0.0110|
| - world_religions | 1|none | 5|acc |↑ | 0.2164|± |0.0316|
| - other | 2|none | |acc |↑ | 0.2288|± |0.0075|
| - business_ethics | 1|none | 5|acc |↑ | 0.2400|± |0.0429|
| - clinical_knowledge | 1|none | 5|acc |↑ | 0.2189|± |0.0254|
| - college_medicine | 1|none | 5|acc |↑ | 0.2139|± |0.0313|
| - global_facts | 1|none | 5|acc |↑ | 0.1800|± |0.0386|
| - human_aging | 1|none | 5|acc |↑ | 0.2018|± |0.0269|
| - management | 1|none | 5|acc |↑ | 0.1748|± |0.0376|
| - marketing | 1|none | 5|acc |↑ | 0.1880|± |0.0256|
| - medical_genetics | 1|none | 5|acc |↑ | 0.3000|± |0.0461|
| - miscellaneous | 1|none | 5|acc |↑ | 0.2490|± |0.0155|
| - nutrition | 1|none | 5|acc |↑ | 0.2451|± |0.0246|
| - professional_accounting | 1|none | 5|acc |↑ | 0.2624|± |0.0262|
| - professional_medicine | 1|none | 5|acc |↑ | 0.1801|± |0.0233|
| - virology | 1|none | 5|acc |↑ | 0.2651|± |0.0344|
| - social sciences | 2|none | |acc |↑ | 0.2317|± |0.0076|
| - econometrics | 1|none | 5|acc |↑ | 0.2632|± |0.0414|
| - high_school_geography | 1|none | 5|acc |↑ | 0.2475|± |0.0307|
| - high_school_government_and_politics| 1|none | 5|acc |↑ | 0.2383|± |0.0307|
| - high_school_macroeconomics | 1|none | 5|acc |↑ | 0.2103|± |0.0207|
| - high_school_microeconomics | 1|none | 5|acc |↑ | 0.2689|± |0.0288|
| - high_school_psychology | 1|none | 5|acc |↑ | 0.1982|± |0.0171|
| - human_sexuality | 1|none | 5|acc |↑ | 0.2443|± |0.0377|
| - professional_psychology | 1|none | 5|acc |↑ | 0.2484|± |0.0175|
| - public_relations | 1|none | 5|acc |↑ | 0.2091|± |0.0390|
| - security_studies | 1|none | 5|acc |↑ | 0.2245|± |0.0267|
| - sociology | 1|none | 5|acc |↑ | 0.2388|± |0.0301|
| - us_foreign_policy | 1|none | 5|acc |↑ | 0.2400|± |0.0429|
| - stem | 2|none | |acc |↑ | 0.2578|± |0.0078|
| - abstract_algebra | 1|none | 5|acc |↑ | 0.2200|± |0.0416|
| - anatomy | 1|none | 5|acc |↑ | 0.3852|± |0.0420|
| - astronomy | 1|none | 5|acc |↑ | 0.1776|± |0.0311|
| - college_biology | 1|none | 5|acc |↑ | 0.2361|± |0.0355|
| - college_chemistry | 1|none | 5|acc |↑ | 0.2000|± |0.0402|
| - college_computer_science | 1|none | 5|acc |↑ | 0.1700|± |0.0378|
| - college_mathematics | 1|none | 5|acc |↑ | 0.2300|± |0.0423|
| - college_physics | 1|none | 5|acc |↑ | 0.2157|± |0.0409|
| - computer_security | 1|none | 5|acc |↑ | 0.2100|± |0.0409|
| - conceptual_physics | 1|none | 5|acc |↑ | 0.2723|± |0.0291|
| - electrical_engineering | 1|none | 5|acc |↑ | 0.2483|± |0.0360|
| - elementary_mathematics | 1|none | 5|acc |↑ | 0.2566|± |0.0225|
| - high_school_biology | 1|none | 5|acc |↑ | 0.3097|± |0.0263|
| - high_school_chemistry | 1|none | 5|acc |↑ | 0.2906|± |0.0319|
| - high_school_computer_science | 1|none | 5|acc |↑ | 0.1900|± |0.0394|
| - high_school_mathematics | 1|none | 5|acc |↑ | 0.2481|± |0.0263|
| - high_school_physics | 1|none | 5|acc |↑ | 0.2583|± |0.0357|
| - high_school_statistics | 1|none | 5|acc |↑ | 0.3333|± |0.0321|
| - machine_learning | 1|none | 5|acc |↑ | 0.2321|± |0.0401|
|piqa | 1|none | 5|acc |↑ | 0.5560|± |0.0116|
| | |none | 5|acc_norm |↑ | 0.5533|± |0.0116|
|swag | 1|none | 5|acc |↑ | 0.2963|± |0.0032|
| | |none | 5|acc_norm |↑ | 0.3201|± |0.0033|
|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.2399|± |0.0149|
| | |none | 0|bleu_diff |↑ |-1.2697|± |0.3469|
| | |none | 0|bleu_max |↑ |10.7605|± |0.5507|
| | |none | 0|rouge1_acc |↑ | 0.2864|± |0.0158|
| | |none | 0|rouge1_diff|↑ |-2.4981|± |0.4567|
| | |none | 0|rouge1_max |↑ |22.1008|± |0.8090|
| | |none | 0|rouge2_acc |↑ | 0.0979|± |0.0104|
| | |none | 0|rouge2_diff|↑ |-1.7592|± |0.4683|
| | |none | 0|rouge2_max |↑ |11.8332|± |0.7556|
| | |none | 0|rougeL_acc |↑ | 0.2815|± |0.0157|
| | |none | 0|rougeL_diff|↑ |-2.2800|± |0.4424|
| | |none | 0|rougeL_max |↑ |20.7733|± |0.7985|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2705|± |0.0156|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4591|± |0.0159|

| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.2413|± |0.0036|
| - humanities | 2|none | |acc |↑ |0.2446|± |0.0063|
| - other | 2|none | |acc |↑ |0.2288|± |0.0075|
| - social sciences| 2|none | |acc |↑ |0.2317|± |0.0076|
| - stem | 2|none | |acc |↑ |0.2578|± |0.0078|
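
The Stderr column in the tables above can be sanity-checked for plain accuracy metrics. The sketch below assumes each score is a 0/1 outcome and that the standard error is the sample standard deviation (with an n - 1 denominator) divided by sqrt(n), which reduces to sqrt(acc * (1 - acc) / (n - 1)). The helper name `binomial_stderr` and the sample count n = 1000 for anli_r1 are assumptions for illustration and are not stated in the output itself.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# acc and stderr are copied from the 0-shot anli_r1 row; n = 1000 is an assumed sample count.
acc, reported_stderr, n = 0.3690, 0.0153, 1000
estimate = binomial_stderr(acc, n)
print(f"estimated stderr = {estimate:.4f}, reported = {reported_stderr:.4f}")
```

If the estimate and the reported value agree to four decimals, the assumed sample count is at least consistent with the row. It does not confirm how the harness computed the figure.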