Harley-ml committed on
Commit
3a757cf
·
verified ·
1 Parent(s): a43f902

Update README.md

  1. README.md +501 -3
README.md CHANGED
@@ -1,3 +1,501 @@
- ---
- license: mit
- ---
+ ---
+ model-index:
+ - name: Tenete-8M
+   results:
+
+   # ====================== 0‑Shot ======================
+   # --- Multiple‑choice tasks ---
+   - task:
+       type: multiple-choice
+       name: ANLI R1 (0-Shot)
+     dataset:
+       type: anli_r1
+       name: ANLI R1
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3690
+
+   - task:
+       type: multiple-choice
+       name: ANLI R2 (0-Shot)
+     dataset:
+       type: anli_r2
+       name: ANLI R2
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3310
+
+   - task:
+       type: multiple-choice
+       name: ANLI R3 (0-Shot)
+     dataset:
+       type: anli_r3
+       name: ANLI R3
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3233
+
+   - task:
+       type: multiple-choice
+       name: ARC Challenge (0-Shot)
+     dataset:
+       type: arc_challenge
+       name: ARC Challenge
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.1809
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.2210
+
+   - task:
+       type: multiple-choice
+       name: ARC Easy (0-Shot)
+     dataset:
+       type: arc_easy
+       name: ARC Easy
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3283
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.3194
+
+   - task:
+       type: multiple-choice
+       name: HellaSwag (0-Shot)
+     dataset:
+       type: hellaswag
+       name: HellaSwag
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2649
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.2677
+
+   - task:
+       type: multiple-choice
+       name: MMLU (0-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2300
+
+   - task:
+       type: multiple-choice
+       name: MMLU Humanities (0-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU Humanities
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2429
+
+   - task:
+       type: multiple-choice
+       name: MMLU Other (0-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU Other
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2350
+
+   - task:
+       type: multiple-choice
+       name: MMLU Social Sciences (0-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU Social Sciences
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2168
+
+   - task:
+       type: multiple-choice
+       name: MMLU STEM (0-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU STEM
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2185
+
+   - task:
+       type: multiple-choice
+       name: PiQA (0-Shot)
+     dataset:
+       type: piqa
+       name: PiQA
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.5544
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.5571
+
+   - task:
+       type: multiple-choice
+       name: SWAG (0-Shot)
+     dataset:
+       type: swag
+       name: SWAG
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3024
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.3297
+
+   - task:
+       type: multiple-choice
+       name: TruthfulQA MC1 (0-Shot)
+     dataset:
+       type: truthfulqa_mc1
+       name: TruthfulQA MC1
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2705
+
+   - task:
+       type: multiple-choice
+       name: TruthfulQA MC2 (0-Shot)
+     dataset:
+       type: truthfulqa_mc2
+       name: TruthfulQA MC2
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.4591
+
+   # --- Generation tasks ---
+   - task:
+       type: text-generation
+       name: GSM8K (0-Shot)
+     dataset:
+       type: gsm8k
+       name: GSM8K
+     metrics:
+     - name: exact_match (flexible-extract)
+       type: exact_match
+       value: 0.0114
+     - name: exact_match (strict-match)
+       type: exact_match
+       value: 0.0015
+
+   - task:
+       type: text-generation
+       name: TruthfulQA Gen (0-Shot)
+     dataset:
+       type: truthfulqa_gen
+       name: TruthfulQA Gen
+     metrics:
+     - name: bleu_acc
+       type: bleu_acc
+       value: 0.2399
+     - name: bleu_diff
+       type: bleu_diff
+       value: -1.2697
+     - name: bleu_max
+       type: bleu_max
+       value: 10.7605
+     - name: rouge1_acc
+       type: rouge1_acc
+       value: 0.2864
+     - name: rouge1_diff
+       type: rouge1_diff
+       value: -2.4981
+     - name: rouge1_max
+       type: rouge1_max
+       value: 22.1008
+     - name: rouge2_acc
+       type: rouge2_acc
+       value: 0.0979
+     - name: rouge2_diff
+       type: rouge2_diff
+       value: -1.7592
+     - name: rouge2_max
+       type: rouge2_max
+       value: 11.8332
+     - name: rougeL_acc
+       type: rougeL_acc
+       value: 0.2815
+     - name: rougeL_diff
+       type: rougeL_diff
+       value: -2.2800
+     - name: rougeL_max
+       type: rougeL_max
+       value: 20.7733
+
+   # ====================== 5‑Shot ======================
+   # --- Multiple‑choice tasks ---
+   - task:
+       type: multiple-choice
+       name: ANLI R1 (5-Shot)
+     dataset:
+       type: anli_r1
+       name: ANLI R1
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3500
+
+   - task:
+       type: multiple-choice
+       name: ANLI R2 (5-Shot)
+     dataset:
+       type: anli_r2
+       name: ANLI R2
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3340
+
+   - task:
+       type: multiple-choice
+       name: ANLI R3 (5-Shot)
+     dataset:
+       type: anli_r3
+       name: ANLI R3
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3250
+
+   - task:
+       type: multiple-choice
+       name: ARC Challenge (5-Shot)
+     dataset:
+       type: arc_challenge
+       name: ARC Challenge
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.1843
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.2184
+
+   - task:
+       type: multiple-choice
+       name: ARC Easy (5-Shot)
+     dataset:
+       type: arc_easy
+       name: ARC Easy
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.3380
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.3215
+
+   - task:
+       type: multiple-choice
+       name: HellaSwag (5-Shot)
+     dataset:
+       type: hellaswag
+       name: HellaSwag
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2644
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.2657
+
+   - task:
+       type: multiple-choice
+       name: MMLU (5-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2413
+
+   - task:
+       type: multiple-choice
+       name: MMLU Humanities (5-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU Humanities
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2446
+
+   - task:
+       type: multiple-choice
+       name: MMLU Other (5-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU Other
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2288
+
+   - task:
+       type: multiple-choice
+       name: MMLU Social Sciences (5-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU Social Sciences
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2317
+
+   - task:
+       type: multiple-choice
+       name: MMLU STEM (5-Shot)
+     dataset:
+       type: mmlu
+       name: MMLU STEM
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2578
+
+   - task:
+       type: multiple-choice
+       name: PiQA (5-Shot)
+     dataset:
+       type: piqa
+       name: PiQA
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.5560
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.5533
+
+   - task:
+       type: multiple-choice
+       name: SWAG (5-Shot)
+     dataset:
+       type: swag
+       name: SWAG
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2963
+     - name: accuracy_norm
+       type: acc_norm
+       value: 0.3201
+
+   - task:
+       type: multiple-choice
+       name: TruthfulQA MC1 (5-Shot)*
+     dataset:
+       type: truthfulqa_mc1
+       name: TruthfulQA MC1
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.2705
+     # * 5‑shot setting not applicable; result identical to 0‑shot
+
+   - task:
+       type: multiple-choice
+       name: TruthfulQA MC2 (5-Shot)*
+     dataset:
+       type: truthfulqa_mc2
+       name: TruthfulQA MC2
+     metrics:
+     - name: accuracy
+       type: acc
+       value: 0.4591
+     # * 5‑shot setting not applicable; result identical to 0‑shot
+
+   # --- Generation tasks ---
+   - task:
+       type: text-generation
+       name: GSM8K (5-Shot)
+     dataset:
+       type: gsm8k
+       name: GSM8K
+     metrics:
+     - name: exact_match (flexible-extract)
+       type: exact_match
+       value: 0.0114
+     - name: exact_match (strict-match)
+       type: exact_match
+       value: 0.0015
+     # 5‑shot yields identical numbers in this run
+
+   - task:
+       type: text-generation
+       name: TruthfulQA Gen (5-Shot)*
+     dataset:
+       type: truthfulqa_gen
+       name: TruthfulQA Gen
+     metrics:
+     - name: bleu_acc
+       type: bleu_acc
+       value: 0.2399
+     - name: bleu_diff
+       type: bleu_diff
+       value: -1.2697
+     - name: bleu_max
+       type: bleu_max
+       value: 10.7605
+     - name: rouge1_acc
+       type: rouge1_acc
+       value: 0.2864
+     - name: rouge1_diff
+       type: rouge1_diff
+       value: -2.4981
+     - name: rouge1_max
+       type: rouge1_max
+       value: 22.1008
+     - name: rouge2_acc
+       type: rouge2_acc
+       value: 0.0979
+     - name: rouge2_diff
+       type: rouge2_diff
+       value: -1.7592
+     - name: rouge2_max
+       type: rouge2_max
+       value: 11.8332
+     - name: rougeL_acc
+       type: rougeL_acc
+       value: 0.2815
+     - name: rougeL_diff
+       type: rougeL_diff
+       value: -2.2800
+     - name: rougeL_max
+       type: rougeL_max
+       value: 20.7733
+     # * 5‑shot setting not applicable; result identical to 0‑shot
+ license: mit
+ language:
+ - en
+ tags:
+ - harley-ml
+ - tenete
+ - small
+ - sota
+ - slm
+ - text-generation
+ ---