guanning committed · verified
Commit d825e06 · Parent(s): 95d3eaa

Add 0413 multiple model eval table

Files changed (1): 0413/0413_multiple_model_evals.md (added, +172 −0)
# Seven Setting Eval Tables

Last updated: 2026-04-13 UTC

Notes:
- `pass@1` is taken from `accuracy/mean`.
- `combined` is only defined from `pass@4` onward, so its `pass@1` and `pass@2` cells are left blank.
- Other blank cells mean the number is not available yet because the desired eval is still pending.
- For training runs, I pulled the metric from W&B history at the requested step (for example `_step=400` or `_step=1000`).

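The tables below report `pass@k`. This note does not spell out the estimator, but the standard unbiased estimator (averaged over problems) is consistent with `pass@1 = accuracy/mean`; a minimal sketch, assuming `n` generations per problem of which `c` are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (c correct)
    is correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect generations: every size-k draw hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The table entry for a given k is this value averaged over all eval problems.
```

With `N_VAL = 512` samples per problem, `pass@512` uses every sample, so it equals the fraction of problems with at least one correct generation.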
## Setting 1

Qwen2.5-0.5B-Instruct, GSM8K train `2000` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | s4bxcc1l | 512 | resume global_step_2000 |
| 4-LoRA Single/Combined | rk9ic9kk | 2048 | resume global_step_2000 |
| MERL Single/Combined | (pending) | 2048 | eval run not finished yet; left blank |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.6264 | 0.6151 | | | |
| 2 | 0.5943 | 0.6898 | 0.6960 | | | |
| 4 | 0.7012 | 0.7442 | 0.7594 | 0.8048 | | |
| 8 | 0.7878 | 0.7915 | 0.8118 | 0.8568 | | |
| 16 | 0.8560 | 0.8318 | 0.8544 | 0.8963 | | |
| 32 | 0.9065 | 0.8663 | 0.8885 | 0.9252 | | |
| 64 | 0.9417 | 0.8953 | 0.9157 | 0.9463 | | |
| 128 | 0.9651 | 0.9176 | 0.9365 | 0.9626 | | |
| 256 | 0.9799 | 0.9350 | 0.9516 | 0.9751 | | |
| 512 | 0.9909 | 0.9487 | 0.9622 | 0.9838 | | |

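One way a `Combined` column over 4 adapters could be produced, consistent with it being defined only from `pass@4` onward, is to split the k draws evenly across the adapters. This is an assumption for illustration, not something the note confirms; a hypothetical sketch:

```python
from math import comb, prod

def pass_at_k_single(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one adapter: n samples, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_combined(stats: list[tuple[int, int]], k: int) -> float:
    """Hypothetical combined pass@k: split the k draws evenly across the
    adapters in `stats` (list of per-adapter (n, c) pairs), so k must be a
    multiple of the adapter count -- with 4 adapters, that means pass@4 onward."""
    m = len(stats)
    assert k % m == 0, "combined pass@k needs k divisible by the adapter count"
    per = k // m
    # The combined attempt fails only if every adapter's per-adapter draw fails.
    fail = prod(1.0 - pass_at_k_single(n, c, per) for n, c in stats)
    return 1.0 - fail
```

Under this reading, `N_VAL = 2048` for the 4-LoRA rows would be 512 samples per adapter, which matches the source tables above.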
## Setting 2

Qwen2.5-0.5B-Instruct, GSM8K train `200` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | xw4w9c0u | 512 | resume global_step_200 |
| 4-LoRA Single/Combined | 2rytl841 | 2048 | resume global_step_200 |
| MERL Single/Combined | 0041qzrm | 2048 | resume global_step_200 |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.5942 | 0.5703 | | 0.5335 | |
| 2 | 0.5943 | 0.6842 | 0.6656 | | 0.6450 | |
| 4 | 0.7012 | 0.7557 | 0.7438 | 0.7772 | 0.7374 | 0.7584 |
| 8 | 0.7878 | 0.8125 | 0.8069 | 0.8389 | 0.8116 | 0.8308 |
| 16 | 0.8560 | 0.8590 | 0.8572 | 0.8871 | 0.8694 | 0.8861 |
| 32 | 0.9065 | 0.8978 | 0.8969 | 0.9237 | 0.9127 | 0.9266 |
| 64 | 0.9417 | 0.9285 | 0.9271 | 0.9503 | 0.9437 | 0.9544 |
| 128 | 0.9651 | 0.9497 | 0.9491 | 0.9682 | 0.9646 | 0.9723 |
| 256 | 0.9799 | 0.9636 | 0.9647 | 0.9795 | 0.9785 | 0.9841 |
| 512 | 0.9909 | 0.9727 | 0.9754 | 0.9870 | 0.9880 | 0.9920 |

## Setting 3

Qwen3-0.6B-Base, MATH train `400` step, MATH eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | 1eidnqtd | 512 | base eval on Math500 |
| Single Avg | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |
| Combined | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2154 | | |
| 2 | 0.3370 | | |
| 4 | 0.4754 | | |
| 8 | 0.6065 | | |
| 16 | 0.7143 | | |
| 32 | 0.7946 | | |
| 64 | 0.8513 | | |
| 128 | 0.8916 | | |
| 256 | 0.9207 | | |
| 512 | 0.9416 | | |

## Setting 4

Qwen2.5-0.5B-Instruct, MATH train `400` step, MATH eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | ub2ua0fb | 512 | base eval on Math500 |
| Single Avg | bfgx3ra4 | 2048 | resume global_step_400 |
| Combined | bfgx3ra4 | 2048 | resume global_step_400 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.3081 | 0.3568 | |
| 2 | 0.4144 | 0.4484 | |
| 4 | 0.5162 | 0.5351 | 0.5514 |
| 8 | 0.6078 | 0.6140 | 0.6305 |
| 16 | 0.6890 | 0.6847 | 0.7014 |
| 32 | 0.7598 | 0.7463 | 0.7634 |
| 64 | 0.8180 | 0.7977 | 0.8141 |
| 128 | 0.8627 | 0.8398 | 0.8549 |
| 256 | 0.8956 | 0.8750 | 0.8883 |
| 512 | 0.9195 | 0.9054 | 0.9147 |

## Setting 5

SmolLM2-360M-Instruct, GSM8K train `1000` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) | | no standalone base eval run found |
| Single Avg | uw2s3olq @ _step=1000 | 2048 | training-run history |
| Combined | uw2s3olq @ _step=1000 | 2048 | training-run history |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | | 0.2237 | |
| 2 | | 0.2939 | |
| 4 | | 0.3664 | 0.4218 |
| 8 | | 0.4397 | 0.5067 |
| 16 | | 0.5130 | 0.5902 |
| 32 | | 0.5850 | 0.6704 |
| 64 | | 0.6530 | 0.7439 |
| 128 | | 0.7147 | 0.8064 |
| 256 | | 0.7692 | 0.8564 |
| 512 | | 0.8166 | 0.8968 |

## Setting 6

SmolLM2-360M-Instruct, GSM8K train `200` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) | | no standalone base eval run found |
| Single Avg | zv5xbryh | 2048 | resume global_step_200 |
| Combined | zv5xbryh | 2048 | resume global_step_200 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | | 0.1588 | |
| 2 | | 0.2213 | |
| 4 | | 0.2925 | 0.3359 |
| 8 | | 0.3718 | 0.4268 |
| 16 | | 0.4564 | 0.5222 |
| 32 | | 0.5410 | 0.6159 |
| 64 | | 0.6196 | 0.7016 |
| 128 | | 0.6895 | 0.7739 |
| 256 | | 0.7512 | 0.8315 |
| 512 | | 0.8056 | 0.8767 |

## Setting 7

Qwen3-0.6B-Base, GSM8K train `400` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m2nt7fyg | 512 | base eval on GSM8K |
| Single Avg | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |
| Combined | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2707 | 0.7743 | |
| 2 | 0.4321 | 0.8348 | |
| 4 | 0.6106 | 0.8782 | 0.9012 |
| 8 | 0.7616 | 0.9098 | 0.9302 |
| 16 | 0.8629 | 0.9330 | 0.9509 |
| 32 | 0.9222 | 0.9503 | 0.9655 |
| 64 | 0.9553 | 0.9628 | 0.9754 |
| 128 | 0.9741 | 0.9716 | 0.9826 |
| 256 | 0.9843 | 0.9778 | 0.9881 |
| 512 | 0.9901 | 0.9830 | 0.9921 |