Add 0413 multiple model eval table
0413/0413_multiple_model_evals.md
ADDED
@@ -0,0 +1,172 @@
# Seven Setting Eval Tables

Last updated: 2026-04-13 UTC

Notes:
- `pass@1` is taken from `accuracy/mean`.
- `combined` is only defined from `pass@4` onward, so `pass@1` and `pass@2` are left blank.
- Blank cells mean the number is not available yet or I intentionally left it blank because the desired eval is still pending.
- For training runs, I pulled the metric from W&B history at the requested step (for example `_step=400` or `_step=1000`).

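The first note equates `pass@1` with `accuracy/mean`. A minimal sketch of why that holds, assuming the tables use the standard unbiased pass@k estimator (the notes do not say which estimator was used, so treat this as an assumption): with `n` samples per problem of which `c` are correct, the estimate at `k=1` reduces to `c/n`, the per-problem accuracy. The function names below are illustrative only.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn for the problem, c: samples that were correct,
    k: budget being estimated. Assumes the standard estimator
    1 - C(n - c, k) / C(n, k); the source does not confirm this choice.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

At `k=1` the estimate for each problem is `c/n`, so averaging over problems gives the mean accuracy, which is consistent with reading `pass@1` off `accuracy/mean`.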
## Setting 1

Qwen2.5-0.5B-Instruct, GSM8K train `2000` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | s4bxcc1l | 512 | resume global_step_2000 |
| 4-LoRA Single/Combined | rk9ic9kk | 2048 | resume global_step_2000 |
| MERL Single/Combined | (pending) | 2048 | eval run not finished yet; left blank |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.6264 | 0.6151 | | | |
| 2 | 0.5943 | 0.6898 | 0.6960 | | | |
| 4 | 0.7012 | 0.7442 | 0.7594 | 0.8048 | | |
| 8 | 0.7878 | 0.7915 | 0.8118 | 0.8568 | | |
| 16 | 0.8560 | 0.8318 | 0.8544 | 0.8963 | | |
| 32 | 0.9065 | 0.8663 | 0.8885 | 0.9252 | | |
| 64 | 0.9417 | 0.8953 | 0.9157 | 0.9463 | | |
| 128 | 0.9651 | 0.9176 | 0.9365 | 0.9626 | | |
| 256 | 0.9799 | 0.9350 | 0.9516 | 0.9751 | | |
| 512 | 0.9909 | 0.9487 | 0.9622 | 0.9838 | | |

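As a reading aid for the Setting 1 table, the sketch below finds the smallest `k` at which the Base column exceeds a fine-tuned column. The values are copied from the table; the helper name `first_crossover` is made up for illustration, and the comparison is a plain numeric check with no claim about statistical significance.

```python
from typing import Optional, Sequence

# Values copied from the Setting 1 table; None marks a blank cell.
KS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
BASE = [0.4661, 0.5943, 0.7012, 0.7878, 0.8560, 0.9065, 0.9417, 0.9651, 0.9799, 0.9909]
ONE_LORA = [0.6264, 0.6898, 0.7442, 0.7915, 0.8318, 0.8663, 0.8953, 0.9176, 0.9350, 0.9487]
FOUR_LORA_SINGLE = [0.6151, 0.6960, 0.7594, 0.8118, 0.8544, 0.8885, 0.9157, 0.9365, 0.9516, 0.9622]
FOUR_LORA_COMBINED = [None, None, 0.8048, 0.8568, 0.8963, 0.9252, 0.9463, 0.9626, 0.9751, 0.9838]


def first_crossover(
    ks: Sequence[int],
    base: Sequence[Optional[float]],
    variant: Sequence[Optional[float]],
) -> Optional[int]:
    """Return the smallest k where base pass@k is strictly above the variant.

    Rows where either cell is blank (None) are skipped. Returns None if
    the base never exceeds the variant on the listed k values.
    """
    for k, b, v in zip(ks, base, variant):
        if b is not None and v is not None and b > v:
            return k
    return None


for name, column in [
    ("1-LoRA", ONE_LORA),
    ("4-LoRA Single", FOUR_LORA_SINGLE),
    ("4-LoRA Combined", FOUR_LORA_COMBINED),
]:
    print(name, first_crossover(KS, BASE, column))
```

On these numbers Base moves ahead of 1-LoRA and 4-LoRA Single at `k=16`, and ahead of 4-LoRA Combined at `k=128`.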
## Setting 2

Qwen2.5-0.5B-Instruct, GSM8K train `200` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | xw4w9c0u | 512 | resume global_step_200 |
| 4-LoRA Single/Combined | 2rytl841 | 2048 | resume global_step_200 |
| MERL Single/Combined | 0041qzrm | 2048 | resume global_step_200 |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.5942 | 0.5703 | | 0.5335 | |
| 2 | 0.5943 | 0.6842 | 0.6656 | | 0.6450 | |
| 4 | 0.7012 | 0.7557 | 0.7438 | 0.7772 | 0.7374 | 0.7584 |
| 8 | 0.7878 | 0.8125 | 0.8069 | 0.8389 | 0.8116 | 0.8308 |
| 16 | 0.8560 | 0.8590 | 0.8572 | 0.8871 | 0.8694 | 0.8861 |
| 32 | 0.9065 | 0.8978 | 0.8969 | 0.9237 | 0.9127 | 0.9266 |
| 64 | 0.9417 | 0.9285 | 0.9271 | 0.9503 | 0.9437 | 0.9544 |
| 128 | 0.9651 | 0.9497 | 0.9491 | 0.9682 | 0.9646 | 0.9723 |
| 256 | 0.9799 | 0.9636 | 0.9647 | 0.9795 | 0.9785 | 0.9841 |
| 512 | 0.9909 | 0.9727 | 0.9754 | 0.9870 | 0.9880 | 0.9920 |

## Setting 3

Qwen3-0.6B-Base, MATH train `400` step, Math eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | 1eidnqtd | 512 | base eval on Math500 |
| Single Avg | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |
| Combined | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2154 | | |
| 2 | 0.3370 | | |
| 4 | 0.4754 | | |
| 8 | 0.6065 | | |
| 16 | 0.7143 | | |
| 32 | 0.7946 | | |
| 64 | 0.8513 | | |
| 128 | 0.8916 | | |
| 256 | 0.9207 | | |
| 512 | 0.9416 | | |

## Setting 4

Qwen2.5-0.5B-Instruct, MATH train `400` step, Math eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | ub2ua0fb | 512 | base eval on Math500 |
| Single Avg | bfgx3ra4 | 2048 | resume global_step_400 |
| Combined | bfgx3ra4 | 2048 | resume global_step_400 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.3081 | 0.3568 | |
| 2 | 0.4144 | 0.4484 | |
| 4 | 0.5162 | 0.5351 | 0.5514 |
| 8 | 0.6078 | 0.6140 | 0.6305 |
| 16 | 0.6890 | 0.6847 | 0.7014 |
| 32 | 0.7598 | 0.7463 | 0.7634 |
| 64 | 0.8180 | 0.7977 | 0.8141 |
| 128 | 0.8627 | 0.8398 | 0.8549 |
| 256 | 0.8956 | 0.8750 | 0.8883 |
| 512 | 0.9195 | 0.9054 | 0.9147 |

## Setting 5

SmolLM2-360M-Instruct, GSM8K train `1000` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) | | no standalone base eval run found |
| Single Avg | uw2s3olq @ _step=1000 | 2048 | training-run history |
| Combined | uw2s3olq @ _step=1000 | 2048 | training-run history |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | | 0.2237 | |
| 2 | | 0.2939 | |
| 4 | | 0.3664 | 0.4218 |
| 8 | | 0.4397 | 0.5067 |
| 16 | | 0.5130 | 0.5902 |
| 32 | | 0.5850 | 0.6704 |
| 64 | | 0.6530 | 0.7439 |
| 128 | | 0.7147 | 0.8064 |
| 256 | | 0.7692 | 0.8564 |
| 512 | | 0.8166 | 0.8968 |

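For sources listed as a training run at a given `_step` (as in Setting 5), the lookup amounts to scanning the run's logged history for the row at that step. The sketch below shows that filtering on plain dictionaries. It assumes each history row is a mapping with a `_step` field, which is how W&B's public API yields rows (for example from `Run.scan_history()`); the exact query used for these tables is not shown in the source, and the toy rows here are invented placeholders, not real results.

```python
from typing import Any, Iterable, Mapping, Optional


def metric_at_step(
    rows: Iterable[Mapping[str, Any]],
    key: str,
    step: int,
) -> Optional[Any]:
    """Return the value logged under `key` at exactly `_step == step`.

    Rows that lack the key (or log it as None) are skipped, since a
    history row only carries the metrics written at that step.
    Returns None when no matching row is found.
    """
    for row in rows:
        if row.get("_step") == step and row.get(key) is not None:
            return row[key]
    return None


# Toy history rows (placeholder values, not taken from any real run).
toy_history = [
    {"_step": 999, "train/loss": 0.5},
    {"_step": 1000, "train/loss": 0.4},
    {"_step": 1000, "accuracy/mean": 0.25},
]

print(metric_at_step(toy_history, "accuracy/mean", 1000))
```

Matching on the exact step, rather than taking the last logged value, keeps the number tied to the checkpoint named in the Source column even if the run continued training afterwards.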
## Setting 6

SmolLM2-360M-Instruct, GSM8K train `200` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) | | no standalone base eval run found |
| Single Avg | zv5xbryh | 2048 | resume global_step_200 |
| Combined | zv5xbryh | 2048 | resume global_step_200 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | | 0.1588 | |
| 2 | | 0.2213 | |
| 4 | | 0.2925 | 0.3359 |
| 8 | | 0.3718 | 0.4268 |
| 16 | | 0.4564 | 0.5222 |
| 32 | | 0.5410 | 0.6159 |
| 64 | | 0.6196 | 0.7016 |
| 128 | | 0.6895 | 0.7739 |
| 256 | | 0.7512 | 0.8315 |
| 512 | | 0.8056 | 0.8767 |

## Setting 7

Qwen3-0.6B-Base, GSM8K train `400` step, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m2nt7fyg | 512 | base eval on GSM8K |
| Single Avg | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |
| Combined | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2707 | 0.7743 | |
| 2 | 0.4321 | 0.8348 | |
| 4 | 0.6106 | 0.8782 | 0.9012 |
| 8 | 0.7616 | 0.9098 | 0.9302 |
| 16 | 0.8629 | 0.9330 | 0.9509 |
| 32 | 0.9222 | 0.9503 | 0.9655 |
| 64 | 0.9553 | 0.9628 | 0.9754 |
| 128 | 0.9741 | 0.9716 | 0.9826 |
| 256 | 0.9843 | 0.9778 | 0.9881 |
| 512 | 0.9901 | 0.9830 | 0.9921 |