Snider Virgil committed on
Commit 5cc8b73 · 1 Parent(s): 6825f72

eval: migrate results/ → .eval_results/ canonical storage

Same migration as lemer (d242f94): iter_*.parquet merged into a single
task-named canonical parquet, yaml + md regenerated from aggregates,
results/ folder removed.

Lemma n=4 r=8 scores from the first run:
base (gemma-4-e4b-it-4bit): 40.62% per-round, 50% majority (2/4)
lek (lthn/lemma): 31.25% per-round, 25% majority (1/4)
delta: -9.37pp per-round, -25pp majority

n=4 is well below the noise floor — these numbers will shift when more
runs contribute to the canon.
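The two metrics above can be sketched as follows. The data layout (a dict of per-question answer lists plus gold letters) is an assumption for illustration; per-round accuracy scores every sampled answer, while majority-vote scores one modal answer per question, which is why n=4 makes the majority number so coarse.

```python
# Sketch of the two 8-PAC metrics; input shapes are hypothetical.
from collections import Counter


def per_round_accuracy(rounds_by_q: dict, gold_by_q: dict) -> float:
    """Percent of all sampled answers (questions x rounds) that match gold."""
    hits = sum(a == gold_by_q[q] for q, answers in rounds_by_q.items() for a in answers)
    total = sum(len(answers) for answers in rounds_by_q.values())
    return 100.0 * hits / total


def majority_accuracy(rounds_by_q: dict, gold_by_q: dict) -> float:
    """Percent of questions whose most common answer matches gold."""
    wins = sum(
        Counter(answers).most_common(1)[0][0] == gold_by_q[q]
        for q, answers in rounds_by_q.items()
    )
    return 100.0 * wins / len(rounds_by_q)
```

Feeding in the base model's rounds from this run (F×8, C×7 F, D×8, ? G G G D G G ?) against golds I/E/D/G reproduces 40.62% per-round and 50% majority.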

Co-Authored-By: Virgil <virgil@lethean.io>

.eval_results/mmlu_pro.md ADDED
@@ -0,0 +1,21 @@
+ # TIGER-Lab/MMLU-Pro / mmlu_pro — 8-PAC Canon
+
+ Merged from 1 run(s) across 1 machine(s). Total rows: **64**.
+
+ ## Machines
+
+ - `studio`: 64 rows
+
+ ## Scores
+
+ | Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
+ |---|---|---|---|---|---|---|
+ | `base` | `mlx-community/gemma-4-e4b-it-4bit` | 32 | 4 | 8 | 40.62% | 50.00% (2/4) |
+ | `lek` | `lthn/lemma` | 32 | 4 | 8 | 31.25% | 25.00% (1/4) |
+
+ ## LEK delta
+
+ - per-round: **-9.37pp**
+ - majority-vote: **-25.00pp**
+
+ Last updated: 2026-04-11T12:42:49.576970+00:00
results/iter_2026-04-11T13-02-52.parquet → .eval_results/mmlu_pro.parquet RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f454336a58bb18dd80fbd5be8b5fc4e8fc1865bc406e4b5bfa4d6a0448bd342f
- size 166134
+ oid sha256:7b69475a610d2f31e2d831e2b17c34c8ae1d1605cf4f2b54d3ddceca044add99
+ size 166650
.eval_results/mmlu_pro.yaml CHANGED
@@ -4,20 +4,21 @@
  value: 25.0
  date: '2026-04-11'
  source:
- url: https://huggingface.co/lthn/lemma/tree/main/results
- name: Raw per-iteration results (parquet + latest.md)
+ url: https://huggingface.co/lthn/lemma/tree/main/.eval_results
+ name: Canonical per-round parquet
  user: lthn
- notes: '8-PAC paired run: 4 questions × 8 rounds. Google-calibrated sampling (temp=1.0,
- top_p=0.95, top_k=64), enable_thinking=True. Metric: majority-vote accuracy
- (headline).'
+ notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+ and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+ Metric: majority-vote accuracy (headline).'
  - dataset:
  id: TIGER-Lab/MMLU-Pro
  task_id: mmlu_pro
  value: 31.25
  date: '2026-04-11'
  source:
- url: https://huggingface.co/lthn/lemma/tree/main/results
- name: Raw per-iteration results (parquet + latest.md)
+ url: https://huggingface.co/lthn/lemma/tree/main/.eval_results
+ name: Canonical per-round parquet
  user: lthn
- notes: '8-PAC paired run: 4 questions × 8 rounds. Google-calibrated sampling (temp=1.0,
- top_p=0.95, top_k=64), enable_thinking=True. Metric: per-round mean accuracy.'
+ notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+ and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+ Metric: per-round mean accuracy.'
results/iter_2026-04-11T13-02-52.json DELETED
@@ -1,214 +0,0 @@
- {
- "base_model": "mlx-community/gemma-4-e4b-it-4bit",
- "this_model": "lthn/lemma",
- "task": "mmlu_pro",
- "n_questions": 4,
- "rounds": 8,
- "timestamp": 1775908972,
- "totals": {
- "base_hits": 13,
- "lek_hits": 10,
- "total_per_model": 32,
- "base_accuracy_pct": 40.62,
- "lek_accuracy_pct": 31.25,
- "delta_pp": -9.38
- },
- "questions": [
- {
- "question_index": 0,
- "gold_letter": "I",
- "gold_text": "5.40MeV",
- "gold_numeric": 5.4,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "F",
- "F",
- "F",
- "F",
- "F",
- "F",
- "F",
- "F"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 1.0,
- "entropy": -0.0,
- "majority_answer": "F",
- "majority_hit": false,
- "majority_distance": 0.1,
- "mean_distance": 0.1
- },
- "lek": {
- "rounds": [
- "F",
- "G",
- "B",
- "B",
- "B",
- "B",
- "B",
- "B"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 0.75,
- "entropy": 0.3538,
- "majority_answer": "B",
- "majority_hit": false,
- "majority_distance": 1.1,
- "mean_distance": 0.9125
- }
- }
- },
- {
- "question_index": 1,
- "gold_letter": "E",
- "gold_text": "19%",
- "gold_numeric": 19.0,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "C",
- "C",
- "C",
- "C",
- "C",
- "C",
- "C",
- "F"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 0.875,
- "entropy": 0.1812,
- "majority_answer": "C",
- "majority_hit": false,
- "majority_distance": 10.0,
- "mean_distance": 11.25
- },
- "lek": {
- "rounds": [
- "I",
- "C",
- "C",
- "F",
- "C",
- "C",
- "G",
- "C"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 0.625,
- "entropy": 0.5163,
- "majority_answer": "C",
- "majority_hit": false,
- "majority_distance": 10.0,
- "mean_distance": 21.25
- }
- }
- },
- {
- "question_index": 2,
- "gold_letter": "D",
- "gold_text": "The amount by which total output increases due to the addition of one unit of a given factor while the amount used of other factors of production remains unchanged",
- "gold_numeric": null,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D"
- ],
- "hit_count": 8,
- "total_rounds": 8,
- "confidence": 1.0,
- "entropy": -0.0,
- "majority_answer": "D",
- "majority_hit": true,
- "majority_distance": null,
- "mean_distance": null
- },
- "lek": {
- "rounds": [
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D"
- ],
- "hit_count": 8,
- "total_rounds": 8,
- "confidence": 1.0,
- "entropy": -0.0,
- "majority_answer": "D",
- "majority_hit": true,
- "majority_distance": null,
- "mean_distance": null
- }
- }
- },
- {
- "question_index": 3,
- "gold_letter": "G",
- "gold_text": "(i) 585, (ii) Yes",
- "gold_numeric": 585.0,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "?",
- "G",
- "G",
- "G",
- "D",
- "G",
- "G",
- "?"
- ],
- "hit_count": 5,
- "total_rounds": 8,
- "confidence": 0.625,
- "entropy": 0.4329,
- "majority_answer": "G",
- "majority_hit": true,
- "majority_distance": 0.0,
- "mean_distance": 22.5
- },
- "lek": {
- "rounds": [
- "?",
- "?",
- "?",
- "C",
- "A",
- "G",
- "N",
- "C"
- ],
- "hit_count": 2,
- "total_rounds": 8,
- "confidence": 0.375,
- "entropy": 0.7185,
- "majority_answer": "?",
- "majority_hit": false,
- "majority_distance": null,
- "mean_distance": 11.25
- }
- }
- }
- ]
- }
results/latest.md DELETED
The diff for this file is too large to render. See raw diff
 
results/report.txt DELETED
@@ -1,104 +0,0 @@
- ==============================================================================
- n=4 questions × 8 rounds × 2 models = 64 samples
- base: mlx-community/gemma-4-e4b-it-4bit
- lek: lthn/lemma
- task: mmlu_pro
- ==============================================================================
-
- ──────────────────────────────────────────────────────────────────────────────
- Q0: Answer the following multiple choice question. The last line of your response should be of the follo...
- gold = I: 5.40MeV
- (numeric: 5.4)
- ──────────────────────────────────────────────────────────────────────────────
-
- [base] rounds: F F F F F F F F hits: 0/8
- F: ████████ (8/8)
- confidence (max-share): 1.00 ██████████
- entropy (normalised): -0.00 (0=concentrated, 1=spread)
- majority distance: 0.100
- mean round distance: 0.100
-
- [lek] rounds: F G B B B B B B hits: 0/8
- B: ██████░░ (6/8)
- F: █░░░░░░░ (1/8)
- G: █░░░░░░░ (1/8)
- confidence (max-share): 0.75 ███████░░░
- entropy (normalised): 0.35 (0=concentrated, 1=spread)
- majority distance: 1.100
- mean round distance: 0.912
-
- ──────────────────────────────────────────────────────────────────────────────
- Q1: Answer the following multiple choice question. The last line of your response should be of the follo...
- gold = E: 19%
- (numeric: 19.0)
- ──────────────────────────────────────────────────────────────────────────────
-
- [base] rounds: C C C C C C C F hits: 0/8
- C: ███████░ (7/8)
- F: █░░░░░░░ (1/8)
- confidence (max-share): 0.88 ████████░░
- entropy (normalised): 0.18 (0=concentrated, 1=spread)
- majority distance: 10.000
- mean round distance: 11.250
-
- [lek] rounds: I C C F C C G C hits: 0/8
- C: █████░░░ (5/8)
- I: █░░░░░░░ (1/8)
- F: █░░░░░░░ (1/8)
- G: █░░░░░░░ (1/8)
- confidence (max-share): 0.62 ██████░░░░
- entropy (normalised): 0.52 (0=concentrated, 1=spread)
- majority distance: 10.000
- mean round distance: 21.250
-
- ──────────────────────────────────────────────────────────────────────────────
- Q2: Answer the following multiple choice question. The last line of your response should be of the follo...
- gold = D: The amount by which total output increases due to the addition of one unit of a given factor while the amount used of other factors of production remains unchanged
- ──────────────────────────────────────────────────────────────────────────────
-
- [base] rounds: [D][D][D][D][D][D][D][D] hits: 8/8
- D: ████████ (8/8)
- confidence (max-share): 1.00 ██████████
- entropy (normalised): -0.00 (0=concentrated, 1=spread)
-
- [lek] rounds: [D][D][D][D][D][D][D][D] hits: 8/8
- D: ████████ (8/8)
- confidence (max-share): 1.00 ██████████
- entropy (normalised): -0.00 (0=concentrated, 1=spread)
-
- ──────────────────────────────────────────────────────────────────────────────
- Q3: Answer the following multiple choice question. The last line of your response should be of the follo...
- gold = G: (i) 585, (ii) Yes
- (numeric: 585.0)
- ──────────────────────────────────────────────────────────────────────────────
-
- [base] rounds: ? [G][G][G] D [G][G] ? hits: 5/8
- G: █████░░░ (5/8)
- ?: ██░░░░░░ (2/8)
- D: █░░░░░░░ (1/8)
- confidence (max-share): 0.62 ██████░░░░
- entropy (normalised): 0.43 (0=concentrated, 1=spread)
- majority distance: 0.000
- mean round distance: 22.500
-
- [lek] rounds: ? ? ? C A [G] N C hits: 2/8
- ?: ███░░░░░ (3/8)
- C: ██░░░░░░ (2/8)
- A: █░░░░░░░ (1/8)
- G: █░░░░░░░ (1/8)
- N: █░░░░░░░ (1/8)
- confidence (max-share): 0.38 ███░░░░░░░
- entropy (normalised): 0.72 (0=concentrated, 1=spread)
- mean round distance: 11.250
-
- ==============================================================================
- Cross-question summary
- ==============================================================================
-
- Q base_conf lek_conf base_hit lek_hit delta
- Q0 1.00 0.75 0/8 0/8 +0
- Q1 0.88 0.62 0/8 0/8 +0
- Q2 1.00 1.00 8/8 8/8 +0
- Q3 0.62 0.38 5/8 2/8 -3
-
- TOTAL: base 13/32 (40.6%) lek 10/32 (31.2%) delta -9.4pp
results/summary.json DELETED
@@ -1,214 +0,0 @@
- {
- "base_model": "mlx-community/gemma-4-e4b-it-4bit",
- "this_model": "lthn/lemma",
- "task": "mmlu_pro",
- "n_questions": 4,
- "rounds": 8,
- "timestamp": 1775908972,
- "totals": {
- "base_hits": 13,
- "lek_hits": 10,
- "total_per_model": 32,
- "base_accuracy_pct": 40.62,
- "lek_accuracy_pct": 31.25,
- "delta_pp": -9.38
- },
- "questions": [
- {
- "question_index": 0,
- "gold_letter": "I",
- "gold_text": "5.40MeV",
- "gold_numeric": 5.4,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "F",
- "F",
- "F",
- "F",
- "F",
- "F",
- "F",
- "F"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 1.0,
- "entropy": -0.0,
- "majority_answer": "F",
- "majority_hit": false,
- "majority_distance": 0.1,
- "mean_distance": 0.1
- },
- "lek": {
- "rounds": [
- "F",
- "G",
- "B",
- "B",
- "B",
- "B",
- "B",
- "B"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 0.75,
- "entropy": 0.3538,
- "majority_answer": "B",
- "majority_hit": false,
- "majority_distance": 1.1,
- "mean_distance": 0.9125
- }
- }
- },
- {
- "question_index": 1,
- "gold_letter": "E",
- "gold_text": "19%",
- "gold_numeric": 19.0,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "C",
- "C",
- "C",
- "C",
- "C",
- "C",
- "C",
- "F"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 0.875,
- "entropy": 0.1812,
- "majority_answer": "C",
- "majority_hit": false,
- "majority_distance": 10.0,
- "mean_distance": 11.25
- },
- "lek": {
- "rounds": [
- "I",
- "C",
- "C",
- "F",
- "C",
- "C",
- "G",
- "C"
- ],
- "hit_count": 0,
- "total_rounds": 8,
- "confidence": 0.625,
- "entropy": 0.5163,
- "majority_answer": "C",
- "majority_hit": false,
- "majority_distance": 10.0,
- "mean_distance": 21.25
- }
- }
- },
- {
- "question_index": 2,
- "gold_letter": "D",
- "gold_text": "The amount by which total output increases due to the addition of one unit of a given factor while the amount used of other factors of production remains unchanged",
- "gold_numeric": null,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D"
- ],
- "hit_count": 8,
- "total_rounds": 8,
- "confidence": 1.0,
- "entropy": -0.0,
- "majority_answer": "D",
- "majority_hit": true,
- "majority_distance": null,
- "mean_distance": null
- },
- "lek": {
- "rounds": [
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D",
- "D"
- ],
- "hit_count": 8,
- "total_rounds": 8,
- "confidence": 1.0,
- "entropy": -0.0,
- "majority_answer": "D",
- "majority_hit": true,
- "majority_distance": null,
- "mean_distance": null
- }
- }
- },
- {
- "question_index": 3,
- "gold_letter": "G",
- "gold_text": "(i) 585, (ii) Yes",
- "gold_numeric": 585.0,
- "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
- "models": {
- "base": {
- "rounds": [
- "?",
- "G",
- "G",
- "G",
- "D",
- "G",
- "G",
- "?"
- ],
- "hit_count": 5,
- "total_rounds": 8,
- "confidence": 0.625,
- "entropy": 0.4329,
- "majority_answer": "G",
- "majority_hit": true,
- "majority_distance": 0.0,
- "mean_distance": 22.5
- },
- "lek": {
- "rounds": [
- "?",
- "?",
- "?",
- "C",
- "A",
- "G",
- "N",
- "C"
- ],
- "hit_count": 2,
- "total_rounds": 8,
- "confidence": 0.375,
- "entropy": 0.7185,
- "majority_answer": "?",
- "majority_hit": false,
- "majority_distance": null,
- "mean_distance": 11.25
- }
- }
- }
- ]
- }