Snider Virgil committed on
Commit · 5cc8b73
Parent(s): 6825f72
eval: migrate results/ → .eval_results/ canonical storage
Same migration as lemer (d242f94): iter_*.parquet merged into a single
task-named canonical parquet, yaml + md regenerated from aggregates,
results/ folder removed.
Lemma n=4 r=8 scores from the first run:
base (gemma-4-e4b-it-4bit): 40.62% per-round, 50% majority (2/4)
lek (lthn/lemma): 31.25% per-round, 25% majority (1/4)
delta: -9.37pp per-round, -25pp majority
n=4 is well below the noise floor — these numbers will shift when more
runs contribute to the canon.
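For clarity, the two metrics quoted here can be computed like this; a minimal sketch, not the eval harness's actual code:

```python
# Per-round accuracy scores every sampled answer; majority accuracy casts
# one vote per question using the most common answer across its rounds.
# Minimal sketch only -- the real harness is not shown in this commit.
from collections import Counter


def per_round_and_majority(rounds_per_q: list[list[str]], golds: list[str]):
    hits = sum(r == g for rs, g in zip(rounds_per_q, golds) for r in rs)
    total = sum(len(rs) for rs in rounds_per_q)
    majority_hits = sum(
        Counter(rs).most_common(1)[0][0] == g
        for rs, g in zip(rounds_per_q, golds)
    )
    return 100.0 * hits / total, majority_hits, len(golds)
```

Fed the base answers recorded in this run's per-question data (F×8; C×7 plus F; D×8; ? G G G D G G ?), this yields 13/32 = 40.62% per-round and 2 of 4 on majority vote.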
Co-Authored-By: Virgil <virgil@lethean.io>
- .eval_results/mmlu_pro.md +21 -0
- results/iter_2026-04-11T13-02-52.parquet → .eval_results/mmlu_pro.parquet +2 -2
- .eval_results/mmlu_pro.yaml +10 -9
- results/iter_2026-04-11T13-02-52.json +0 -214
- results/latest.md +0 -0
- results/report.txt +0 -104
- results/summary.json +0 -214
.eval_results/mmlu_pro.md
ADDED
@@ -0,0 +1,21 @@
+# TIGER-Lab/MMLU-Pro / mmlu_pro — 8-PAC Canon
+
+Merged from 1 run(s) across 1 machine(s). Total rows: **64**.
+
+## Machines
+
+- `studio`: 64 rows
+
+## Scores
+
+| Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
+|---|---|---|---|---|---|---|
+| `base` | `mlx-community/gemma-4-e4b-it-4bit` | 32 | 4 | 8 | 40.62% | 50.00% (2/4) |
+| `lek` | `lthn/lemma` | 32 | 4 | 8 | 31.25% | 25.00% (1/4) |
+
+## LEK delta
+
+- per-round: **-9.37pp**
+- majority-vote: **-25.00pp**
+
+Last updated: 2026-04-11T12:42:49.576970+00:00
results/iter_2026-04-11T13-02-52.parquet → .eval_results/mmlu_pro.parquet
RENAMED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:7b69475a610d2f31e2d831e2b17c34c8ae1d1605cf4f2b54d3ddceca044add99
+size 166650
.eval_results/mmlu_pro.yaml
CHANGED
@@ -4,20 +4,21 @@
 value: 25.0
 date: '2026-04-11'
 source:
-url: https://huggingface.co/lthn/lemma/tree/main/
-name:
+url: https://huggingface.co/lthn/lemma/tree/main/.eval_results
+name: Canonical per-round parquet
 user: lthn
-notes: '8-PAC
-top_p=0.95, top_k=64), enable_thinking=True.
-(headline).'
+notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+Metric: majority-vote accuracy (headline).'
 - dataset:
 id: TIGER-Lab/MMLU-Pro
 task_id: mmlu_pro
 value: 31.25
 date: '2026-04-11'
 source:
-url: https://huggingface.co/lthn/lemma/tree/main/
-name:
+url: https://huggingface.co/lthn/lemma/tree/main/.eval_results
+name: Canonical per-round parquet
 user: lthn
-notes: '8-PAC
-top_p=0.95, top_k=64), enable_thinking=True.
+notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+Metric: per-round mean accuracy.'
results/iter_2026-04-11T13-02-52.json
DELETED
@@ -1,214 +0,0 @@
-{
-  "base_model": "mlx-community/gemma-4-e4b-it-4bit",
-  "this_model": "lthn/lemma",
-  "task": "mmlu_pro",
-  "n_questions": 4,
-  "rounds": 8,
-  "timestamp": 1775908972,
-  "totals": {
-    "base_hits": 13,
-    "lek_hits": 10,
-    "total_per_model": 32,
-    "base_accuracy_pct": 40.62,
-    "lek_accuracy_pct": 31.25,
-    "delta_pp": -9.38
-  },
-  "questions": [
-    {
-      "question_index": 0,
-      "gold_letter": "I",
-      "gold_text": "5.40MeV",
-      "gold_numeric": 5.4,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "F",
-            "F",
-            "F",
-            "F",
-            "F",
-            "F",
-            "F",
-            "F"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 1.0,
-          "entropy": -0.0,
-          "majority_answer": "F",
-          "majority_hit": false,
-          "majority_distance": 0.1,
-          "mean_distance": 0.1
-        },
-        "lek": {
-          "rounds": [
-            "F",
-            "G",
-            "B",
-            "B",
-            "B",
-            "B",
-            "B",
-            "B"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 0.75,
-          "entropy": 0.3538,
-          "majority_answer": "B",
-          "majority_hit": false,
-          "majority_distance": 1.1,
-          "mean_distance": 0.9125
-        }
-      }
-    },
-    {
-      "question_index": 1,
-      "gold_letter": "E",
-      "gold_text": "19%",
-      "gold_numeric": 19.0,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "C",
-            "C",
-            "C",
-            "C",
-            "C",
-            "C",
-            "C",
-            "F"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 0.875,
-          "entropy": 0.1812,
-          "majority_answer": "C",
-          "majority_hit": false,
-          "majority_distance": 10.0,
-          "mean_distance": 11.25
-        },
-        "lek": {
-          "rounds": [
-            "I",
-            "C",
-            "C",
-            "F",
-            "C",
-            "C",
-            "G",
-            "C"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 0.625,
-          "entropy": 0.5163,
-          "majority_answer": "C",
-          "majority_hit": false,
-          "majority_distance": 10.0,
-          "mean_distance": 21.25
-        }
-      }
-    },
-    {
-      "question_index": 2,
-      "gold_letter": "D",
-      "gold_text": "The amount by which total output increases due to the addition of one unit of a given factor while the amount used of other factors of production remains unchanged",
-      "gold_numeric": null,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D"
-          ],
-          "hit_count": 8,
-          "total_rounds": 8,
-          "confidence": 1.0,
-          "entropy": -0.0,
-          "majority_answer": "D",
-          "majority_hit": true,
-          "majority_distance": null,
-          "mean_distance": null
-        },
-        "lek": {
-          "rounds": [
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D"
-          ],
-          "hit_count": 8,
-          "total_rounds": 8,
-          "confidence": 1.0,
-          "entropy": -0.0,
-          "majority_answer": "D",
-          "majority_hit": true,
-          "majority_distance": null,
-          "mean_distance": null
-        }
-      }
-    },
-    {
-      "question_index": 3,
-      "gold_letter": "G",
-      "gold_text": "(i) 585, (ii) Yes",
-      "gold_numeric": 585.0,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "?",
-            "G",
-            "G",
-            "G",
-            "D",
-            "G",
-            "G",
-            "?"
-          ],
-          "hit_count": 5,
-          "total_rounds": 8,
-          "confidence": 0.625,
-          "entropy": 0.4329,
-          "majority_answer": "G",
-          "majority_hit": true,
-          "majority_distance": 0.0,
-          "mean_distance": 22.5
-        },
-        "lek": {
-          "rounds": [
-            "?",
-            "?",
-            "?",
-            "C",
-            "A",
-            "G",
-            "N",
-            "C"
-          ],
-          "hit_count": 2,
-          "total_rounds": 8,
-          "confidence": 0.375,
-          "entropy": 0.7185,
-          "majority_answer": "?",
-          "majority_hit": false,
-          "majority_distance": null,
-          "mean_distance": 11.25
-        }
-      }
-    }
-  ]
-}
results/latest.md
DELETED
The diff for this file is too large to render.
results/report.txt
DELETED
@@ -1,104 +0,0 @@
-==============================================================================
-n=4 questions × 8 rounds × 2 models = 64 samples
-base: mlx-community/gemma-4-e4b-it-4bit
-lek: lthn/lemma
-task: mmlu_pro
-==============================================================================
-
-──────────────────────────────────────────────────────────────────────────────
-Q0: Answer the following multiple choice question. The last line of your response should be of the follo...
-gold = I: 5.40MeV
-  (numeric: 5.4)
-──────────────────────────────────────────────────────────────────────────────
-
-[base] rounds: F F F F F F F F   hits: 0/8
-  F: ████████ (8/8)
-  confidence (max-share): 1.00 ██████████
-  entropy (normalised): -0.00 (0=concentrated, 1=spread)
-  majority distance: 0.100
-  mean round distance: 0.100
-
-[lek] rounds: F G B B B B B B   hits: 0/8
-  B: ██████░░ (6/8)
-  F: █░░░░░░░ (1/8)
-  G: █░░░░░░░ (1/8)
-  confidence (max-share): 0.75 ███████░░░
-  entropy (normalised): 0.35 (0=concentrated, 1=spread)
-  majority distance: 1.100
-  mean round distance: 0.912
-
-──────────────────────────────────────────────────────────────────────────────
-Q1: Answer the following multiple choice question. The last line of your response should be of the follo...
-gold = E: 19%
-  (numeric: 19.0)
-──────────────────────────────────────────────────────────────────────────────
-
-[base] rounds: C C C C C C C F   hits: 0/8
-  C: ███████░ (7/8)
-  F: █░░░░░░░ (1/8)
-  confidence (max-share): 0.88 ████████░░
-  entropy (normalised): 0.18 (0=concentrated, 1=spread)
-  majority distance: 10.000
-  mean round distance: 11.250
-
-[lek] rounds: I C C F C C G C   hits: 0/8
-  C: █████░░░ (5/8)
-  I: █░░░░░░░ (1/8)
-  F: █░░░░░░░ (1/8)
-  G: █░░░░░░░ (1/8)
-  confidence (max-share): 0.62 ██████░░░░
-  entropy (normalised): 0.52 (0=concentrated, 1=spread)
-  majority distance: 10.000
-  mean round distance: 21.250
-
-──────────────────────────────────────────────────────────────────────────────
-Q2: Answer the following multiple choice question. The last line of your response should be of the follo...
-gold = D: The amount by which total output increases due to the addition of one unit of a given factor while the amount used of other factors of production remains unchanged
-──────────────────────────────────────────────────────────────────────────────
-
-[base] rounds: [D][D][D][D][D][D][D][D]   hits: 8/8
-  D: ████████ (8/8)
-  confidence (max-share): 1.00 ██████████
-  entropy (normalised): -0.00 (0=concentrated, 1=spread)
-
-[lek] rounds: [D][D][D][D][D][D][D][D]   hits: 8/8
-  D: ████████ (8/8)
-  confidence (max-share): 1.00 ██████████
-  entropy (normalised): -0.00 (0=concentrated, 1=spread)
-
-──────────────────────────────────────────────────────────────────────────────
-Q3: Answer the following multiple choice question. The last line of your response should be of the follo...
-gold = G: (i) 585, (ii) Yes
-  (numeric: 585.0)
-──────────────────────────────────────────────────────────────────────────────
-
-[base] rounds: ? [G][G][G] D [G][G] ?   hits: 5/8
-  G: █████░░░ (5/8)
-  ?: ██░░░░░░ (2/8)
-  D: █░░░░░░░ (1/8)
-  confidence (max-share): 0.62 ██████░░░░
-  entropy (normalised): 0.43 (0=concentrated, 1=spread)
-  majority distance: 0.000
-  mean round distance: 22.500
-
-[lek] rounds: ? ? ? C A [G] N C   hits: 2/8
-  ?: ███░░░░░ (3/8)
-  C: ██░░░░░░ (2/8)
-  A: █░░░░░░░ (1/8)
-  G: █░░░░░░░ (1/8)
-  N: █░░░░░░░ (1/8)
-  confidence (max-share): 0.38 ███░░░░░░░
-  entropy (normalised): 0.72 (0=concentrated, 1=spread)
-  mean round distance: 11.250
-
-==============================================================================
-Cross-question summary
-==============================================================================
-
-Q    base_conf  lek_conf  base_hit  lek_hit  delta
-Q0   1.00       0.75      0/8       0/8      +0
-Q1   0.88       0.62      0/8       0/8      +0
-Q2   1.00       1.00      8/8       8/8      +0
-Q3   0.62       0.38      5/8       2/8      -3
-
-TOTAL: base 13/32 (40.6%)   lek 10/32 (31.2%)   delta -9.4pp
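The confidence (max-share) and normalised-entropy figures in the deleted report can be reproduced as below; a sketch that matches the reported numbers, not the harness's own implementation (natural log normalised by log of the round count is inferred from the values):

```python
# Max-share confidence is the vote share of the most common answer;
# entropy is Shannon entropy of the answer distribution, normalised by
# log(n_rounds) so 0 = fully concentrated and 1 = fully spread.
from collections import Counter
from math import log


def confidence_and_entropy(rounds: list[str]) -> tuple[float, float]:
    counts = Counter(rounds)
    n = len(rounds)
    confidence = counts.most_common(1)[0][1] / n
    h = -sum((c / n) * log(c / n) for c in counts.values())
    return confidence, h / log(n)
```

For the lek answers on Q0 (B×6 plus F and G) this gives confidence 0.75 and normalised entropy 0.3538, matching the report and JSON above.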
results/summary.json
DELETED
@@ -1,214 +0,0 @@
-{
-  "base_model": "mlx-community/gemma-4-e4b-it-4bit",
-  "this_model": "lthn/lemma",
-  "task": "mmlu_pro",
-  "n_questions": 4,
-  "rounds": 8,
-  "timestamp": 1775908972,
-  "totals": {
-    "base_hits": 13,
-    "lek_hits": 10,
-    "total_per_model": 32,
-    "base_accuracy_pct": 40.62,
-    "lek_accuracy_pct": 31.25,
-    "delta_pp": -9.38
-  },
-  "questions": [
-    {
-      "question_index": 0,
-      "gold_letter": "I",
-      "gold_text": "5.40MeV",
-      "gold_numeric": 5.4,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "F",
-            "F",
-            "F",
-            "F",
-            "F",
-            "F",
-            "F",
-            "F"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 1.0,
-          "entropy": -0.0,
-          "majority_answer": "F",
-          "majority_hit": false,
-          "majority_distance": 0.1,
-          "mean_distance": 0.1
-        },
-        "lek": {
-          "rounds": [
-            "F",
-            "G",
-            "B",
-            "B",
-            "B",
-            "B",
-            "B",
-            "B"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 0.75,
-          "entropy": 0.3538,
-          "majority_answer": "B",
-          "majority_hit": false,
-          "majority_distance": 1.1,
-          "mean_distance": 0.9125
-        }
-      }
-    },
-    {
-      "question_index": 1,
-      "gold_letter": "E",
-      "gold_text": "19%",
-      "gold_numeric": 19.0,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "C",
-            "C",
-            "C",
-            "C",
-            "C",
-            "C",
-            "C",
-            "F"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 0.875,
-          "entropy": 0.1812,
-          "majority_answer": "C",
-          "majority_hit": false,
-          "majority_distance": 10.0,
-          "mean_distance": 11.25
-        },
-        "lek": {
-          "rounds": [
-            "I",
-            "C",
-            "C",
-            "F",
-            "C",
-            "C",
-            "G",
-            "C"
-          ],
-          "hit_count": 0,
-          "total_rounds": 8,
-          "confidence": 0.625,
-          "entropy": 0.5163,
-          "majority_answer": "C",
-          "majority_hit": false,
-          "majority_distance": 10.0,
-          "mean_distance": 21.25
-        }
-      }
-    },
-    {
-      "question_index": 2,
-      "gold_letter": "D",
-      "gold_text": "The amount by which total output increases due to the addition of one unit of a given factor while the amount used of other factors of production remains unchanged",
-      "gold_numeric": null,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D"
-          ],
-          "hit_count": 8,
-          "total_rounds": 8,
-          "confidence": 1.0,
-          "entropy": -0.0,
-          "majority_answer": "D",
-          "majority_hit": true,
-          "majority_distance": null,
-          "mean_distance": null
-        },
-        "lek": {
-          "rounds": [
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D",
-            "D"
-          ],
-          "hit_count": 8,
-          "total_rounds": 8,
-          "confidence": 1.0,
-          "entropy": -0.0,
-          "majority_answer": "D",
-          "majority_hit": true,
-          "majority_distance": null,
-          "mean_distance": null
-        }
-      }
-    },
-    {
-      "question_index": 3,
-      "gold_letter": "G",
-      "gold_text": "(i) 585, (ii) Yes",
-      "gold_numeric": 585.0,
-      "question_body": "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step befo",
-      "models": {
-        "base": {
-          "rounds": [
-            "?",
-            "G",
-            "G",
-            "G",
-            "D",
-            "G",
-            "G",
-            "?"
-          ],
-          "hit_count": 5,
-          "total_rounds": 8,
-          "confidence": 0.625,
-          "entropy": 0.4329,
-          "majority_answer": "G",
-          "majority_hit": true,
-          "majority_distance": 0.0,
-          "mean_distance": 22.5
-        },
-        "lek": {
-          "rounds": [
-            "?",
-            "?",
-            "?",
-            "C",
-            "A",
-            "G",
-            "N",
-            "C"
-          ],
-          "hit_count": 2,
-          "total_rounds": 8,
-          "confidence": 0.375,
-          "entropy": 0.7185,
-          "majority_answer": "?",
-          "majority_hit": false,
-          "majority_distance": null,
-          "mean_distance": 11.25
-        }
-      }
-    }
-  ]
-}