Snider Virgil committed on
Commit 486294c · 1 Parent(s): 45b8593

eval: first 8-PAC paired run — strong LEK delta signal


First paired benchmark for lemmy (26B A4B MoE) vs gemma-4-26b-a4b-it-4bit
on MMLU-Pro. n=4 r=8, same methodology as lemer/lemma.

Results:
base: 40.62% per-round, 25% majority (1/4)
lek: 71.88% per-round, 75% majority (3/4)
delta: +31.26pp per-round, +50pp majority
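The two metrics above can be illustrated with a small sketch (hypothetical correctness flags, same 4-questions × 8-rounds shape as this run): per-round accuracy is the mean over all 32 samples, majority accuracy counts a question correct when more than half of its rounds are correct.

```python
# Hypothetical correctness grid: rows = questions, cols = rounds.
# True means the model's answer that round matched the gold answer.
rounds = [
    [True, True, False, True, True, True, False, True],     # q1: majority correct
    [False, False, True, False, False, True, False, False], # q2: majority wrong
    [True, True, True, True, True, True, True, True],       # q3: majority correct
    [True, False, True, True, False, True, True, True],     # q4: majority correct
]

# Per-round: mean accuracy over all question x round samples.
flat = [c for q in rounds for c in q]
per_round = sum(flat) / len(flat)

# Majority vote: a question counts as correct if >half its rounds are correct.
majority = sum(1 for q in rounds if sum(q) > len(q) / 2) / len(rounds)

print(f"per-round: {per_round:.2%}, majority: {majority:.2%}")
# → per-round: 68.75%, majority: 75.00%
```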

n=4 is well below the noise floor for a 32-domain benchmark, so treat this
as a directional signal, not a statistical claim. But a delta of this
direction and magnitude on the biggest MoE in the family is worth capturing
in the canon ahead of larger runs.
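The caveat is easy to quantify: at n=4 a single flipped majority moves the headline by 25pp, and even a rough binomial interval around 3/4 spans most of the scale. A sketch using the normal-approximation (Wald) interval, illustrative only:

```python
import math

def wald_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Rough 95% normal-approximation CI for a binomial proportion k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# lek majority vote: 3 of 4 questions correct.
lo, hi = wald_interval(3, 4)
print(f"3/4 majority: 75% (95% CI ~{lo:.0%}-{hi:.0%})")
# → 3/4 majority: 75% (95% CI ~33%-100%)
```

The interval is so wide that only the sign of the delta, not its size, carries information at this sample count.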

Canonical storage is .eval_results/mmlu_pro.{parquet,yaml,md} — same
layout as lemer and lemma.
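The canonical parquet can be re-aggregated with pandas; a minimal sketch, with a synthetic stand-in for `pd.read_parquet(".eval_results/mmlu_pro.parquet")` — the column names (`side`, `question`, `round`, `correct`) are assumptions here, not the confirmed schema:

```python
import pandas as pd

# Stand-in rows mirroring the assumed per-sample layout; in practice this
# would come from pd.read_parquet(".eval_results/mmlu_pro.parquet").
df = pd.DataFrame({
    "side":     ["base"] * 4 + ["lek"] * 4,
    "question": [0, 1, 2, 3] * 2,
    "round":    [0] * 8,
    "correct":  [True, False, False, False, True, True, True, False],
})

# Per-round accuracy per side, then the LEK delta.
per_round = df.groupby("side")["correct"].mean()
delta = per_round["lek"] - per_round["base"]
print(f"per-round delta: {delta:+.2%}")
# → per-round delta: +50.00%
```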

Co-Authored-By: Virgil <virgil@lethean.io>

.eval_results/mmlu_pro.md ADDED
@@ -0,0 +1,21 @@
+# TIGER-Lab/MMLU-Pro / mmlu_pro — 8-PAC Canon
+
+Merged from 1 run(s) across 1 machine(s). Total rows: **64**.
+
+## Machines
+
+- `studio`: 64 rows
+
+## Scores
+
+| Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
+|---|---|---|---|---|---|---|
+| `base` | `mlx-community/gemma-4-26b-a4b-it-4bit` | 32 | 4 | 8 | 40.62% | 25.00% (1/4) |
+| `lek` | `lthn/lemmy` | 32 | 4 | 8 | 71.88% | 75.00% (3/4) |
+
+## LEK delta
+
+- per-round: **+31.26pp**
+- majority-vote: **+50.00pp**
+
+Last updated: 2026-04-11T12:43:21.933401+00:00
.eval_results/mmlu_pro.parquet ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:353d1f8e9ebae3a1c9fe81c64b82a9919759b1303b36242f9318f1af2e5a94de
+size 205694
.eval_results/mmlu_pro.yaml ADDED
@@ -0,0 +1,24 @@
+- dataset:
+    id: TIGER-Lab/MMLU-Pro
+    task_id: mmlu_pro
+    value: 75.0
+  date: '2026-04-11'
+  source:
+    url: https://huggingface.co/lthn/lemmy/tree/main/.eval_results
+    name: Canonical per-round parquet
+    user: lthn
+  notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+    and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+    Metric: majority-vote accuracy (headline).'
+- dataset:
+    id: TIGER-Lab/MMLU-Pro
+    task_id: mmlu_pro
+    value: 71.88
+  date: '2026-04-11'
+  source:
+    url: https://huggingface.co/lthn/lemmy/tree/main/.eval_results
+    name: Canonical per-round parquet
+    user: lthn
+  notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+    and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+    Metric: per-round mean accuracy.'