Snider Virgil committed on
Commit 486294c · 1 Parent(s): 45b8593

eval: first 8-PAC paired run — strong LEK delta signal


First paired benchmark for lemmy (26B A4B MoE) vs gemma-4-26b-a4b-it-4bit
on MMLU-Pro. n=4 r=8, same methodology as lemer/lemma.

Results:
base: 40.62% per-round, 25% majority (1/4)
lek: 71.88% per-round, 75% majority (3/4)
delta: +31.26pp per-round, +50pp majority
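The two metrics above can be illustrated with a small sketch (hypothetical correctness flags, same 4-questions × 8-rounds shape as this run): per-round accuracy is the mean over all 32 samples, majority accuracy counts a question correct when more than half of its rounds are correct.

```python
# Hypothetical correctness grid: rows = questions, cols = rounds.
# True means the model's answer that round matched the gold answer.
rounds = [
    [True, True, False, True, True, True, False, True],     # q1: majority correct
    [False, False, True, False, False, True, False, False], # q2: majority wrong
    [True, True, True, True, True, True, True, True],       # q3: majority correct
    [True, False, True, True, False, True, True, True],     # q4: majority correct
]

# Per-round: mean accuracy over all question x round samples.
flat = [c for q in rounds for c in q]
per_round = sum(flat) / len(flat)

# Majority vote: a question counts as correct if >half its rounds are correct.
majority = sum(1 for q in rounds if sum(q) > len(q) / 2) / len(rounds)

print(f"per-round: {per_round:.2%}, majority: {majority:.2%}")
# → per-round: 68.75%, majority: 75.00%
```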

n=4 is well below the noise floor for a 32-domain benchmark, so treat this
as a directional signal, not a statistical claim. But a delta of this
direction and magnitude on the biggest MoE in the family is worth capturing
in the canon ahead of larger runs.
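The caveat is easy to quantify: at n=4 a single flipped majority moves the headline by 25pp, and even a rough binomial interval around 3/4 spans most of the scale. A sketch using the normal-approximation (Wald) interval, illustrative only:

```python
import math

def wald_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Rough 95% normal-approximation CI for a binomial proportion k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# lek majority vote: 3 of 4 questions correct.
lo, hi = wald_interval(3, 4)
print(f"3/4 majority: 75% (95% CI ~{lo:.0%}-{hi:.0%})")
# → 3/4 majority: 75% (95% CI ~33%-100%)
```

The interval is so wide that only the sign of the delta, not its size, carries information at this sample count.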

Canonical storage is .eval_results/mmlu_pro.{parquet,yaml,md} — same
layout as lemer and lemma.
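The canonical parquet can be re-aggregated with pandas; a minimal sketch, with a synthetic stand-in for `pd.read_parquet(".eval_results/mmlu_pro.parquet")` — the column names (`side`, `question`, `round`, `correct`) are assumptions here, not the confirmed schema:

```python
import pandas as pd

# Stand-in rows mirroring the assumed per-sample layout; in practice this
# would come from pd.read_parquet(".eval_results/mmlu_pro.parquet").
df = pd.DataFrame({
    "side":     ["base"] * 4 + ["lek"] * 4,
    "question": [0, 1, 2, 3] * 2,
    "round":    [0] * 8,
    "correct":  [True, False, False, False, True, True, True, False],
})

# Per-round accuracy per side, then the LEK delta.
per_round = df.groupby("side")["correct"].mean()
delta = per_round["lek"] - per_round["base"]
print(f"per-round delta: {delta:+.2%}")
# → per-round delta: +50.00%
```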

Co-Authored-By: Virgil <virgil@lethean.io>

.eval_results/mmlu_pro.md ADDED
@@ -0,0 +1,21 @@
+# TIGER-Lab/MMLU-Pro / mmlu_pro — 8-PAC Canon
+
+Merged from 1 run(s) across 1 machine(s). Total rows: **64**.
+
+## Machines
+
+- `studio`: 64 rows
+
+## Scores
+
+| Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
+|---|---|---|---|---|---|---|
+| `base` | `mlx-community/gemma-4-26b-a4b-it-4bit` | 32 | 4 | 8 | 40.62% | 25.00% (1/4) |
+| `lek` | `lthn/lemmy` | 32 | 4 | 8 | 71.88% | 75.00% (3/4) |
+
+## LEK delta
+
+- per-round: **+31.26pp**
+- majority-vote: **+50.00pp**
+
+Last updated: 2026-04-11T12:43:21.933401+00:00
.eval_results/mmlu_pro.parquet ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:353d1f8e9ebae3a1c9fe81c64b82a9919759b1303b36242f9318f1af2e5a94de
+size 205694
.eval_results/mmlu_pro.yaml ADDED
@@ -0,0 +1,24 @@
+- dataset:
+    id: TIGER-Lab/MMLU-Pro
+    task_id: mmlu_pro
+    value: 75.0
+  date: '2026-04-11'
+  source:
+    url: https://huggingface.co/lthn/lemmy/tree/main/.eval_results
+    name: Canonical per-round parquet
+    user: lthn
+  notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+    and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+    Metric: majority-vote accuracy (headline).'
+- dataset:
+    id: TIGER-Lab/MMLU-Pro
+    task_id: mmlu_pro
+    value: 71.88
+  date: '2026-04-11'
+  source:
+    url: https://huggingface.co/lthn/lemmy/tree/main/.eval_results
+    name: Canonical per-round parquet
+    user: lthn
+  notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+    and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+    Metric: per-round mean accuracy.'