Snider Virgil committed
Commit 486294c · Parent(s): 45b8593
eval: first 8-PAC paired run — strong LEK delta signal
First paired benchmark for lemmy (26B A4B MoE) vs gemma-4-26b-a4b-it-4bit
on MMLU-Pro. n=4 r=8, same methodology as lemer/lemma.
Results:
base: 40.62% per-round, 25% majority (1/4)
lek: 71.88% per-round, 75% majority (3/4)
delta: +31.26pp per-round, +50pp majority
n=4 questions is far too few to clear the noise floor of a 32-domain
benchmark; this is a directional signal, not a statistical claim. But the
direction and magnitude of the delta on the biggest MoE in the family are
worth capturing in the canon until larger runs land.
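As a rough sketch of why n=4 stays directional (Wilson score intervals are my choice of illustration here, not a method from this repo): the 95% intervals for the 1/4 and 3/4 majority-vote counts are both wide and overlap substantially.

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials (z=1.96 -> ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Majority-vote counts from this run: base 1/4, lek 3/4.
for label, k in (("base", 1), ("lek", 3)):
    lo, hi = wilson_ci(k, 4)
    print(f"{label}: {k}/4 -> 95% CI ({lo:.2f}, {hi:.2f})")
```

Both intervals span most of the unit range and overlap, which is exactly the "directional, not statistical" caveat above.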
Canonical storage is .eval_results/mmlu_pro.{parquet,yaml,md} — same
layout as lemer and lemma.
Co-Authored-By: Virgil <virgil@lethean.io>
- .eval_results/mmlu_pro.md +21 -0
- .eval_results/mmlu_pro.parquet +3 -0
- .eval_results/mmlu_pro.yaml +24 -0
.eval_results/mmlu_pro.md ADDED
@@ -0,0 +1,21 @@
+# TIGER-Lab/MMLU-Pro / mmlu_pro — 8-PAC Canon
+
+Merged from 1 run(s) across 1 machine(s). Total rows: **64**.
+
+## Machines
+
+- `studio`: 64 rows
+
+## Scores
+
+| Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
+|---|---|---|---|---|---|---|
+| `base` | `mlx-community/gemma-4-26b-a4b-it-4bit` | 32 | 4 | 8 | 40.62% | 25.00% (1/4) |
+| `lek` | `lthn/lemmy` | 32 | 4 | 8 | 71.88% | 75.00% (3/4) |
+
+## LEK delta
+
+- per-round: **+31.26pp**
+- majority-vote: **+50.00pp**
+
+Last updated: 2026-04-11T12:43:21.933401+00:00
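The per-round vs majority-vote split reported above can be sketched as follows. Column names and rows are hypothetical stand-ins; the canonical parquet's actual schema is not shown in this commit.

```python
from collections import Counter

# Hypothetical per-round rows as (question_id, prediction, gold) — the real
# parquet schema is not shown in the commit, so these names are assumptions.
rows = [
    ("q1", "A", "A"), ("q1", "B", "A"), ("q1", "A", "A"), ("q1", "A", "A"),
    ("q2", "C", "D"), ("q2", "D", "D"), ("q2", "D", "D"), ("q2", "D", "D"),
]

# Per-round accuracy: every (question, round) sample counts once.
per_round = sum(pred == gold for _, pred, gold in rows) / len(rows)

# Majority-vote accuracy: collapse each question's rounds into one vote.
by_question: dict[str, dict] = {}
for qid, pred, gold in rows:
    entry = by_question.setdefault(qid, {"preds": [], "gold": gold})
    entry["preds"].append(pred)
majority = sum(
    Counter(e["preds"]).most_common(1)[0][0] == e["gold"]
    for e in by_question.values()
) / len(by_question)

print(per_round, majority)  # 0.75 1.0
```

This is why the two metrics can diverge so sharply at n=4: majority voting rounds each question to a single hit or miss.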
.eval_results/mmlu_pro.parquet ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:353d1f8e9ebae3a1c9fe81c64b82a9919759b1303b36242f9318f1af2e5a94de
+size 205694
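The parquet is committed as a git-lfs pointer, so a consumer can verify a downloaded blob against the pointer's `oid`/`size` fields. A minimal sketch (the helper name is mine, and a tiny stand-in blob is used in place of the real 205,694-byte parquet):

```python
import hashlib
import tempfile

def check_lfs_pointer(pointer_text: str, blob_path: str) -> bool:
    """Parse a git-lfs pointer (version/oid/size lines) and verify the
    blob's SHA-256 digest and byte size against it."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    oid = fields["oid"].split(":", 1)[1]
    size = int(fields["size"])
    with open(blob_path, "rb") as f:
        data = f.read()
    return hashlib.sha256(data).hexdigest() == oid and len(data) == size

# Demo with a stand-in blob rather than the real parquet.
blob = b"stand-in parquet bytes"
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(blob)
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(blob).hexdigest()}\n"
    f"size {len(blob)}\n"
)
print(check_lfs_pointer(pointer, f.name))  # True
```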
.eval_results/mmlu_pro.yaml ADDED
@@ -0,0 +1,24 @@
+- dataset:
+    id: TIGER-Lab/MMLU-Pro
+    task_id: mmlu_pro
+  value: 75.0
+  date: '2026-04-11'
+  source:
+    url: https://huggingface.co/lthn/lemmy/tree/main/.eval_results
+    name: Canonical per-round parquet
+    user: lthn
+  notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+    and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+    Metric: majority-vote accuracy (headline).'
+- dataset:
+    id: TIGER-Lab/MMLU-Pro
+    task_id: mmlu_pro
+  value: 71.88
+  date: '2026-04-11'
+  source:
+    url: https://huggingface.co/lthn/lemmy/tree/main/.eval_results
+    name: Canonical per-round parquet
+    user: lthn
+  notes: '8-PAC merged canon: 4 questions × 8 rounds = 32 samples across 1 machine(s)
+    and 1 run(s). Google-calibrated sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True.
+    Metric: per-round mean accuracy.'