eval: lemer/gguf/Q4_K_M advance (2026-04-12T20:24:25Z)

Files changed (3)

.eval_results/toxigen.gguf.Q4_K_M.md CHANGED

@@ -1,21 +1,21 @@
 # TIGER-Lab/MMLU-Pro / toxigen — 8-PAC Canon
-Merged from 248 run(s) across 1 machine(s). Total rows: **3968**.
 ## Machines
-- `charon.lthn.io`: 3968 rows
 ## Scores
 | Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
 |---|---|---|---|---|---|---|
-| `base` | `hf.co/LetheanNetwork/lemer:Q4_K_M` | 1984 | 248 | 8 | 28.78% | 27.82% (69/248) |
-| `lek` | `hf.co/lthn/lemer:Q4_K_M` | 1984 | 248 | 8 | 21.57% | 19.76% (49/248) |
 ## LEK delta
-- per-round: **-7.21pp**
-- majority-vote: **-8.06pp**
-Last updated: 2026-04-12T20:22:16.322298+00:00

 # TIGER-Lab/MMLU-Pro / toxigen — 8-PAC Canon
+Merged from 249 run(s) across 1 machine(s). Total rows: **3984**.
 ## Machines
+- `charon.lthn.io`: 3984 rows
 ## Scores
 | Side | Model | Samples | Questions | Rounds | Per-round acc | Majority acc |
 |---|---|---|---|---|---|---|
+| `base` | `hf.co/LetheanNetwork/lemer:Q4_K_M` | 1992 | 249 | 8 | 28.66% | 27.71% (69/249) |
+| `lek` | `hf.co/lthn/lemer:Q4_K_M` | 1992 | 249 | 8 | 21.49% | 19.68% (49/249) |
 ## LEK delta
+- per-round: **-7.17pp**
+- majority-vote: **-8.03pp**
+Last updated: 2026-04-12T20:24:25.349479+00:00
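
Review note: the updated figures in the hunk above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit. It assumes the LEK delta is simply the `lek` value minus the `base` value, rounded to two decimals, and it takes the per-round percentages straight from the table because the diff does not show the underlying per-round counts. The helper name `pct` is made up for illustration.

```python
# Cross-check of the new (249-question) figures in the Scores table.
# Assumption: delta = lek - base, each value rounded to two decimals first.

def pct(correct: int, total: int) -> float:
    """Percentage rounded to two decimals, matching the table's precision."""
    return round(100.0 * correct / total, 2)

questions, rounds = 249, 8
samples_per_side = questions * rounds   # samples for one side (base or lek)
total_rows = 2 * samples_per_side       # both sides are stored in the parquet

base_majority = pct(69, questions)      # table shows 69/249
lek_majority = pct(49, questions)       # table shows 49/249
majority_delta = round(lek_majority - base_majority, 2)

# Per-round accuracies are copied from the table, not recomputed.
per_round_delta = round(21.49 - 28.66, 2)

print(samples_per_side, total_rows)
print(base_majority, lek_majority, majority_delta, per_round_delta)
```

Under these assumptions the recomputed values agree with the committed ones (1992 samples per side, 3984 rows, -8.03pp majority-vote, -7.17pp per-round).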

.eval_results/toxigen.gguf.Q4_K_M.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0c23a04d2b72b6f31133b5147c59a126b2d8b5532b87151a964ff7f094ab50b6
-size 600993

 version https://git-lfs.github.com/spec/v1
+oid sha256:cece15955d8038138822f4abf5a9e1bcf20a5f9328d328671d21a80ce0cf7330
+size 606396
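
Review note: the parquet itself is stored through Git LFS, so the diff only shows the pointer file (spec version, SHA-256 object id, byte size). As a rough sketch of how such a pointer relates to the file contents, the function below rebuilds the three pointer lines from raw bytes. The name `lfs_pointer` is hypothetical, and this is an illustration of the pointer layout seen above, not the Git LFS client's own code.

```python
import hashlib

def lfs_pointer(data: bytes) -> str:
    """Build pointer text in the three-line layout shown in the diff above."""
    oid = hashlib.sha256(data).hexdigest()  # hex digest of the file contents
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )

# Usage with placeholder bytes; the real input would be the parquet file's contents.
pointer = lfs_pointer(b"example bytes")
print(pointer)
```

Running the same function over a local copy of the new parquet should reproduce the `oid` and `size` lines on the `+` side if the file matches what was pushed.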

.eval_results/toxigen.gguf.Q4_K_M.yaml CHANGED

@@ -2,14 +2,14 @@
       id: TIGER-Lab/MMLU-Pro
       task_id: mmlu_pro
       revision: 3373e0b32277875b8db2aa555a333b78a08477ea
-    value: 19.76
     date: '2026-04-12'
     source:
       url:
         https://huggingface.co/hf.co/lthn/lemer:Q4_K_M/tree/main/.eval_results
       name: LEM-benchmarks canonical parquet
       user: lthn
-    notes: "8-PAC merged canon, 248 questions × 8 rounds = 1984 samples across 1 machine(s)
-      and 248 run(s). Paired A/B vs hf.co/LetheanNetwork/lemer:Q4_K_M under Google-calibrated
       sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True. Headline metric:
-      majority-vote accuracy (LEK'd side). Per-round mean accuracy: 21.57%."

       id: TIGER-Lab/MMLU-Pro
       task_id: mmlu_pro
       revision: 3373e0b32277875b8db2aa555a333b78a08477ea
+    value: 19.68
     date: '2026-04-12'
     source:
       url:
         https://huggingface.co/hf.co/lthn/lemer:Q4_K_M/tree/main/.eval_results
       name: LEM-benchmarks canonical parquet
       user: lthn
+    notes: "8-PAC merged canon, 249 questions × 8 rounds = 1992 samples across 1 machine(s)
+      and 249 run(s). Paired A/B vs hf.co/LetheanNetwork/lemer:Q4_K_M under Google-calibrated
       sampling (temp=1.0, top_p=0.95, top_k=64), enable_thinking=True. Headline metric:
+      majority-vote accuracy (LEK'd side). Per-round mean accuracy: 21.49%."
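
Review note: the notes field distinguishes the headline majority-vote accuracy from the per-round mean accuracy. The sketch below shows one plausible way the two could be computed from repeated rounds; it is not taken from the LEM-benchmarks code. The function names, the toy answers, and the tie-breaking rule (first most-common answer wins) are all assumptions made for illustration.

```python
from collections import Counter

def per_round_accuracy(answers: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Share of individual samples that match the gold answer, in percent."""
    total = sum(len(rounds) for rounds in answers.values())
    correct = sum(a == gold[q] for q, rounds in answers.items() for a in rounds)
    return 100.0 * correct / total

def majority_accuracy(answers: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Share of questions whose most frequent answer matches gold, in percent.

    Assumption: ties go to whichever tied answer appeared first in the rounds.
    """
    hits = 0
    for q, rounds in answers.items():
        top_answer, _count = Counter(rounds).most_common(1)[0]
        hits += top_answer == gold[q]
    return 100.0 * hits / len(answers)

# Toy data: 3 questions x 4 rounds (the eval in this commit uses 8 rounds).
gold = {"q1": "A", "q2": "C", "q3": "B"}
answers = {
    "q1": ["A", "A", "B", "A"],
    "q2": ["D", "C", "D", "D"],
    "q3": ["B", "B", "B", "C"],
}

per_round = per_round_accuracy(answers, gold)
majority = majority_accuracy(answers, gold)
print(f"per-round: {per_round:.2f}%  majority: {majority:.2f}%")
```

The two metrics can diverge in either direction: a question answered correctly in a minority of rounds still contributes to the per-round mean but counts as a miss under majority vote, which is consistent with the `lek` side showing 21.49% per-round against 19.68% majority.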