Loss of 3.79% on MMLU Test from 7021 Samples

#4
by llmfan46 - opened

The model card says the model is smarter (it beats the base model on 6 of 7 benchmarks) and uncensored too.

But upon testing, the model shows a loss of 3.79 points on an MMLU test of 7021 samples (the loss could be even larger if tested on the full 14042 samples).

Here are the results:

MMLU Test Results:

Model: gemma-4-31b-it (Original model)

============================================================

  • Total questions: 7021

  • Correct: 6096

  • Accuracy: 0.8683 (86.83%)

  • Parse failures: 32

============================================================

Results by subject (largest first):

  • professional_law: 0.7618 (598/785)
  • moral_scenarios: 0.8371 (370/442)
  • miscellaneous: 0.9269 (355/383)
  • professional_psychology: 0.8956 (283/316)
  • high_school_psychology: 0.9667 (261/270)
  • high_school_macroeconomics: 0.9289 (183/197)
  • prehistory: 0.9419 (162/172)
  • moral_disputes: 0.8218 (143/174)
  • elementary_mathematics: 0.9402 (173/184)
  • philosophy: 0.8553 (136/159)

Model: gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking

============================================================

  • Total questions: 7021

  • Correct: 5830

  • Accuracy: 0.8304 (83.04%)

  • Parse failures: 25

============================================================

Results by subject (largest first):

  • professional_law: 0.6777 (532/785)
  • moral_scenarios: 0.6516 (288/442)
  • miscellaneous: 0.9243 (354/383)
  • professional_psychology: 0.8734 (276/316)
  • high_school_psychology: 0.9704 (262/270)
  • high_school_macroeconomics: 0.9086 (179/197)
  • prehistory: 0.9302 (160/172)
  • moral_disputes: 0.8218 (143/174)
  • elementary_mathematics: 0.9565 (176/184)
  • philosophy: 0.8239 (131/159)
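As a quick sanity check, the headline 3.79-point gap follows directly from the raw counts above. A minimal Python sketch (all numbers copied from the two result blocks; the subject dict holds only the two largest drops):

```python
# Reproduce the reported MMLU deltas from the raw (correct, total) counts.

def accuracy(correct: int, total: int) -> float:
    return correct / total

base = accuracy(6096, 7021)   # gemma-4-31b-it
tuned = accuracy(5830, 7021)  # DECKARD-HERETIC-UNCENSORED-Thinking

print(f"base:  {base:.4f}")                          # 0.8683
print(f"tuned: {tuned:.4f}")                         # 0.8304
print(f"delta: {(base - tuned) * 100:.2f} points")   # 3.79

# Per-subject deltas for the two biggest drops:
subjects = {
    "professional_law": ((598, 785), (532, 785)),
    "moral_scenarios":  ((370, 442), (288, 442)),
}
for name, ((bc, bt), (tc, tt)) in subjects.items():
    d = (accuracy(bc, bt) - accuracy(tc, tt)) * 100
    print(f"{name}: -{d:.2f} points")
```

Most of the overall loss is concentrated in professional_law (-8.41 points) and moral_scenarios (-18.55 points); several other subjects are roughly flat or slightly up.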

Thank you for the additional metrics.
The 6/7 refers to the core benchmarks (ARC-C, ARC-E, and the others listed in the benchmarks section).

ARC-C / ARC-E are critical problem-solving tests, which also probe instruction following and comprehension at the same time.
