Loss of 3.79% on MMLU Test from 7021 Samples
The model card claims the model is "Smarter [beats the base model on 6 of 7 benchmarks]" and uncensored too.
But upon testing, the model shows a 3.79% accuracy loss on an MMLU test of 7,021 samples (the reported loss would actually be bigger if tested on the full 14,042 samples).
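For reference, the 3.79% figure is simply the difference between the two overall accuracies reported below:

```python
# The reported loss is the accuracy gap between the two runs on the
# same 7021 questions (correct counts taken from the results below).
base_acc = 6096 / 7021   # original gemma-4-31b-it
tuned_acc = 5830 / 7021  # the fine-tuned model
drop = base_acc - tuned_acc
print(f"{drop:.2%}")  # -> 3.79%
```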
Here are the results:
MMLU Test Results:
Model: gemma-4-31b-it (Original model)
============================================================
Total questions: 7021
Correct: 6096
Accuracy: 0.8683 (86.83%)
Parse failures: 32
============================================================
Top subjects:
- professional_law: 0.7618 (598/785)
- moral_scenarios: 0.8371 (370/442)
- miscellaneous: 0.9269 (355/383)
- professional_psychology: 0.8956 (283/316)
- high_school_psychology: 0.9667 (261/270)
- high_school_macroeconomics: 0.9289 (183/197)
- prehistory: 0.9419 (162/172)
- moral_disputes: 0.8218 (143/174)
- elementary_mathematics: 0.9402 (173/184)
- philosophy: 0.8553 (136/159)
Model: gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking
============================================================
Total questions: 7021
Correct: 5830
Accuracy: 0.8304 (83.04%)
Parse failures: 25
============================================================
Top subjects:
- professional_law: 0.6777 (532/785)
- moral_scenarios: 0.6516 (288/442)
- miscellaneous: 0.9243 (354/383)
- professional_psychology: 0.8734 (276/316)
- high_school_psychology: 0.9704 (262/270)
- high_school_macroeconomics: 0.9086 (179/197)
- prehistory: 0.9302 (160/172)
- moral_disputes: 0.8218 (143/174)
- elementary_mathematics: 0.9565 (176/184)
- philosophy: 0.8239 (131/159)
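The summaries above can be reproduced with a simple aggregation over per-question results. This is a minimal sketch, not the actual harness used; it assumes each graded question yields a `(subject, is_correct, parsed)` tuple, with unparseable answers scored as wrong (which matches the "Parse failures" line in the output):

```python
from collections import defaultdict

def summarize(records):
    """Aggregate (subject, is_correct, parsed) tuples into the summary
    printed above: total, correct, accuracy, parse failures, and
    per-subject accuracy."""
    total = correct = parse_failures = 0
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, asked]
    for subject, is_correct, parsed in records:
        total += 1
        per_subject[subject][1] += 1
        if not parsed:
            parse_failures += 1  # unparseable answer, counted as wrong
            continue
        if is_correct:
            correct += 1
            per_subject[subject][0] += 1
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,
        "parse_failures": parse_failures,
        "per_subject": {s: c / n for s, (c, n) in per_subject.items()},
    }

# Toy usage with hypothetical records:
summary = summarize([
    ("professional_law", True, True),
    ("professional_law", False, True),
    ("philosophy", True, True),
    ("philosophy", False, False),  # parse failure
])
print(summary["accuracy"], summary["parse_failures"])
```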
Thank you for the additional metrics.
The 6/7 refers to the core benchmarks (ARC-C, ARC-E, etc.) listed in the benchmarks section.
ARC-C and ARC-E are critical problem-solving tests, which also test instruction following and understanding at the same time.