Loss of 3.79% on MMLU Test from 7021 Samples

#4
by llmfan46 - opened

The model card says the model is smarter (it beats the base model on 6 of 7 benchmarks) and uncensored too.

But upon testing, the model shows a loss of 3.79 points on an MMLU test of 7021 samples (the loss could be even larger if tested on the full 14042 samples).

Here are the results:

MMLU Test Results:

Model: gemma-4-31b-it (Original model)

============================================================

  • Total questions: 7021

  • Correct: 6096

  • Accuracy: 0.8683 (86.83%)

  • Parse failures: 32

============================================================

Results by subject (largest first):

  • professional_law: 0.7618 (598/785)
  • moral_scenarios: 0.8371 (370/442)
  • miscellaneous: 0.9269 (355/383)
  • professional_psychology: 0.8956 (283/316)
  • high_school_psychology: 0.9667 (261/270)
  • high_school_macroeconomics: 0.9289 (183/197)
  • prehistory: 0.9419 (162/172)
  • moral_disputes: 0.8218 (143/174)
  • elementary_mathematics: 0.9402 (173/184)
  • philosophy: 0.8553 (136/159)

Model: gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking

============================================================

  • Total questions: 7021

  • Correct: 5830

  • Accuracy: 0.8304 (83.04%)

  • Parse failures: 25

============================================================

Results by subject (largest first):

  • professional_law: 0.6777 (532/785)
  • moral_scenarios: 0.6516 (288/442)
  • miscellaneous: 0.9243 (354/383)
  • professional_psychology: 0.8734 (276/316)
  • high_school_psychology: 0.9704 (262/270)
  • high_school_macroeconomics: 0.9086 (179/197)
  • prehistory: 0.9302 (160/172)
  • moral_disputes: 0.8218 (143/174)
  • elementary_mathematics: 0.9565 (176/184)
  • philosophy: 0.8239 (131/159)
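As a quick sanity check, the headline 3.79-point gap follows directly from the raw counts above. A minimal Python sketch (all numbers copied from the two result blocks; the subject dict holds only the two largest drops):

```python
# Reproduce the reported MMLU deltas from the raw (correct, total) counts.

def accuracy(correct: int, total: int) -> float:
    return correct / total

base = accuracy(6096, 7021)   # gemma-4-31b-it
tuned = accuracy(5830, 7021)  # DECKARD-HERETIC-UNCENSORED-Thinking

print(f"base:  {base:.4f}")                          # 0.8683
print(f"tuned: {tuned:.4f}")                         # 0.8304
print(f"delta: {(base - tuned) * 100:.2f} points")   # 3.79

# Per-subject deltas for the two biggest drops:
subjects = {
    "professional_law": ((598, 785), (532, 785)),
    "moral_scenarios":  ((370, 442), (288, 442)),
}
for name, ((bc, bt), (tc, tt)) in subjects.items():
    d = (accuracy(bc, bt) - accuracy(tc, tt)) * 100
    print(f"{name}: -{d:.2f} points")
```

Most of the overall loss is concentrated in professional_law (-8.41 points) and moral_scenarios (-18.55 points); several other subjects are roughly flat or slightly up.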

Thank you for the additional metrics.
The 6/7 refers to the core benchmarks (ARC-C, ARC-E, and the others listed in the benchmarks section).

ARC-C / ARC-E are critical problem-solving tests, which also probe instruction following and comprehension at the same time.
