Evaluation metrics on MalayMMLU

#1
by Faris-Faiz - opened

I noticed that the model card doesn't include evaluation metrics for MalayMMLU. I've evaluated the model on MalayMMLU and am sharing the results here:

```
                                 Model   Accuracy  ... by_letter        category
0  Malaysian-Qwen2.5-14B-Reasoning-SFT  76.831764  ...      True            STEM
1  Malaysian-Qwen2.5-14B-Reasoning-SFT  77.560433  ...      True        Language
2  Malaysian-Qwen2.5-14B-Reasoning-SFT  69.369760  ...      True  Social science
3  Malaysian-Qwen2.5-14B-Reasoning-SFT  72.031662  ...      True          Others
4  Malaysian-Qwen2.5-14B-Reasoning-SFT  76.177474  ...      True      Humanities

[5 rows x 5 columns]

Question counts per category:
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}

Model  : Malaysian-Qwen2.5-14B-Reasoning-SFT
Metric : first
Shot   : 0shot
average accuracy            73.9437492256226
accuracy for STEM           76.83176422431437
accuracy for Language       77.56043256997455
accuracy for Social science 69.36976004625615
accuracy for Others         72.0316622691293
accuracy for Humanities     76.17747440273037
```

These results are for the first-token accuracy mode. Quite respectable for a 14B-parameter model.
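As a quick sanity check, the reported average is consistent with a micro-average, i.e. the per-category accuracies weighted by each category's question count. A minimal sketch (not the official MalayMMLU evaluation script) using the numbers above:

```python
# Per-category question counts and first-token accuracies (%) from the run above.
counts = {
    "STEM": 2443,
    "Language": 6288,
    "Social science": 6918,
    "Others": 4169,
    "Humanities": 4395,
}
accuracy = {
    "STEM": 76.83176422431437,
    "Language": 77.56043256997455,
    "Social science": 69.36976004625615,
    "Others": 72.0316622691293,
    "Humanities": 76.17747440273037,
}

# Micro-average: weight each category's accuracy by its question count.
total = sum(counts.values())
weighted_avg = sum(accuracy[c] * counts[c] for c in counts) / total
print(f"average accuracy {weighted_avg}")  # ~73.9437, matching the report
```

So the overall score is dominated by the larger categories (Social science and Language), not a plain mean of the five category scores.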

This SFT checkpoint is just a warm-up for reasoning; the reasoning benchmark is at https://huggingface.co/mesolitica/Malaysian-Qwen2.5-14B-Reasoning-GRPO#malaymmlu. But thanks for the benchmark!

huseinzol05 changed discussion status to closed
