Evaluation metrics on MalayMMLU

#1
by Faris-Faiz - opened

I noticed that the model card doesn't include evaluation metrics for MalayMMLU. I've evaluated the model on MalayMMLU and am sharing the results here:

```
                                 Model   Accuracy  ... by_letter        category
0  Malaysian-Qwen2.5-14B-Reasoning-SFT  76.831764  ...      True            STEM
1  Malaysian-Qwen2.5-14B-Reasoning-SFT  77.560433  ...      True        Language
2  Malaysian-Qwen2.5-14B-Reasoning-SFT  69.369760  ...      True  Social science
3  Malaysian-Qwen2.5-14B-Reasoning-SFT  72.031662  ...      True          Others
4  Malaysian-Qwen2.5-14B-Reasoning-SFT  76.177474  ...      True      Humanities

[5 rows x 5 columns]

Question counts per category:
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}

Model  : Malaysian-Qwen2.5-14B-Reasoning-SFT
Metric : first
Shot   : 0shot
average accuracy            73.9437492256226
accuracy for STEM           76.83176422431437
accuracy for Language       77.56043256997455
accuracy for Social science 69.36976004625615
accuracy for Others         72.0316622691293
accuracy for Humanities     76.17747440273037
```

These results are for the first-token accuracy mode. Quite respectable for a 14B-parameter model.
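As a quick sanity check, the reported average is consistent with a micro-average, i.e. the per-category accuracies weighted by each category's question count. A minimal sketch (not the official MalayMMLU evaluation script) using the numbers above:

```python
# Per-category question counts and first-token accuracies (%) from the run above.
counts = {
    "STEM": 2443,
    "Language": 6288,
    "Social science": 6918,
    "Others": 4169,
    "Humanities": 4395,
}
accuracy = {
    "STEM": 76.83176422431437,
    "Language": 77.56043256997455,
    "Social science": 69.36976004625615,
    "Others": 72.0316622691293,
    "Humanities": 76.17747440273037,
}

# Micro-average: weight each category's accuracy by its question count.
total = sum(counts.values())
weighted_avg = sum(accuracy[c] * counts[c] for c in counts) / total
print(f"average accuracy {weighted_avg}")  # ~73.9437, matching the report
```

So the overall score is dominated by the larger categories (Social science and Language), not a plain mean of the five category scores.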

This SFT checkpoint is just a warm-up for reasoning; the reasoning benchmark is at https://huggingface.co/mesolitica/Malaysian-Qwen2.5-14B-Reasoning-GRPO#malaymmlu. But thanks for the benchmark!

huseinzol05 changed discussion status to closed
