Evaluation metrics on MalayMMLU
by Faris-Faiz - opened
I noticed that the model card doesn't include evaluation metrics for MalayMMLU. I've run the MalayMMLU evaluation myself and am sharing the results here:
Model: Malaysian-Qwen2.5-14B-Reasoning-SFT
Metric: first (first-token accuracy), by_letter=True
Shot: 0-shot

| Category | Questions | Accuracy (%) |
| --- | ---: | ---: |
| STEM | 2,443 | 76.8318 |
| Language | 6,288 | 77.5604 |
| Social science | 6,918 | 69.3698 |
| Others | 4,169 | 72.0317 |
| Humanities | 4,395 | 76.1775 |
| **Average** | **24,213** | **73.9437** |
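The reported average is a micro-average: total correct answers over total questions, which is the same as weighting each category's accuracy by its question count. A minimal sketch reproducing it from the numbers above (accuracies and counts taken directly from the evaluation output; nothing else assumed):

```python
# Per-category accuracy (%) and question counts from the run above.
results = {
    "STEM": (76.83176422431437, 2443),
    "Language": (77.56043256997455, 6288),
    "Social science": (69.36976004625615, 6918),
    "Others": (72.0316622691293, 4169),
    "Humanities": (76.17747440273037, 4395),
}

total = sum(n for _, n in results.values())                   # 24,213 questions
correct = sum(acc * n / 100 for acc, n in results.values())   # 17,904 correct
print(f"average accuracy {100 * correct / total}")            # 73.9437..., as reported
```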
These numbers are for the first-token accuracy mode. Quite respectable for a 14B-parameter model.
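For context, first-token accuracy scores a multiple-choice question by comparing the model's next-token logits for the option letters, rather than generating and parsing a full answer. A rough sketch with transformers is below; the model id is inferred from this thread, and the prompt template is an illustrative assumption, not MalayMMLU's official harness format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/Malaysian-Qwen2.5-14B-Reasoning-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def first_token_answer(question: str, options: list[str]) -> str:
    """Pick the option whose letter gets the highest next-token logit."""
    letters = ["A", "B", "C", "D", "E"][: len(options)]
    # Illustrative prompt; the official MalayMMLU harness defines its own template.
    prompt = (
        question
        + "\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\nJawapan:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the first generated position
    # Compare each option letter's first token id (" A", " B", ...); highest logit wins.
    letter_ids = [tokenizer.encode(f" {l}", add_special_tokens=False)[0] for l in letters]
    return letters[int(torch.stack([logits[i] for i in letter_ids]).argmax())]

# Example usage (hypothetical question, not from the benchmark):
# first_token_answer("Apakah ibu negara Malaysia?",
#                    ["Kuala Lumpur", "Putrajaya", "Johor Bahru", "Ipoh"])
```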
This model is only a warm-up for reasoning; the reasoning benchmark for the GRPO model is at https://huggingface.co/mesolitica/Malaysian-Qwen2.5-14B-Reasoning-GRPO#malaymmlu. But thanks for the benchmark!
huseinzol05 changed discussion status to closed