OpenLID-v3 Benchmark
We benchmarked OpenLID-v3 alongside 16 other LID models in this leaderboard:
https://huggingface.co/spaces/omneity-labs/lid-benchmark
I hope you find it useful and informative; feedback is welcome!
Hi, thank you for your contribution!
There are many factors influencing language identification evaluation: the intersection of a model's label set (which languages it supports) with that of a particular benchmark, whether a benchmark's data was used for training or development, and so on.
My understanding is that your Gherbal model focuses on Arabic and African languages/language varieties. This is interesting and potentially useful for the community, because we focused mostly on the languages of Europe and did not even train OpenLID-v3 to distinguish between Arabic varieties (it only outputs the "ara" macrolanguage).
As for feedback, it might be useful to look into confusion matrices: currently it is not clear whether, for example, an accuracy of 0.65 on MADAR is that low because of one particularly difficult Arabic dialect, or because the dialects are all confused with each other more or less at random (which was the case for OpenLID-v2; that is why we "merged" them all into one macrolanguage in version 3).
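To illustrate what I mean (with made-up labels, not actual MADAR results), one can tabulate gold/predicted pairs and check whether the off-diagonal mass concentrates in one row (one hard dialect) or spreads evenly (near-random confusion):

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold, predicted) label pairs -- a sparse confusion matrix."""
    return Counter(zip(gold, pred))

# Purely illustrative dialect labels, not real predictions.
gold = ["arz", "arz", "ary", "ary", "apc", "apc"]
pred = ["arz", "ary", "ary", "apc", "arz", "apc"]

cm = confusion_counts(gold, pred)
for (g, p), n in sorted(cm.items()):
    print(f"{g} -> {p}: {n}")
```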
By the way, f1_weighted plots are only available for the average across benchmarks; for each individual benchmark they output "error".
Hi @MariaFjodorowa,
Thank you for your reply and your thorough feedback! Please allow me to address your points one by one:
> My understanding is that your Gherbal model focuses on Arabic and African languages/language varieties. This is interesting and potentially useful for the community, because we focused mostly on the languages of Europe and did not even train OpenLID-v3 to distinguish between Arabic varieties (it only outputs the "ara" macrolanguage).
The benchmark is not specific to Arabic or African languages. The evaluation datasets cover 449 languages in total, including all European languages with more than 1M speakers (Basque, Silesian, Galician, and Faroese, as well as the major European languages).
Some of the datasets used are global in scope:
- https://huggingface.co/datasets/commoncrawl/commonlid
- https://huggingface.co/datasets/facebook/bouquet
- https://huggingface.co/datasets/openlanguagedata/flores_plus
You can look at the "Europe" leaderboard by navigating to the "Regional" tab and then selecting "Europe".
The full list of languages, the methodology, and the scopes are documented in the "About" tab of the leaderboard.
> As for feedback, it might be useful to look into confusion matrices: currently it is not clear whether, for example, an accuracy of 0.65 on MADAR is that low because of one particularly difficult Arabic dialect, or because the dialects are all confused with each other more or less at random (which was the case for OpenLID-v2; that is why we "merged" them all into one macrolanguage in version 3).
You can find the confusion matrices in the "Confusions" tab of the leaderboard. I have also released the full predictions from the leaderboard to enable exactly this kind of analysis:
https://huggingface.co/datasets/omneity-labs/lid-benchmark
> By the way, f1_weighted plots are only available for the average across benchmarks; for each individual benchmark they output "error".
This is a bug, thanks for catching it! I will fix it in the next update.
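For reference, the per-benchmark plots are meant to show the standard support-weighted F1 (per-class F1 averaged by class frequency). A minimal self-contained sketch of that definition (not the leaderboard's actual code):

```python
from collections import Counter

def f1_weighted(gold, pred):
    """Per-class F1, averaged with weights equal to each class's support in gold."""
    labels = set(gold) | set(pred)
    support = Counter(gold)
    total = 0.0
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if p == lab and g != lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += support[lab] * f1
    return total / len(gold)

score = f1_weighted(["a", "a", "b"], ["a", "b", "b"])
```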
Please let me know if this addresses your methodology concerns or if anything is missing from my response.
