Update README.md
README.md
CHANGED
@@ -17,9 +17,7 @@ metrics:
 
 # CommonLingua
 
-CommonLingua is a 2.35 million-parameters language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) trained by Pleias in partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative.
-
-As of 2026, CommonLingua is the best performing model on the CommonLID benchmark with significant gains over the previous SOTA.
+CommonLingua is a 2.35 million-parameters language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) trained by Pleias in partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative. As of 2026, CommonLingua is the best performing model on the CommonLID benchmark with significant gains over the previous baseline.
 
 CommonLingua is based on a byte-level hybrid architecture combining three conv1D layers with an attention layer. It was originally designed for large scale classification of pretraining data and intently trained on diverse data sources, especially realistic documents with OCR errors as well as a a particular focus on the long tail — 61 African languages are supported, including languages with almost no coverage.
 
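
The README text describes the model as byte-level. As a rough illustration only, and not the CommonLingua preprocessing code, the sketch below shows what byte-level input can look like: text is mapped to its UTF-8 byte values, so no language-specific tokenizer is needed. The names `encode_bytes`, `PAD_ID`, and the `max_len` default are hypothetical and chosen for this example.

```python
# Minimal sketch of byte-level input encoding (assumed, not taken from the model).
PAD_ID = 256  # hypothetical padding id, one past the largest byte value (255)


def encode_bytes(text: str, max_len: int = 32) -> list[int]:
    """Return exactly max_len ids: UTF-8 byte values, truncated or padded with PAD_ID."""
    # Truncation happens at the byte level, so it may cut a multi-byte character.
    ids = list(text.encode("utf-8"))[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    print(encode_bytes("Habari", max_len=8))  # [72, 97, 98, 97, 114, 105, 256, 256]
```

Because every script reduces to the same 256 byte values, unusual characters or OCR noise still produce valid input ids; how the actual model pads or truncates its inputs is not stated in the README.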