PleIAs/CommonLingua

Text Classification

language-identification

corpus-curation

african-languages


Pclanglais committed 26 days ago

Commit 3b2902b · verified · 1 Parent(s): 16f5348

Update README.md

Files changed (1)

README.md +1 -3

@@ -17,9 +17,7 @@ metrics:
 # CommonLingua
-CommonLingua is a 2.35 million-parameters language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) trained by Pleias in partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative.
-As of 2026, CommonLingua is the best performing model on the CommonLID benchmark with significant gains over the previous SOTA.
+CommonLingua is a 2.35 million-parameters language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus) trained by Pleias in partnership with the GSMA's "AI Language Models in Africa, by Africa, for Africa" initiative. As of 2026, CommonLingua is the best performing model on the CommonLID benchmark with significant gains over the previous baseline.
 CommonLingua is based on a byte-level hybrid architecture combining three conv1D layers with an attention layer. It was originally designed for large scale classification of pretraining data and intently trained on diverse data sources, especially realistic documents with OCR errors as well as a a particular focus on the long tail — 61 African languages are supported, including languages with almost no coverage.