license: apache-2.0
language:
  - multilingual
tags:
  - language-identification
  - lid
  - byte-level
  - corpus-curation
  - african-languages
library_name: pytorch
pipeline_tag: text-classification
metrics:
  - f1
  - accuracy

CommonLingua

CommonLingua is a 2.35-million-parameter language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and Common Corpus. As of 2026, CommonLingua is the best-performing model on the CommonLID benchmark, improving macro F1 by 11.5 points over the previous state of the art.

CommonLingua is based on a byte-level hybrid architecture that combines three Conv1D layers with one attention layer. It was originally designed for large-scale classification of pretraining data and was intentionally trained on diverse data sources, especially realistic documents with OCR errors, with a particular focus on the long tail: 61 African languages are supported, including languages with almost no coverage.

Since CommonLingua is trained exclusively on open data under free licenses, we release the full original dataset with detailed licensing information.

Architecture

CommonLingua uses an original architecture, optimized for task accuracy at an extremely small model size.

Main features:

No tokenizer. The model operates directly on raw UTF-8 bytes, padded to 512 bytes. This makes it inherently script-agnostic: Latin, Arabic, Ethiopic, N'Ko, Tifinagh, Devanagari, and CJK are all handled by the same byte stream.
Trigram hash embedding. A polynomial rolling hash of byte 3-grams indexes a 4096-bucket embedding table. Hash collisions act as regularisation. In our ablations, the added signal improved macro F1 by 1.2 points over the baseline without n-gram features.
Causal Conv1D × 3. Captures local byte patterns (script ranges, common digraphs, morpheme boundaries).
Bidirectional attention × 1 with RoPE. Captures global structure across the paragraph.
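The input side of the feature list above can be sketched in a few lines of plain Python. This is an illustration only, not the released implementation: the 512-byte length and the 4096 buckets come from this card, while the hash base (`HASH_BASE`), the padding value (`PAD_ID`), and the function names are assumptions made for the example. The hash is computed directly for each window rather than incrementally, which gives the same value for a fixed window of three bytes.

```python
from typing import List

MAX_BYTES = 512      # input length stated in this model card
NUM_BUCKETS = 4096   # embedding-table size stated in this model card
HASH_BASE = 257      # assumption: any base above 255 suits a byte-level polynomial hash
PAD_ID = 0           # assumption: padding value (note it coincides with the NUL byte)


def encode_bytes(text: str, max_bytes: int = MAX_BYTES) -> List[int]:
    """UTF-8 encode the text, truncate to max_bytes, and right-pad with PAD_ID."""
    raw = list(text.encode("utf-8"))[:max_bytes]
    return raw + [PAD_ID] * (max_bytes - len(raw))


def trigram_bucket_ids(byte_ids: List[int], num_buckets: int = NUM_BUCKETS) -> List[int]:
    """Return, for each position, the bucket of the byte 3-gram ending there.

    Two PAD_ID values are prepended so the output has the same length as the input.
    """
    padded = [PAD_ID, PAD_ID] + byte_ids
    ids = []
    for i in range(len(byte_ids)):
        b0, b1, b2 = padded[i], padded[i + 1], padded[i + 2]
        # Polynomial hash of the 3-gram, folded into the bucket range.
        ids.append((b0 * HASH_BASE * HASH_BASE + b1 * HASH_BASE + b2) % num_buckets)
    return ids


byte_ids = encode_bytes("héllo")
bucket_ids = trigram_bucket_ids(byte_ids)
print(byte_ids[:6])   # the two bytes 195, 169 are the UTF-8 encoding of "é"
print(len(bucket_ids))
```

Because many distinct 3-grams share a bucket, the embedding table stays small regardless of script; in a model of this kind the bucket ids would index a second embedding that is combined with the per-byte embedding.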

Evaluation

We evaluated CommonLingua on CommonLID (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines are re-evaluated through the same pipeline (iso639-lang normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.

Model	Params	Labels	Strict acc	Equiv acc	Macro F1
OpenLID v2	~600 M	200	55.77 %	70.19 %	0.6390
fastText-218 (NLLB)	~600 M	218	59.53 %	71.64 %	0.6590
GlotLID v3	~600 M	2 102	57.69 %	71.26 %	0.6729
CommonLingua v7.2.1	2.35 M	334	77.63 %	82.92 %	0.7879

CommonLingua improves macro F1 by 11.5 points over the next-best baseline (0.7879 vs. 0.6729 for GlotLID v3). We excluded Lingala from the evaluation because most CommonLID samples labelled as Lingala turned out to belong to other closely related languages.
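The three metrics in the table above can be illustrated with a small sketch. It is not the evaluation pipeline used for this card: the equivalence classes in `EQUIV` are hypothetical examples, and computing macro F1 on uncollapsed labels and averaging over the labels present in the gold set are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical equivalence classes, for illustration only.
EQUIV: Dict[str, str] = {"hrv": "hbs", "srp": "hbs", "bos": "hbs"}


def collapse(label: str) -> str:
    """Map a label to its equivalence class, or leave it unchanged."""
    return EQUIV.get(label, label)


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1, averaged over labels present in gold."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = []
    for label in sorted(set(gold)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["fra", "hrv", "srp", "swa"]
pred = ["fra", "srp", "srp", "lin"]

strict_acc = accuracy(gold, pred)
equiv_acc = accuracy([collapse(g) for g in gold], [collapse(p) for p in pred])
print(strict_acc, equiv_acc, round(macro_f1(gold, pred), 4))
```

In this toy example the `hrv`/`srp` confusion counts as an error under strict accuracy but not under equivalence accuracy, which is why the two columns can differ by several points.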

Throughput

Throughput is reported in texts/sec, where one text is one paragraph of at most 512 bytes, padded.

Device	Batch size	Setting	fp32 (texts/sec)	bf16 (texts/sec)	bf16 vs fp32
H100 80GB	4096	best	10,962	26,236	2.4×
H100 80GB	1024		10,892	26,130	2.4×
H100 80GB	256		10,677	25,241	2.4×
H100 80GB	64	low-latency	10,025	22,625	2.3×
Sapphire Rapids CPU (8 threads)	32		183	553	3.0×
Sapphire Rapids CPU (1 thread)	32		44	114	2.6×
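For planning a curation run, the figures above translate into wall-clock estimates with simple arithmetic. The sketch below assumes the measured throughput is sustained for the whole job and ignores I/O, decompression, and data loading; the one-billion-paragraph corpus size is a hypothetical example, not a figure from this card.

```python
def hours_to_process(num_texts: int, texts_per_sec: float) -> float:
    """Estimated wall-clock hours at a constant throughput."""
    return num_texts / texts_per_sec / 3600


# Throughput figures taken from the table above (texts/sec).
H100_FP32 = 10_962        # H100 80GB, bs=4096, fp32
H100_BF16 = 26_236        # H100 80GB, bs=4096, bf16
CPU_BF16_8_THREADS = 553  # Sapphire Rapids, 8 threads, bf16

CORPUS_SIZE = 1_000_000_000  # hypothetical corpus of one billion paragraphs

gpu_hours = hours_to_process(CORPUS_SIZE, H100_BF16)
cpu_hours = hours_to_process(CORPUS_SIZE, CPU_BF16_8_THREADS)
speedup = H100_BF16 / H100_FP32

print(f"H100 bf16: {gpu_hours:.1f} h")
print(f"CPU bf16, 8 threads: {cpu_hours:.1f} h")
print(f"bf16 vs fp32 on H100: {speedup:.1f}x")
```

Under these assumptions a single H100 in bf16 handles the hypothetical corpus in roughly half a day, while one 8-thread CPU worker would need about three weeks, so CPU runs are better suited to sharded, parallel jobs.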

Quick start

pip install "git+https://github.com/PleIAs/bytehybrid-lid#egg=commonlingua[hub]"

from commonlingua import LID

lid = LID.from_pretrained("PleIAs/CommonLingua")          # auto-downloads
# Use the bf16 build for a roughly 2.3-2.4× speedup on GPU at no measurable quality cost:
# lid = LID.from_pretrained("PleIAs/CommonLingua", dtype="bf16")

text = (
    "Wikipédia est une encyclopédie universelle, multilingue, créée par Jimmy "
    "Wales et Larry Sanger le 15 janvier 2001 et fonctionnant selon le principe "
    "du wiki."
)
r = lid.predict(text)
print(r.lang, r.confidence)   # fra 0.99

The intended workload is paragraph-level corpus curation. For batch annotation of large parquet shards, see predict_parquet in the package README.
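As an illustration of the curation workload, the sketch below filters paragraphs by predicted language and confidence. It relies only on the `predict` call and the `lang` and `confidence` fields shown in the quick start; the helper name `keep_paragraphs`, the `Identifier` and `Prediction` protocols, and the 0.5 default threshold are hypothetical and not part of the package.

```python
from typing import Iterable, Iterator, Protocol, Set, Tuple


class Prediction(Protocol):
    lang: str
    confidence: float


class Identifier(Protocol):
    def predict(self, text: str) -> Prediction: ...


def keep_paragraphs(
    lid: Identifier,
    paragraphs: Iterable[str],
    keep_langs: Set[str],
    min_confidence: float = 0.5,  # hypothetical threshold; tune on your own data
) -> Iterator[Tuple[str, str, float]]:
    """Yield (text, lang, confidence) for paragraphs in a wanted language."""
    for text in paragraphs:
        result = lid.predict(text)
        if result.lang in keep_langs and result.confidence >= min_confidence:
            yield text, result.lang, result.confidence
```

With the object from the quick start, `list(keep_paragraphs(lid, paragraphs, {"fra"}))` would keep only the paragraphs identified as French. This loop calls the model one paragraph at a time, so it is meant for small jobs; for large shards, the batched `predict_parquet` path mentioned above is the better fit.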

Citation

@misc{commonlingua,
  author = {{PleIAs}},
  title  = {CommonLingua: Byte-level Language Identification for 334 Languages},
  year   = {2026},
  url    = {https://huggingface.co/PleIAs/CommonLingua}
}