PleIAs/CommonLingua

@@ -17,43 +17,13 @@ metrics:
 # CommonLingua
-**Byte-level language identification for 334 languages — 2.35 M parameters, 9 MB on disk, runs on CPU.**
-CommonLingua sorts raw web, PDF, and digitised text into 334 ISO 639-3 language buckets so it can feed downstream training pipelines. It was built and trained at [PleIAs](https://pleias.fr) for curating [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), with a particular focus on the long tail — **61 African languages** are supported, including languages with no fastText / OpenLID coverage.
-|             |       |
-|---|---|
-| **Languages**           | 334 (61 African) |
-| **Parameters**          | 2,347,854 |
-| **Disk (fp32 / bf16)**  | 9.4 MB / 4.7 MB |
-| **Max input**           | 512 bytes (~paragraph) |
-| **CommonLID macro F1**  | **0.7879** |
-| **CommonLID strict acc**| 77.63% |
-| **License**             | Apache-2.0 |
-## Quick start
-```bash
-pip install "git+https://github.com/PleIAs/bytehybrid-lid#egg=commonlingua[hub]"
-```
-```python
-from commonlingua import LID
-lid = LID.from_pretrained("PleIAs/CommonLingua")          # auto-downloads
-# Use the bf16 build for 2× speedup on GPU at no measurable quality cost:
-# lid = LID.from_pretrained("PleIAs/CommonLingua", dtype="bf16")
-text = (
-    "Wikipédia est une encyclopédie universelle, multilingue, créée par Jimmy "
-    "Wales et Larry Sanger le 15 janvier 2001 et fonctionnant selon le principe "
-    "du wiki."
-)
-r = lid.predict(text)
-print(r.lang, r.confidence)   # fra 0.99
-```
-The intended workload is **paragraph-level corpus curation**. For batch annotation of large parquet shards, see `predict_parquet` in the package README.
 ## Architecture
@@ -72,40 +42,16 @@ raw bytes → [trigram hash embed (4096 × 64)]
 ## Evaluation
-Evaluated on **CommonLID** (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines re-evaluated through the same pipeline (`iso639-lang` normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.
-### Headline
 | Model                | Params | Labels | Strict acc | Equiv acc | Macro F1   |
 |----------------------|-------:|-------:|----------:|----------:|-----------:|
 | OpenLID v2           | ~600 M |   200  | 55.77 %   | 70.19 %   | 0.6390 |
 | fastText-218 (NLLB)  | ~600 M |   218  | 59.53 %   | 71.64 %   | 0.6590 |
-| GlotLID v3           | ~600 M | 2 102  | 57.69 %   | 71.26 %   | 0.6729 |
-| **CommonLingua v7.2.1** | **2.35 M** | **334** | **77.63 %** | **82.92 %** | **0.7879** |
-CommonLingua reaches **+11.5 macro F1** over the next best baseline with **~250× fewer parameters**. The full per-language F1 breakdown ships in `eval_per_language.json`.
-### African subset
-CommonLID's African subset (17 languages with ≥ 100 gold samples — the regime where OpenLID/GlotLID/fastText reportedly underperform):
-| Model | African macro F1 |
-|---|---:|
-| OpenLID v2 | 0.5xx |
-| GlotLID v3 | 0.725 |
-| **CommonLingua v7.2.1** | **0.7222** |
-Notably, CommonLingua reaches **F1 = 0.975** on Amharic — a language Lingua does not support.
-### fp32 vs bf16
-The bf16 build is half the disk size and ~2.4× faster on H100, with **no measurable quality drop**: 0 of the 72 evaluated languages drift by more than 0.01 F1.
-| Build | Disk | Strict acc | Equiv acc | Macro F1 | African F1 | Lingua F1 |
-|---|---:|---:|---:|---:|---:|---:|
-| **fp32** (default) | 9.4 MB | 0.7763 | 0.8292 | **0.7879** | 0.7222 | 0.8806 |
-| **bf16**           | 4.7 MB | 0.7763 | 0.8292 | **0.7879** | 0.7221 | 0.8804 |
-| Δ                  | −50 %  | 0 | 0 | 0 | −0.0002 | −0.0003 |
 ### Throughput
@@ -120,33 +66,30 @@ Texts/sec (one paragraph = one text, ≤ 512 bytes input, padded). Real CommonLi
 | Sapphire Rapids CPU (8 threads) | bs=32      | _PENDING_ | _PENDING_ | _PENDING_ |
 | Sapphire Rapids CPU (1 thread)  | bs=32      | _PENDING_ | _PENDING_ | _PENDING_ |
-The press release that previously circulated cited "20 t/s on CPU, 3 000 t/s on GPU" — the actual GPU figure is **~9× higher** in fp32 and **~22× higher** in bf16. The bf16 build is recommended whenever the host supports it (essentially: anything Ampere or newer).
-## Training data
-Trained on **2,482,568 paragraphs across 334 languages**, drawn entirely from open-licensed and public-domain sources. Wikipedia provides the bulk (~93 %); the long tail is filled by Pralekha (Indic), VOA Africa, Cultural Heritage, OpenAlex (Indo-Malay journal data + African academic), Common Corpus adversarial pulls, Perseus / OpenPecha / eBible / Sefaria / Ben-Yehuda / Krike-Krake (ancient and minority-script corpora).
-Per-source contributions, license attribution, and full schema are documented in [PleIAs/CommonLingua-Train](https://huggingface.co/datasets/PleIAs/CommonLingua-Train).
-## Known limitations
-1. **Arabic dialect cluster**: Modern Standard (`arb`) is robust (F1 ≈ 0.95), but Moroccan (`ary`, F1 ≈ 0.47) and Egyptian (`arz`, F1 ≈ 0.25) Arabic are structurally hard: the dialects share large stretches of vocabulary with MSA and with each other. This is not a data-volume problem; it needs a targeted corpus.
-2. **Indonesian / Malay (`ind` / `msa`)**: ~48 % `msa` error rate. Adding 20 k journal-provenance rows for each gave only marginal improvement; this pair will likely need supervised disambiguation features beyond byte-level signal.
-3. **Estonian (`est`) attractor**: accumulates ~750 false positives from unrelated languages. This is a model quirk still to be investigated; in practice a confidence threshold of 0.7 removes most spurious `est` predictions.
-4. **Lingala (`lin`)**: CommonLID's gold has the labels for `lin` reversed with Tiv/Yoruba (the paper acknowledged a related labelling issue). Our F1 = 0 on this class is a benchmark artefact, not a model failure. Real-world `lin` predictions are correct on internal held-out data.
-5. **Short text (<50 chars)**: confidence drops sharply. The model is **not** intended for short-query LID; use CLD3 or a query-tuned model for that regime.
-## Files
-| File | Description |
-|---|---|
-| `model.pt`        | fp32 checkpoint (9.4 MB) |
-| `model.bf16.pt`   | bf16 checkpoint (4.7 MB) |
-| `model.py`        | ByteHybrid v2 architecture |
-| `predict.py`      | Standalone CLI (no `commonlingua` package required) |
-| `config.json`     | model config |
-| `lang2idx.json`   | 334-class label map |
-| `eval_per_language.json` | full per-language F1 on CommonLID |
 ## Citation

 # CommonLingua
+Byte-level language identification model for 334 languages: 2.35 M parameters, 4.7 MB (bf16) to 9.4 MB (fp32) on disk, runs on CPU.
+CommonLingua sorts raw web, PDF, and digitised text into 334 ISO 639-3 language buckets so it can feed downstream training pipelines. It was built and trained at [PleIAs](https://pleias.fr) for curating [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), with a particular focus on the long tail: 61 African languages are supported, including languages that existing language identification tools barely cover.
+CommonLingua is trained exclusively on open data under free licenses. We release the original training dataset of 2,482,568 paragraphs across 334 languages, drawn from Structured Wikipedia and other Common Corpus subsets.
+Per-source contributions, license attribution, and full schema are documented in [PleIAs/CommonLingua-Train](https://huggingface.co/datasets/PleIAs/CommonLingua-Train).
 ## Architecture
 ## Evaluation
+Evaluated on **CommonLID** (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines are re-evaluated through the same pipeline (`iso639-lang` normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.
 | Model                | Params | Labels | Strict acc | Equiv acc | Macro F1   |
 |----------------------|-------:|-------:|----------:|----------:|-----------:|
 | OpenLID v2           | ~600 M |   200  | 55.77 %   | 70.19 %   | 0.6390 |
 | fastText-218 (NLLB)  | ~600 M |   218  | 59.53 %   | 71.64 %   | 0.6590 |
+| GlotLID v3           | ~600 M | **2 102**  | 57.69 %   | 71.26 %   | 0.6729 |
+| **CommonLingua v7.2.1** | 2.35 M | 334 | **77.63 %** | **82.92 %** | **0.7879** |
+CommonLingua gains 11.5 macro F1 points (0.7879 vs 0.6729) over the next best baseline, GlotLID v3.
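
The headline gain can be checked directly from the table. The snippet below is only an arithmetic sketch: the scores are copied from the rows above, and the variable names are illustrative.

```python
# Macro F1 values copied from the evaluation table above.
macro_f1 = {
    "OpenLID v2": 0.6390,
    "fastText-218 (NLLB)": 0.6590,
    "GlotLID v3": 0.6729,
    "CommonLingua v7.2.1": 0.7879,
}

ours = "CommonLingua v7.2.1"

# Pick the strongest baseline by macro F1, excluding CommonLingua itself.
baselines = {name: f1 for name, f1 in macro_f1.items() if name != ours}
best_name = max(baselines, key=baselines.get)

# Express the gap in F1 points (1 point = 0.01 macro F1).
gain_points = round((macro_f1[ours] - baselines[best_name]) * 100, 1)

print(best_name, gain_points)  # GlotLID v3 11.5
```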
 ### Throughput
 | Sapphire Rapids CPU (8 threads) | bs=32      | _PENDING_ | _PENDING_ | _PENDING_ |
 | Sapphire Rapids CPU (1 thread)  | bs=32      | _PENDING_ | _PENDING_ | _PENDING_ |
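
The throughput figures are quoted for inputs of at most 512 bytes, padded. How the package itself truncates and pads is not documented here, so the following is only a sketch of one way to bring a paragraph to that fixed byte length without splitting a multi-byte UTF-8 character. The helper name `to_fixed_bytes` and the zero pad byte are assumptions, not part of the `commonlingua` API.

```python
MAX_BYTES = 512  # input limit quoted in the model card


def to_fixed_bytes(text: str, max_bytes: int = MAX_BYTES, pad: int = 0) -> bytes:
    """Truncate to max_bytes on a UTF-8 boundary, then right-pad.

    Hypothetical helper: the pad value (0) is an assumption.
    """
    raw = text.encode("utf-8")[:max_bytes]
    # A hard byte cut can leave a partial multi-byte character at the end;
    # decoding with errors="ignore" drops that incomplete tail.
    raw = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return raw + bytes([pad]) * (max_bytes - len(raw))


# "a" followed by 300 two-byte characters is 601 bytes; the cut at 512
# would land in the middle of a character, so one byte is dropped and padded.
sample = to_fixed_bytes("a" + "é" * 300)
print(len(sample))
```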
+## Quick start
+```bash
+pip install "git+https://github.com/PleIAs/bytehybrid-lid#egg=commonlingua[hub]"
+```
+```python
+from commonlingua import LID
+lid = LID.from_pretrained("PleIAs/CommonLingua")          # auto-downloads
+# Use the bf16 build for 2× speedup on GPU at no measurable quality cost:
+# lid = LID.from_pretrained("PleIAs/CommonLingua", dtype="bf16")
+text = (
+    "Wikipédia est une encyclopédie universelle, multilingue, créée par Jimmy "
+    "Wales et Larry Sanger le 15 janvier 2001 et fonctionnant selon le principe "
+    "du wiki."
+)
+r = lid.predict(text)
+print(r.lang, r.confidence)   # fra 0.99
+```
+The intended workload is **paragraph-level corpus curation**. For batch annotation of large parquet shards, see `predict_parquet` in the package README.
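
For corpus curation it is common to keep only paragraphs whose prediction clears a confidence threshold. The sketch below assumes nothing about the package beyond the `lang` and `confidence` fields shown in the quick start; `Prediction`, `keep_confident`, `fake_predict`, and the 0.7 default are hypothetical names and values chosen for illustration, and the stub predictor exists only so the example runs without downloading the model.

```python
from typing import Callable, Iterable, Iterator, NamedTuple, Tuple


class Prediction(NamedTuple):
    """Stand-in for the result object; only .lang and .confidence are used."""
    lang: str
    confidence: float


def keep_confident(
    texts: Iterable[str],
    predict: Callable[[str], Prediction],
    min_confidence: float = 0.7,  # illustrative threshold, tune per corpus
) -> Iterator[Tuple[str, str]]:
    """Yield (text, lang) pairs whose confidence is at least min_confidence."""
    for text in texts:
        result = predict(text)
        if result.confidence >= min_confidence:
            yield text, result.lang


def fake_predict(text: str) -> Prediction:
    # Hypothetical stub: confident on longer text, unsure on very short text.
    if len(text) > 20:
        return Prediction("fra", 0.99)
    return Prediction("est", 0.41)


kept = list(
    keep_confident(["Wikipédia est une encyclopédie universelle.", "ok"], fake_predict)
)
print(kept)
```

With the real package, `lid.predict` could be passed in place of `fake_predict`, assuming its result exposes the same two attributes as in the quick start.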
 ## Citation