Pclanglais committed on
Commit
6da952f
·
verified ·
1 Parent(s): ff66049

Update README.md

Files changed (1)
  1. README.md +28 -85
README.md CHANGED
@@ -17,43 +17,13 @@ metrics:
17
 
18
  # CommonLingua
19
 
20
- **Byte-level language identification for 334 languages — 2.35 M parameters, 9 MB on disk, runs on CPU.**
21
 
22
- CommonLingua sorts raw web, PDF, and digitised text into 334 ISO 639-3 language buckets so it can feed downstream training pipelines. It was built and trained at [PleIAs](https://pleias.fr) for curating [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), with a particular focus on the long tail — **61 African languages** are supported, including languages with no fastText / OpenLID coverage.
23
 
24
- | | |
25
- |---|---|
26
- | **Languages** | 334 (61 African) |
27
- | **Parameters** | 2,347,854 |
28
- | **Disk (fp32 / bf16)** | 9.4 MB / 4.7 MB |
29
- | **Max input** | 512 bytes (~paragraph) |
30
- | **CommonLID macro F1** | **0.7879** |
31
- | **CommonLID strict acc**| 77.63% |
32
- | **License** | Apache-2.0 |
33
 
34
- ## Quick start
35
-
36
- ```bash
37
- pip install "git+https://github.com/PleIAs/bytehybrid-lid#egg=commonlingua[hub]"
38
- ```
39
-
40
- ```python
41
- from commonlingua import LID
42
-
43
- lid = LID.from_pretrained("PleIAs/CommonLingua") # auto-downloads
44
- # Use the bf16 build for 2× speedup on GPU at no measurable quality cost:
45
- # lid = LID.from_pretrained("PleIAs/CommonLingua", dtype="bf16")
46
-
47
- text = (
48
- "Wikipédia est une encyclopédie universelle, multilingue, créée par Jimmy "
49
- "Wales et Larry Sanger le 15 janvier 2001 et fonctionnant selon le principe "
50
- "du wiki."
51
- )
52
- r = lid.predict(text)
53
- print(r.lang, r.confidence) # fra 0.99
54
- ```
55
-
56
- The intended workload is **paragraph-level corpus curation**. For batch annotation of large parquet shards, see `predict_parquet` in the package README.
57
 
58
  ## Architecture
59
 
@@ -72,40 +42,16 @@ raw bytes → [trigram hash embed (4096 × 64)]
72
 
73
  ## Evaluation
74
 
75
- Evaluated on **CommonLID** (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines re-evaluated through the same pipeline (`iso639-lang` normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.
76
-
77
- ### Headline
78
 
79
  | Model | Params | Labels | Strict acc | Equiv acc | Macro F1 |
80
  |----------------------|-------:|-------:|----------:|----------:|-----------:|
81
  | OpenLID v2 | ~600 M | 200 | 55.77 % | 70.19 % | 0.6390 |
82
  | fastText-218 (NLLB) | ~600 M | 218 | 59.53 % | 71.64 % | 0.6590 |
83
- | GlotLID v3 | ~600 M | 2 102 | 57.69 % | 71.26 % | 0.6729 |
84
- | **CommonLingua v7.2.1** | **2.35 M** | **334** | **77.63 %** | **82.92 %** | **0.7879** |
85
-
86
- CommonLingua reaches **+11.5 macro F1** over the next best baseline with **~250× fewer parameters**. The full per-language F1 breakdown ships in `eval_per_language.json`.
87
-
88
- ### African subset
89
 
90
- CommonLID's African subset (17 languages with ≥ 100 gold samples — the regime where OpenLID/GlotLID/fastText reportedly underperform):
91
-
92
- | Model | African macro F1 |
93
- |---|---:|
94
- | OpenLID v2 | 0.5xx |
95
- | GlotLID v3 | 0.725 |
96
- | **CommonLingua v7.2.1** | **0.7222** |
97
-
98
- Notably, CommonLingua reaches **F1 = 0.975** on Amharic — a language Lingua does not support.
99
-
100
- ### fp32 vs bf16
101
-
102
- The bf16 build is half the disk size and ~2.4× faster on H100, with **no measurable quality drop**: 0 of the 72 evaluated languages drift by more than 0.01 F1.
103
-
104
- | Build | Disk | Strict acc | Equiv acc | Macro F1 | African F1 | Lingua F1 |
105
- |---|---:|---:|---:|---:|---:|---:|
106
- | **fp32** (default) | 9.4 MB | 0.7763 | 0.8292 | **0.7879** | 0.7222 | 0.8806 |
107
- | **bf16** | 4.7 MB | 0.7763 | 0.8292 | **0.7879** | 0.7221 | 0.8804 |
108
- | Δ | −50 % | 0 | 0 | 0 | −0.0002 | −0.0003 |
109
 
110
  ### Throughput
111
 
@@ -120,33 +66,30 @@ Texts/sec (one paragraph = one text, ≤ 512 bytes input, padded). Real CommonLi
120
  | Sapphire Rapids CPU (8 threads) | bs=32 | _PENDING_ | _PENDING_ | _PENDING_ |
121
  | Sapphire Rapids CPU (1 thread) | bs=32 | _PENDING_ | _PENDING_ | _PENDING_ |
122
 
123
- The press release that previously circulated cited "20 t/s on CPU, 3 000 t/s on GPU" — the actual GPU figure is **~9× higher** in fp32 and **~22× higher** in bf16. The bf16 build is recommended whenever the host supports it (essentially: anything Ampere or newer).
124
 
125
- ## Training data
 
 
126
 
127
- Trained on **2,482,568 paragraphs across 334 languages**, drawn entirely from open-licensed and public-domain sources. Wikipedia provides the bulk (~93 %); the long tail is filled by Pralekha (Indic), VOA Africa, Cultural Heritage, OpenAlex (Indo-Malay journal data + African academic), Common Corpus adversarial pulls, Perseus / OpenPecha / eBible / Sefaria / Ben-Yehuda / Krike-Krake (ancient and minority-script corpora).
 
128
 
129
- Per-source contributions, license attribution, and full schema are documented in [PleIAs/CommonLingua-Train](https://huggingface.co/datasets/PleIAs/CommonLingua-Train).
130
 
131
- ## Known limitations
132
-
133
- 1. **Arabic dialect cluster** — Modern Standard (`arb`) is robust (F1 ≈ 0.95), but Moroccan (`ary`, F1 ≈ 0.47) and Egyptian (`arz`, F1 ≈ 0.25) Arabic are structurally hard: the dialects share large stretches of vocabulary with MSA and with each other. Not a data-volume problem; needs targeted corpus.
134
- 2. **Indonesian / Malay (`ind` / `msa`)** — ~48 % msa error rate. Adding 20 k journal-provenance rows for each gave only marginal improvement; this pair will likely need supervised disambiguation features beyond byte-level signal.
135
- 3. **Estonian (`est`) attractor** — accumulates ~750 false positives from unrelated languages. Model quirk to investigate; in practice a confidence threshold of 0.7 mostly removes spurious `est` predictions.
136
- 4. **Lingala (`lin`)** — CommonLID's gold has the labels for `lin` reversed with Tiv/Yoruba (paper acknowledged a related labelling issue). Our F1 = 0 on this class is a benchmark artefact, not a model failure. Real-world `lin` predictions are correct on internal held-out data.
137
- 5. **Short text (<50 chars)** — confidence drops sharply. The model is **not** intended for short-query LID; use CLD3 or a query-tuned model for that regime.
138
-
139
- ## Files
140
-
141
- | File | Description |
142
- |---|---|
143
- | `model.pt` | fp32 checkpoint (9.4 MB) |
144
- | `model.bf16.pt` | bf16 checkpoint (4.7 MB) |
145
- | `model.py` | ByteHybrid v2 architecture |
146
- | `predict.py` | Standalone CLI (no `commonlingua` package required) |
147
- | `config.json` | model config |
148
- | `lang2idx.json` | 334-class label map |
149
- | `eval_per_language.json` | full per-language F1 on CommonLID |
150
 
151
  ## Citation
152
 
 
17
 
18
  # CommonLingua
19
 
20
+ Byte-level language identification model for 334 languages — 2.35 M parameters, 5-9 MB on disk, runs on CPU.
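The 5-9 MB range can be sanity-checked from the parameter count. This is a back-of-the-envelope sketch, assuming the checkpoints are dominated by raw weights (no optimizer state or sizeable metadata) and taking the exact count of 2,347,854 parameters from the earlier version of this README. The helper name `checkpoint_mb` is illustrative only.

```python
# Rough checkpoint size estimate: parameters x bytes per parameter.
# Assumption: the file holds little beyond the raw weights.
PARAMS = 2_347_854  # parameter count quoted in the earlier README table
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def checkpoint_mb(params: int, dtype: str) -> float:
    """Return the approximate checkpoint size in megabytes (10^6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


fp32_mb = round(checkpoint_mb(PARAMS, "fp32"), 1)
bf16_mb = round(checkpoint_mb(PARAMS, "bf16"), 1)
print(fp32_mb, bf16_mb)
```

Under those assumptions the fp32 build comes out at about 9.4 MB and the bf16 build at about 4.7 MB, which matches the two ends of the quoted range.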
21
 
22
+ CommonLingua sorts raw web, PDF, and digitised text into 334 ISO 639-3 language buckets so it can feed downstream training pipelines. It was built and trained at [PleIAs](https://pleias.fr) for curating [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), with a particular focus on the long tail: 61 African languages are supported, including languages with almost no coverage in existing language identification tools.
23
 
24
+ CommonLingua is trained exclusively on open data under free licenses. We release the training dataset, made up of 2,482,568 paragraphs across 334 languages, drawn from Structured Wikipedia and other Common Corpus subsets.
25
 
26
+ Per-source contributions, license attribution, and full schema are documented in [PleIAs/CommonLingua-Train](https://huggingface.co/datasets/PleIAs/CommonLingua-Train).
27
 
28
  ## Architecture
29
 
 
42
 
43
  ## Evaluation
44
 
45
+ Evaluated on **CommonLID** (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines are re-evaluated through the same pipeline (`iso639-lang` normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.
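To make the three reported metrics concrete, here is a minimal sketch of strict accuracy, equivalence-class accuracy, and macro F1. The `EQUIV_CLASS` mapping and the toy labels are hypothetical placeholders: the classes actually collapsed by the evaluation pipeline are not listed here, and the real pipeline may differ in details such as which labels enter the macro average.

```python
from collections import Counter

# Hypothetical equivalence classes, for illustration only.
EQUIV_CLASS = {"bos": "hbs", "hrv": "hbs", "srp": "hbs"}


def collapse(label: str) -> str:
    """Map a label to its equivalence class (identity if it has none)."""
    return EQUIV_CLASS.get(label, label)


def accuracy(gold, pred, equiv=False):
    """Strict accuracy, or equivalence-class accuracy when equiv=True."""
    if equiv:
        gold = [collapse(g) for g in gold]
        pred = [collapse(p) for p in pred]
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1, averaged over gold labels (assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = []
    for label in set(gold):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["fra", "fra", "hrv", "srp", "amh"]
pred = ["fra", "fra", "srp", "srp", "fra"]
strict = accuracy(gold, pred)
equivalent = accuracy(gold, pred, equiv=True)
f1 = macro_f1(gold, pred)
print(strict, equivalent, round(f1, 4))
```

Because macro F1 weights every language equally, a model that does well only on high-resource languages scores poorly, which is why it is the headline metric for a long-tail identifier.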
 
 
46
 
47
  | Model | Params | Labels | Strict acc | Equiv acc | Macro F1 |
48
  |----------------------|-------:|-------:|----------:|----------:|-----------:|
49
  | OpenLID v2 | ~600 M | 200 | 55.77 % | 70.19 % | 0.6390 |
50
  | fastText-218 (NLLB) | ~600 M | 218 | 59.53 % | 71.64 % | 0.6590 |
51
+ | GlotLID v3 | ~600 M | **2 102** | 57.69 % | 71.26 % | 0.6729 |
52
+ | **CommonLingua v7.2.1** | 2.35 M | 334 | **77.63 %** | **82.92 %** | **0.7879** |
53
 
54
+ CommonLingua reaches +11.5 macro F1 points (0.7879 vs. 0.6729) over the next best baseline, GlotLID v3.
55
 
56
  ### Throughput
57
 
 
66
  | Sapphire Rapids CPU (8 threads) | bs=32 | _PENDING_ | _PENDING_ | _PENDING_ |
67
  | Sapphire Rapids CPU (1 thread) | bs=32 | _PENDING_ | _PENDING_ | _PENDING_ |
68
 
69
+ ## Quick start
70
 
71
+ ```bash
72
+ pip install "git+https://github.com/PleIAs/bytehybrid-lid#egg=commonlingua[hub]"
73
+ ```
74
 
75
+ ```python
76
+ from commonlingua import LID
77
 
78
+ lid = LID.from_pretrained("PleIAs/CommonLingua") # auto-downloads
79
+ # Use the bf16 build for 2× speedup on GPU at no measurable quality cost:
80
+ # lid = LID.from_pretrained("PleIAs/CommonLingua", dtype="bf16")
81
+
82
+ text = (
83
+ "Wikipédia est une encyclopédie universelle, multilingue, créée par Jimmy "
84
+ "Wales et Larry Sanger le 15 janvier 2001 et fonctionnant selon le principe "
85
+ "du wiki."
86
+ )
87
+ r = lid.predict(text)
88
+ print(r.lang, r.confidence) # fra 0.99
89
+ ```
90
+
91
+ The intended workload is **paragraph-level corpus curation**. For batch annotation of large parquet shards, see `predict_parquet` in the package README.
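For corpus curation it is common to drop low-confidence predictions. The sketch below shows one way to do that, assuming only that `predict` returns an object exposing `.lang` and `.confidence` as in the snippet above. The `Prediction` class, the `fake_predict` stub, and the `keep_confident` helper are hypothetical and not part of the `commonlingua` package; the 0.7 threshold and the 50-character floor come from the limitations noted in the earlier README text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class Prediction:
    """Stand-in for the result object; only .lang and .confidence are assumed."""
    lang: str
    confidence: float


def keep_confident(
    texts: Iterable[str],
    predict: Callable[[str], Prediction],
    min_confidence: float = 0.7,
    min_chars: int = 50,
) -> Iterator[Tuple[str, str]]:
    """Yield (text, lang) pairs that pass the length and confidence filters."""
    for text in texts:
        if len(text) < min_chars:
            continue  # short text: confidence is unreliable, skip it
        result = predict(text)
        if result.confidence >= min_confidence:
            yield text, result.lang


def fake_predict(text: str) -> Prediction:
    """Hypothetical stub so the example runs without the model."""
    if "encyclopédie" in text:
        return Prediction("fra", 0.99)
    return Prediction("est", 0.40)


docs = [
    "Wikipédia est une encyclopédie universelle, multilingue, créée par Jimmy Wales.",
    "zzzz qqqq xxxx zzzz qqqq xxxx zzzz qqqq xxxx zzzz qqqq xxxx zzzz",
    "Bonjour",
]
kept = list(keep_confident(docs, fake_predict))
print(kept)
```

With the real package, passing `lid.predict` in place of `fake_predict` should work as long as the returned object has the same two attributes.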
92
 
93
 
94
  ## Citation
95