Pclanglais committed on
Commit
1480181
·
verified ·
1 Parent(s): 6da952f

Update README.md

Files changed (1)
  1. README.md +10 -18
README.md CHANGED
@@ -17,32 +17,24 @@ metrics:
17
 
18
  # CommonLingua
19
 
20
- Byte-level language identification model for 334 languages: 2.35 M parameters, 5-9 MB on disk, runs on CPU.
21
 
22
- CommonLingua sorts raw web, PDF, and digitised text into 334 ISO 639-3 language buckets so it can feed downstream training pipelines. It was built and trained at [PleIAs](https://pleias.fr) for curating [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus), with a particular focus on the long tail — 61 African languages are supported, including languages with almost no coverage.
23
 
24
- CommonLingua is trained exclusively on open data under free licenses. We release the original dataset made of 2,482,568 paragraphs across 334 languages, drawn from Structured Wikipedia and other Common Corpus subsets.
25
-
26
- Per-source contributions, license attribution, and full schema are documented in [PleIAs/CommonLingua-Train](https://huggingface.co/datasets/PleIAs/CommonLingua-Train).
27
 
28
  ## Architecture
29
 
30
- ```
31
- raw bytes → [trigram hash embed (4096 × 64)]
32
- ↓ ↘
33
- + ────────→ 3× depthwise Conv1D (k=15) → 1× attention (RoPE, 4 heads)
34
-
35
- masked mean-pool → 334 logits
36
- ```
37
 
38
  - **No tokenizer.** The model operates directly on raw UTF-8 bytes, padded to 512. This makes it inherently script-agnostic — Latin, Arabic, Ethiopic, N'Ko, Tifinagh, Devanagari, CJK, all handled by the same byte stream.
39
- - **Trigram hash embedding** (added in v7.2.1): a polynomial rolling hash of byte 3-grams indexes a 4096-bucket embedding table. Hash collisions act as regularisation. Adds ~262 k parameters and 2% inference overhead, but improves macro F1 by +1.2 points and African F1 by +1.5 points over the no-n-gram baseline.
40
  - **Causal Conv1D × 3** captures local byte patterns (script ranges, common digraphs, morpheme boundaries).
41
  - **Bidirectional attention × 1 with RoPE** captures global structure across the paragraph.
42
 
43
  ## Evaluation
44
 
45
- Evaluated on **CommonLID** (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines are re-evaluated through the same pipeline (`iso639-lang` normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.
46
 
47
  | Model | Params | Labels | Strict acc | Equiv acc | Macro F1 |
48
  |----------------------|-------:|-------:|----------:|----------:|-----------:|
@@ -51,11 +43,11 @@ Evaluated on **CommonLID** (Ortiz Suárez et al. 2026): 376 k held-out paragraph
51
  | GlotLID v3 | ~600 M | **2 102** | 57.69 % | 71.26 % | 0.6729 |
52
  | **CommonLingua v7.2.1** | 2.35 M | 334 | **77.63 %** | **82.92 %** | **0.7879** |
53
 
54
- CommonLingua reaches +11.5 macro F1 over the next best baseline.
55
 
56
  ### Throughput
57
 
58
- Texts/sec (one paragraph = one text, ≤ 512 bytes input, padded). Real CommonLingua weights and the production code path:
59
 
60
  | Device | Setting | fp32 | bf16 | bf16 vs fp32 |
61
  |---|---|---:|---:|---:|
@@ -63,8 +55,8 @@ Texts/sec (one paragraph = one text, ≤ 512 bytes input, padded). Real CommonLi
63
  | H100 80GB (bs=1024) | | 10,892 | 26,130 | 2.4× |
64
  | H100 80GB (bs=256) | | 10,677 | 25,241 | 2.4× |
65
  | H100 80GB (bs=64) | low-latency| 10,025 | 22,625 | 2.3× |
66
- | Sapphire Rapids CPU (8 threads) | bs=32 | _PENDING_ | _PENDING_ | _PENDING_ |
67
- | Sapphire Rapids CPU (1 thread) | bs=32 | _PENDING_ | _PENDING_ | _PENDING_ |
68
 
69
  ## Quick start
70
 
 
17
 
18
  # CommonLingua
19
 
20
+ CommonLingua is a 2.35 million-parameter language identification model trained on 2,482,568 paragraphs from Structured Wikipedia and [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus). As of 2026, CommonLingua is the best-performing model on the CommonLID benchmark, at +11.5 macro F1 points over the next best baseline.
21
 
22
+ CommonLingua is based on a byte-level hybrid architecture combining three Conv1D layers with an attention layer. It was originally designed for large-scale classification of pretraining data and was deliberately trained on diverse data sources, especially realistic documents with OCR errors, with a particular focus on the long tail: 61 African languages are supported, including languages with almost no coverage.
23
 
24
+ Since CommonLingua is trained exclusively on open data under free licenses, we release the full [original dataset](https://huggingface.co/datasets/PleIAs/CommonLingua-Train), with per-source contributions and license attribution documented.
 
 
25
 
26
  ## Architecture
27
 
28
+ CommonLingua uses an original architecture, optimized for task accuracy at a very small model size (2.35 M parameters).
 
 
 
 
 
 
29
 
30
  - **No tokenizer.** The model operates directly on raw UTF-8 bytes, padded to 512. This makes it inherently script-agnostic — Latin, Arabic, Ethiopic, N'Ko, Tifinagh, Devanagari, CJK, all handled by the same byte stream.
31
+ - **Trigram hash embedding**: a polynomial rolling hash of byte 3-grams indexes a 4096-bucket embedding table. Hash collisions act as regularisation. In our ablations, the added signal improved macro F1 by +1.2 points over the no-n-gram baseline.
32
  - **Causal Conv1D × 3** captures local byte patterns (script ranges, common digraphs, morpheme boundaries).
33
  - **Bidirectional attention × 1 with RoPE** captures global structure across the paragraph.
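The byte input and trigram hashing described in the list above can be sketched roughly as follows. This is only an illustration, not the released implementation: the hash base (257), the zero padding value, the choice to hash each byte with its two preceding bytes, and the function names `encode_bytes` and `trigram_bucket_ids` are all assumptions. Only the 512-byte input length and the 4096 buckets come from the README.

```python
PAD_LEN = 512        # input length stated in the README
NUM_BUCKETS = 4096   # embedding table size stated in the README
HASH_BASE = 257      # assumption: any base larger than 255 would do


def encode_bytes(text, pad_len=PAD_LEN):
    """Return (padded byte ids, true length) for a UTF-8 string."""
    raw = text.encode("utf-8")[:pad_len]  # truncate at the byte level
    length = len(raw)
    # Assumption: pad with zeros; a mask derived from `length`
    # would exclude the padding from pooling.
    return list(raw) + [0] * (pad_len - length), length


def trigram_bucket_ids(byte_ids, length, base=HASH_BASE, num_buckets=NUM_BUCKETS):
    """Polynomial hash of each byte with its two predecessors, modulo the table size."""
    ids = []
    for i in range(length):
        # Positions before the start of the text are treated as byte 0 (assumption).
        b0 = byte_ids[i - 2] if i >= 2 else 0
        b1 = byte_ids[i - 1] if i >= 1 else 0
        b2 = byte_ids[i]
        ids.append((b0 * base * base + b1 * base + b2) % num_buckets)
    return ids


padded, n = encode_bytes("Mbote na bino")
buckets = trigram_bucket_ids(padded, n)  # one bucket index per input byte
```

Because there are far more possible byte trigrams (256³) than buckets, distinct trigrams necessarily share embedding rows, which is the collision effect the README describes as regularisation.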
34
 
35
  ## Evaluation
36
 
37
+ We evaluated CommonLingua on CommonLID (Ortiz Suárez et al. 2026): 376 k held-out paragraphs, 200+ languages. All baselines are re-evaluated through the same pipeline (`iso639-lang` normalisation, equivalence-class collapsing applied identically) for an apples-to-apples comparison.
38
 
39
  | Model | Params | Labels | Strict acc | Equiv acc | Macro F1 |
40
  |----------------------|-------:|-------:|----------:|----------:|-----------:|
 
43
  | GlotLID v3 | ~600 M | **2 102** | 57.69 % | 71.26 % | 0.6729 |
44
  | **CommonLingua v7.2.1** | 2.35 M | 334 | **77.63 %** | **82.92 %** | **0.7879** |
45
 
46
+ CommonLingua reaches +11.5 macro F1 points over the next best baseline. We excluded Lingala from our evaluation because most CommonLID samples labelled as Lingala turned out to belong to other closely related languages.
47
 
48
  ### Throughput
49
 
50
+ We measured throughput in texts/sec (one paragraph = one text, ≤ 512 bytes input, padded).
51
 
52
  | Device | Setting | fp32 | bf16 | bf16 vs fp32 |
53
  |---|---|---:|---:|---:|
 
55
  | H100 80GB (bs=1024) | | 10,892 | 26,130 | 2.4× |
56
  | H100 80GB (bs=256) | | 10,677 | 25,241 | 2.4× |
57
  | H100 80GB (bs=64) | low-latency| 10,025 | 22,625 | 2.3× |
58
+ | Sapphire Rapids CPU (8 threads) | bs=32 | 183 | **553** | 3.0× |
59
+ | Sapphire Rapids CPU (1 thread) | bs=32 | 44 | **114** | 2.6× |
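To read the throughput table in practical terms, the sketch below recomputes the bf16 vs fp32 ratio for the two CPU rows and converts a texts/sec figure into wall-clock hours for a given number of paragraphs. The helper names `speedup` and `hours_to_process` are hypothetical, and the estimate assumes throughput stays constant over the whole run.

```python
def speedup(bf16_tps, fp32_tps):
    """Ratio of bf16 to fp32 throughput, rounded as in the table."""
    return round(bf16_tps / fp32_tps, 1)


def hours_to_process(num_texts, texts_per_sec):
    """Wall-clock hours to classify num_texts at a constant throughput."""
    return num_texts / texts_per_sec / 3600


# CPU rows from the table above (texts/sec at bs=32).
cpu_8_threads = speedup(553, 183)
cpu_1_thread = speedup(114, 44)

# Example: time to label a corpus the size of the training set
# (2,482,568 paragraphs) on 8 CPU threads in bf16.
estimate = hours_to_process(2_482_568, 553)
print(f"8 threads: {cpu_8_threads}x, 1 thread: {cpu_1_thread}x, ~{estimate:.2f} h")
```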
60
 
61
  ## Quick start
62