PleIAs/CommonLingua

Text Classification

language-identification

corpus-curation

african-languages


Pclanglais committed 26 days ago

Commit d3c2550 · verified · 1 parent: 4fa851b

Update README.md

Files changed (1)

README.md +5 -0


@@ -27,6 +27,11 @@ Since CommonLingua is trained exclusively on open data under free license, we re
 CommonLingua uses a novel architecture, optimized for task accuracy at an extremely small model size.
+<p align="center">
+  <img width="80%" src="common_lingua.png">
+</p>
+Main features:
 - **No tokenizer.** The model operates directly on raw UTF-8 bytes, padded to a length of 512. This makes it inherently script-agnostic: Latin, Arabic, Ethiopic, N'Ko, Tifinagh, Devanagari, and CJK are all handled by the same byte stream.
 - **Trigram hash embedding.** A polynomial rolling hash of byte 3-grams indexes a 4096-bucket embedding table. Hash collisions act as regularisation. In our ablations, the added signal improved macro F1 by 1.2 points over the baseline without n-gram features.
 - **Causal Conv1D × 3.** Three causal Conv1D layers capture local byte patterns (script ranges, common digraphs, morpheme boundaries).
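The three features listed above can be illustrated with a minimal NumPy sketch. It is not the CommonLingua implementation. Only the 512-byte input length and the 4096-bucket table come from the README. The hash base, the zero padding byte, the embedding width, the kernel size, the ReLU activation, and all function names are assumptions chosen for illustration. The hash is also computed in closed form over each 3-byte window rather than as an incremental rolling update.

```python
import numpy as np

MAX_LEN = 512        # from the README: inputs are padded to 512
NUM_BUCKETS = 4096   # from the README: size of the trigram embedding table
PAD = 0              # assumption: zero byte used as padding
BASE = 257           # assumption: base of the polynomial hash
EMBED_DIM = 16       # assumption: embedding width, for illustration only


def encode_bytes(text: str) -> np.ndarray:
    """Encode text as UTF-8 bytes, truncated or zero-padded to MAX_LEN."""
    raw = text.encode("utf-8")[:MAX_LEN]
    ids = np.full(MAX_LEN, PAD, dtype=np.int64)
    ids[: len(raw)] = np.frombuffer(raw, dtype=np.uint8)
    return ids


def trigram_buckets(ids: np.ndarray) -> np.ndarray:
    """Hash the byte trigram ending at each position into a bucket index.

    The sequence is left-padded with two PAD bytes, so the trigram at
    position t covers bytes t-2, t-1 and t only.
    """
    padded = np.concatenate([np.full(2, PAD, dtype=np.int64), ids])
    b0, b1, b2 = padded[:-2], padded[1:-1], padded[2:]
    return (b0 * BASE * BASE + b1 * BASE + b2) % NUM_BUCKETS


def causal_conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Causal 1D convolution.

    x has shape (T, C_in), kernel has shape (K, C_in, C_out).
    Output row t depends only on input rows t-K+1 .. t.
    """
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x])
    rows = [np.einsum("kc,kco->o", padded[t : t + k], kernel)
            for t in range(x.shape[0])]
    return np.stack(rows)


rng = np.random.default_rng(0)
table = rng.normal(size=(NUM_BUCKETS, EMBED_DIM))   # randomly initialised table
kernels = [rng.normal(size=(3, EMBED_DIM, EMBED_DIM)) * 0.1 for _ in range(3)]

ids = encode_bytes("ሰላም world")          # mixed Ethiopic and Latin input
hidden = table[trigram_buckets(ids)]      # shape (MAX_LEN, EMBED_DIM)
for kernel in kernels:                    # three stacked causal conv layers
    hidden = np.maximum(causal_conv1d(hidden, kernel), 0.0)
```

Because the input is a byte sequence, the same code path handles any script without a vocabulary. A real model would learn the embedding table and kernels and add a classification head, which this sketch omits.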