Pclanglais committed on
Commit d3c2550 · verified · 1 Parent(s): 4fa851b

Update README.md

Files changed (1)
  1. README.md +5 -0
README.md CHANGED
@@ -27,6 +27,11 @@ Since CommonLingua is trained exclusively on open data under free license, we re
 
  CommonLingua uses a new original architecture, optimized for task accuracy in an extremely small model size range.
 
+ <p align="center">
+ <img width="80%" src="common_lingua.png">
+ </p>
+
+ Main features:
  - **No tokenizer.** The model operates directly on raw UTF-8 bytes, padded to 512. This makes it inherently script-agnostic — Latin, Arabic, Ethiopic, N'Ko, Tifinagh, Devanagari, CJK, all handled by the same byte stream.
  - **Trigram hash embedding**: a polynomial rolling hash of byte 3-grams indexes a 4096-bucket embedding table. Hash collisions act as regularisation. Our ablations showed the added signal improved macro F1 by +1.2 points over the non-gram baseline.
  - **Causal Conv1D × 3** captures local byte patterns (script ranges, common digraphs, morpheme boundaries).
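
The three components listed above can be illustrated with a minimal pure-Python sketch. It is not the CommonLingua implementation: the README only states the 512-byte padding, the 4096-bucket table, and the use of a polynomial hash over byte 3-grams. The hash base (`HASH_BASE`), the padding byte (`PAD_BYTE`), truncation of long inputs, the left-padded trigram window, and all function names are assumptions made for illustration. The hash is also computed directly per window rather than incrementally.

```python
from typing import List

MAX_LEN = 512        # from the README: inputs are padded to 512 bytes
NUM_BUCKETS = 4096   # from the README: size of the trigram embedding table
HASH_BASE = 257      # assumption: the README does not give the polynomial base
PAD_BYTE = 0         # assumption: the padding value is not stated


def encode_bytes(text: str, max_len: int = MAX_LEN) -> List[int]:
    """Turn text into a fixed-length list of raw UTF-8 byte values."""
    raw = list(text.encode("utf-8"))[:max_len]   # truncation is an assumption
    return raw + [PAD_BYTE] * (max_len - len(raw))


def trigram_bucket_ids(byte_ids: List[int]) -> List[int]:
    """Hash each byte 3-gram (b[i-2], b[i-1], b[i]) into a bucket index."""
    # Left-pad so that position i only looks at itself and the two bytes before it.
    padded = [PAD_BYTE, PAD_BYTE] + byte_ids
    ids = []
    for i in range(len(byte_ids)):
        b0, b1, b2 = padded[i], padded[i + 1], padded[i + 2]
        h = (b0 * HASH_BASE + b1) * HASH_BASE + b2   # polynomial hash of 3 bytes
        ids.append(h % NUM_BUCKETS)                  # collisions are expected
    return ids


def causal_conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Single-channel causal convolution: output[t] uses only signal[:t + 1]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)          # zero-pad on the left only
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]
```

In a real model the bucket indices would select rows of a learned embedding table, and the convolution would run over multi-channel embeddings rather than a single list of floats. The single-channel version here only shows the causal padding pattern.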