Update README.md
README.md
CHANGED
@@ -27,6 +27,11 @@ Since CommonLingua is trained exclusively on open data under free license, we re
 
 CommonLingua uses a new original architecture, optimized for task accuracy in an extremely small model size range.
 
+<p align="center">
+<img width="80%" src="common_lingua.png">
+</p>
+
+Main features:
 - **No tokenizer.** The model operates directly on raw UTF-8 bytes, padded to 512. This makes it inherently script-agnostic — Latin, Arabic, Ethiopic, N'Ko, Tifinagh, Devanagari, CJK, all handled by the same byte stream.
 - **Trigram hash embedding**: a polynomial rolling hash of byte 3-grams indexes a 4096-bucket embedding table. Hash collisions act as regularisation. Our ablations showed the added signal improved macro F1 by +1.2 points over the non-gram baseline.
 - **Causal Conv1D × 3** captures local byte patterns (script ranges, common digraphs, morpheme boundaries).
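The three features listed in the README can be pictured with a minimal, dependency-free Python sketch. This is not CommonLingua's actual code. The hash base (`HASH_BASE`), the zero pad byte (`PAD_BYTE`), the left-padding of the first two positions, and all function names are assumptions made for illustration; only the 512-byte length and the 4096 buckets come from the README. The hash here is computed directly for each 3-byte window rather than rolled incrementally, which gives the same value for a fixed window size.

```python
MAX_LEN = 512        # pad/truncate length stated in the README
NUM_BUCKETS = 4096   # embedding-table size stated in the README
HASH_BASE = 257      # assumption: any base above 255 suits a byte polynomial
PAD_BYTE = 0         # assumption: zero is used as the padding value


def to_padded_bytes(text: str, max_len: int = MAX_LEN) -> list[int]:
    """Encode text as UTF-8, then truncate or right-pad to max_len byte values."""
    raw = list(text.encode("utf-8"))[:max_len]
    return raw + [PAD_BYTE] * (max_len - len(raw))


def trigram_buckets(byte_ids: list[int]) -> list[int]:
    """Return one bucket index per position, hashed from the 3 bytes ending there.

    The sequence is left-padded with two PAD_BYTE values, so position t depends
    only on bytes t-2..t (an assumption chosen to match the causal convolutions).
    """
    padded = [PAD_BYTE, PAD_BYTE] + byte_ids
    buckets = []
    for t in range(len(byte_ids)):
        b0, b1, b2 = padded[t], padded[t + 1], padded[t + 2]
        h = (b0 * HASH_BASE + b1) * HASH_BASE + b2  # polynomial in HASH_BASE
        buckets.append(h % NUM_BUCKETS)
    return buckets


def causal_conv1d(xs: list[float], kernel: list[float]) -> list[float]:
    """Single-channel causal convolution: output[t] uses only xs[t-k+1..t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + xs  # left zero-padding keeps the output length
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(xs))
    ]


if __name__ == "__main__":
    ids = to_padded_bytes("ሰላም")       # Ethiopic text, 3 bytes per character
    print(len(ids), ids[:9])
    print(trigram_buckets(ids)[:9])
```

In a real model the bucket indices would select rows of a learned embedding table and the convolutions would be multi-channel layers with learned kernels. The sketch only shows how the indices are derived and why no output position can see bytes that come after it.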