ArinUmut commited on
Commit
89ebd36
·
verified ·
1 Parent(s): b61bb18

Add README.md

Browse files
Files changed (1) hide show
  1. README.md +93 -0
README.md ADDED
@@ -0,0 +1,93 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Pan-Turkic BPE Tokenizer
2
+
3
+ A SentencePiece BPE tokenizer with 65,536 vocabulary size, purpose-built for the Turkic language family. Covers Latin, Cyrillic, and Arabic scripts used across Turkic languages.
4
+
5
+ ## Overview
6
+
7
+ Most existing tokenizers fail on Turkic languages outside of Turkish — particularly on Cyrillic-script languages like Kazakh, Kyrgyz, Bashkir, and Tatar, where they fall back to byte-level tokenization. This tokenizer was trained specifically on a pan-Turkic corpus covering 20+ languages, and handles all major scripts natively.
8
+
9
+ **Languages with strong coverage:**
10
+
11
+ | Language | Script |
12
+ |---|---|
13
+ | Turkish | Latin |
14
+ | Kazakh | Cyrillic |
15
+ | Kyrgyz | Cyrillic |
16
+ | Uzbek | Latin |
17
+ | Uyghur | Arabic |
18
+ | Bashkir | Cyrillic |
19
+ | Tatar | Cyrillic |
20
+ | Azerbaijani | Latin |
21
+ | Crimean Tatar | Latin |
22
+ | Turkmen | Latin |
23
+
24
+ ## FLORES-200 Fertility Benchmark
25
+
26
+ Fertility = average tokens per word (lower is better). Evaluated on 1,012 sentences per language from the FLORES-200 devtest set.
27
+
28
+ | Language | **Ours** | Kumru-2B | GPT-2 | mT5 | NLLB-200 | XLM-R |
29
+ |---|---|---|---|---|---|---|
30
+ | Turkish | 1.78 | **1.59** | 3.79 | 2.16 | 2.00 | 1.83 |
31
+ | Kazakh (Cyrl) | **1.79** | 10.96 | 9.25 | 2.35 | 2.09 | 2.04 |
32
+ | Kyrgyz (Cyrl) | **1.73** | 11.18 | 8.95 | 2.57 | 2.21 | 2.07 |
33
+ | Uzbek (Latn) | **1.96** | 3.40 | 3.44 | 2.57 | 2.24 | 2.26 |
34
+ | Uyghur (Arab) | **1.72** | 9.42 | 10.92 | 4.91 | 2.45 | 2.46 |
35
+ | Bashkir (Cyrl) | **1.92** | 10.93 | 9.07 | 3.01 | 2.10 | 3.52 |
36
+ | Tatar (Cyrl) | **1.88** | 10.74 | 8.72 | 2.63 | 2.06 | 3.07 |
37
+ | Azerbaijani (Latn) | **1.72** | 3.34 | 4.92 | 2.40 | 2.16 | 1.86 |
38
+ | Crimean Tatar (Latn) | 2.19 | 2.61 | 3.75 | 2.49 | **2.14** | 2.36 |
39
+ | Turkmen (Latn) | 2.48 | 3.56 | 4.27 | 3.18 | **2.33** | 3.05 |
40
+ | **Turkic Avg (10 langs)** | **1.92** | 6.77 | 6.71 | 2.83 | 2.18 | 2.45 |
41
+ | English | 2.27 | 2.01 | **1.24** | 1.55 | 1.41 | 1.41 |
42
+
43
+ Vocab sizes: Ours 65,536 · Kumru-2B 50,176 · GPT-2 50,257 · mT5 250,100 · NLLB-200 256,204 · XLM-R 250,002
44
+
45
+ **Key result:** Best on 7 of 10 Turkic languages. Achieves similar Turkic coverage to NLLB-200 (256K vocab) with a 4× smaller vocabulary.
46
+
47
+ For Cyrillic-script Turkic languages (Kazakh, Kyrgyz, Bashkir, Tatar), competing tokenizers degrade to byte-level encoding (10–11 tokens/word). This tokenizer maintains ~1.8 tokens/word on the same languages.
48
+
49
+ ## Notable Examples
50
+
51
+ Morphologically complex Turkish words encode efficiently:
52
+
53
+ ```python
54
+ "Cumhurbaşkanlığı" → 1 token # (Presidency)
55
+ "yapamayacaklarından" → 3 tokens # (from those they cannot do)
56
+ "sağlıklaştırılamayabileceklerden" → 6 tokens
57
+ ```
58
+
59
+ Perfect round-trip for all supported scripts:
60
+ ```python
61
+ encode → decode # lossless for Latin, Cyrillic, and Arabic Turkic scripts
62
+ ```
63
+
64
+ ## Usage
65
+
66
+ ```python
67
+ from transformers import AutoTokenizer
68
+
69
+ tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer")
70
+
71
+ # Turkish
72
+ tokenizer.encode("Türkiye Cumhuriyeti")
73
+
74
+ # Kazakh (Cyrillic)
75
+ tokenizer.encode("Алматы Қазақстанның ең үлкен қаласы")
76
+
77
+ # Uyghur (Arabic)
78
+ tokenizer.encode("بىز ئۇيغۇر تىلىدە سۆزلىشىمىز")
79
+ ```
80
+
81
+ ## Specs
82
+
83
+ | Property | Value |
84
+ |---|---|
85
+ | Type | SentencePiece BPE |
86
+ | Vocabulary size | 65,536 |
87
+ | Scripts | Latin, Cyrillic, Arabic |
88
+ | Languages trained on | 20+ Turkic languages |
89
+ | Benchmark | FLORES-200 devtest |
90
+
91
+ ## Limitations
92
+
93
+ - English fertility (2.27) is higher than English-specialized tokenizers, as the vocabulary is optimized for Turkic languages.