Instructions to use ArinUmut/pan-turkic-tokenizer with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ArinUmut/pan-turkic-tokenizer with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("ArinUmut/pan-turkic-tokenizer", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| language: | |
| - tr | |
| - kk | |
| - ky | |
| - uz | |
| - ug | |
| - ba | |
| - tt | |
| - az | |
| - crh | |
| - tk | |
| license: apache-2.0 | |
| tags: | |
| - tokenizer | |
| - sentencepiece | |
| - bpe | |
| - turkic | |
| - multilingual | |
| library_name: transformers | |
| # Pan-Turkic BPE Tokenizer | |
| A SentencePiece BPE tokenizer with 65,536 vocabulary size, purpose-built for the Turkic language family. Covers Latin, Cyrillic, and Arabic scripts used across Turkic languages. | |
| ## Overview | |
| Most existing tokenizers fail on Turkic languages outside of Turkish — particularly on Cyrillic-script languages like Kazakh, Kyrgyz, Bashkir, and Tatar, where they fall back to byte-level tokenization. This tokenizer was trained specifically on a pan-Turkic corpus covering 20+ languages, and handles all major scripts natively. | |
| **Languages with strong coverage:** | |
| | Language | Script | | |
| |---|---| | |
| | Turkish | Latin | | |
| | Kazakh | Cyrillic | | |
| | Kyrgyz | Cyrillic | | |
| | Uzbek | Latin | | |
| | Uyghur | Arabic | | |
| | Bashkir | Cyrillic | | |
| | Tatar | Cyrillic | | |
| | Azerbaijani | Latin | | |
| | Crimean Tatar | Latin | | |
| | Turkmen | Latin | | |
| ## FLORES-200 Fertility Benchmark | |
| Fertility = average tokens per word (lower is better). Evaluated on 1,012 sentences per language from the FLORES-200 devtest set. | |
| | Language | **Ours** | Kumru-2B | GPT-2 | mT5 | NLLB-200 | XLM-R | | |
| |---|---|---|---|---|---|---| | |
| | Turkish | 1.78 | **1.59** | 3.79 | 2.16 | 2.00 | 1.83 | | |
| | Kazakh (Cyrl) | **1.79** | 10.96 | 9.25 | 2.35 | 2.09 | 2.04 | | |
| | Kyrgyz (Cyrl) | **1.73** | 11.18 | 8.95 | 2.57 | 2.21 | 2.07 | | |
| | Uzbek (Latn) | **1.96** | 3.40 | 3.44 | 2.57 | 2.24 | 2.26 | | |
| | Uyghur (Arab) | **1.72** | 9.42 | 10.92 | 4.91 | 2.45 | 2.46 | | |
| | Bashkir (Cyrl) | **1.92** | 10.93 | 9.07 | 3.01 | 2.10 | 3.52 | | |
| | Tatar (Cyrl) | **1.88** | 10.74 | 8.72 | 2.63 | 2.06 | 3.07 | | |
| | Azerbaijani (Latn) | **1.72** | 3.34 | 4.92 | 2.40 | 2.16 | 1.86 | | |
| | Crimean Tatar (Latn) | 2.19 | 2.61 | 3.75 | 2.49 | **2.14** | 2.36 | | |
| | Turkmen (Latn) | 2.48 | 3.56 | 4.27 | 3.18 | **2.33** | 3.05 | | |
| | **Turkic Avg (10 langs)** | **1.92** | 6.77 | 6.71 | 2.83 | 2.18 | 2.45 | | |
| | English | 2.27 | 2.01 | **1.24** | 1.55 | 1.41 | 1.41 | | |
| Vocab sizes: Ours 65,536 · Kumru-2B 50,176 · GPT-2 50,257 · mT5 250,100 · NLLB-200 256,204 · XLM-R 250,002 | |
| **Key result:** Best on 7 of 10 Turkic languages. Achieves similar Turkic coverage to NLLB-200 (256K vocab) with a 4× smaller vocabulary. | |
| For Cyrillic-script Turkic languages (Kazakh, Kyrgyz, Bashkir, Tatar), competing tokenizers degrade to byte-level encoding (10–11 tokens/word). This tokenizer maintains ~1.8 tokens/word on the same languages. | |
| ## Notable Examples | |
| Morphologically complex Turkish words encode efficiently: | |
| ```python | |
| "Cumhurbaşkanlığı" → 1 token # (Presidency) | |
| "yapamayacaklarından" → 3 tokens # (from those they cannot do) | |
| "sağlıklaştırılamayabileceklerden" → 6 tokens | |
| ``` | |
| Perfect round-trip for all supported scripts: | |
| ```python | |
| encode → decode # lossless for Latin, Cyrillic, and Arabic Turkic scripts | |
| ``` | |
| ## Usage | |
| ```python | |
| from transformers import AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer") | |
| # Turkish | |
| tokenizer.encode("Türkiye Cumhuriyeti") | |
| # Kazakh (Cyrillic) | |
| tokenizer.encode("Алматы Қазақстанның ең үлкен қаласы") | |
| # Uyghur (Arabic) | |
| tokenizer.encode("بىز ئۇيغۇر تىلىدە سۆزلىشىمىز") | |
| ``` | |
| ## Specs | |
| | Property | Value | | |
| |---|---| | |
| | Type | SentencePiece BPE | | |
| | Vocabulary size | 65,536 | | |
| | Scripts | Latin, Cyrillic, Arabic | | |
| | Languages trained on | 20+ Turkic languages | | |
| | Benchmark | FLORES-200 devtest | | |
| ## Limitations | |
| - English fertility (2.27) is higher than English-specialized tokenizers, as the vocabulary is optimized for Turkic languages. | |