# Pan-Turkic BPE Tokenizer

A SentencePiece BPE tokenizer with a 65,536-token vocabulary, purpose-built for the Turkic language family. It covers the Latin, Cyrillic, and Arabic scripts used across Turkic languages.

## Overview

Many existing tokenizers handle Turkic languages other than Turkish poorly, particularly Cyrillic-script languages such as Kazakh, Kyrgyz, Bashkir, and Tatar, where some fall back to byte-level tokenization. This tokenizer was trained on a pan-Turkic corpus covering 20+ languages and handles the Latin, Cyrillic, and Arabic scripts natively.

**Languages with strong coverage:**

| Language | Script |
|---|---|
| Turkish | Latin |
| Kazakh | Cyrillic |
| Kyrgyz | Cyrillic |
| Uzbek | Latin |
| Uyghur | Arabic |
| Bashkir | Cyrillic |
| Tatar | Cyrillic |
| Azerbaijani | Latin |
| Crimean Tatar | Latin |
| Turkmen | Latin |

## FLORES-200 Fertility Benchmark

Fertility is the average number of tokens per word (lower is better). It was evaluated on 1,012 sentences per language from the FLORES-200 devtest set.

| Language | **Ours** | Kumru-2B | GPT-2 | mT5 | NLLB-200 | XLM-R |
|---|---|---|---|---|---|---|
| Turkish | 1.78 | **1.59** | 3.79 | 2.16 | 2.00 | 1.83 |
| Kazakh (Cyrl) | **1.79** | 10.96 | 9.25 | 2.35 | 2.09 | 2.04 |
| Kyrgyz (Cyrl) | **1.73** | 11.18 | 8.95 | 2.57 | 2.21 | 2.07 |
| Uzbek (Latn) | **1.96** | 3.40 | 3.44 | 2.57 | 2.24 | 2.26 |
| Uyghur (Arab) | **1.72** | 9.42 | 10.92 | 4.91 | 2.45 | 2.46 |
| Bashkir (Cyrl) | **1.92** | 10.93 | 9.07 | 3.01 | 2.10 | 3.52 |
| Tatar (Cyrl) | **1.88** | 10.74 | 8.72 | 2.63 | 2.06 | 3.07 |
| Azerbaijani (Latn) | **1.72** | 3.34 | 4.92 | 2.40 | 2.16 | 1.86 |
| Crimean Tatar (Latn) | 2.19 | 2.61 | 3.75 | 2.49 | **2.14** | 2.36 |
| Turkmen (Latn) | 2.48 | 3.56 | 4.27 | 3.18 | **2.33** | 3.05 |
| **Turkic Avg (10 langs)** | **1.92** | 6.77 | 6.71 | 2.83 | 2.18 | 2.45 |
| English | 2.27 | 2.01 | **1.24** | 1.55 | 1.41 | 1.41 |

Vocabulary sizes: Ours 65,536 · Kumru-2B 50,176 · GPT-2 50,257 · mT5 250,100 · NLLB-200 256,204 · XLM-R 250,002

**Key result:** Lowest fertility on 7 of the 10 Turkic languages. Average Turkic fertility is 1.92 versus 2.18 for NLLB-200, the next best, with a vocabulary roughly 4× smaller (65,536 versus 256,204 tokens).
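
The summary figures can be rechecked from the table itself. The sketch below copies the reported fertility values into a dictionary and recomputes the Turkic average, the number of languages where this tokenizer has the lowest fertility, and the vocabulary ratio against NLLB-200. The names `TOKENIZERS`, `FERTILITY`, and the derived variables are illustrative only and are not part of this repository.

```python
# Column order of the benchmark table; index 0 is this tokenizer ("Ours").
TOKENIZERS = ["Ours", "Kumru-2B", "GPT-2", "mT5", "NLLB-200", "XLM-R"]

# Reported FLORES-200 fertility values, copied from the table (Turkic rows only).
FERTILITY = {
    "Turkish":       [1.78, 1.59, 3.79, 2.16, 2.00, 1.83],
    "Kazakh":        [1.79, 10.96, 9.25, 2.35, 2.09, 2.04],
    "Kyrgyz":        [1.73, 11.18, 8.95, 2.57, 2.21, 2.07],
    "Uzbek":         [1.96, 3.40, 3.44, 2.57, 2.24, 2.26],
    "Uyghur":        [1.72, 9.42, 10.92, 4.91, 2.45, 2.46],
    "Bashkir":       [1.92, 10.93, 9.07, 3.01, 2.10, 3.52],
    "Tatar":         [1.88, 10.74, 8.72, 2.63, 2.06, 3.07],
    "Azerbaijani":   [1.72, 3.34, 4.92, 2.40, 2.16, 1.86],
    "Crimean Tatar": [2.19, 2.61, 3.75, 2.49, 2.14, 2.36],
    "Turkmen":       [2.48, 3.56, 4.27, 3.18, 2.33, 3.05],
}

# Mean of the "Ours" column across the 10 Turkic languages.
ours = [row[0] for row in FERTILITY.values()]
turkic_avg = sum(ours) / len(ours)

# Number of languages where "Ours" has the lowest fertility in its row.
wins = sum(1 for row in FERTILITY.values() if row[0] == min(row))

# NLLB-200 vocabulary size divided by this tokenizer's vocabulary size.
vocab_ratio = 256_204 / 65_536

print(round(turkic_avg, 2), wins, round(vocab_ratio, 2))  # 1.92 7 3.91
```

The ratio comes out at about 3.9, which is where the "roughly 4×" figure comes from.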

For Cyrillic-script Turkic languages (Kazakh, Kyrgyz, Bashkir, Tatar), tokenizers without Cyrillic coverage (Kumru-2B and GPT-2 in the table above) degrade to byte-level encoding at roughly 9 to 11 tokens/word. This tokenizer stays between 1.73 and 1.92 tokens/word on the same languages.
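
As a rough illustration of the metric and of why byte-level fallback is so costly for Cyrillic text, here is a minimal sketch of a fertility computation. It assumes words are split on whitespace and that fertility is total tokens divided by total words over the corpus; the exact evaluation script is not given here, so both choices are assumptions. The helpers `fertility` and `byte_level_tokenize` are hypothetical names, and the byte-level function is a worst-case stand-in, not any of the benchmarked tokenizers.

```python
from typing import Callable, Iterable, List, Sequence


def fertility(tokenize: Callable[[str], Sequence], sentences: Iterable[str]) -> float:
    """Total tokens divided by total whitespace-separated words (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


def byte_level_tokenize(text: str) -> List[int]:
    """Worst-case stand-in: one token per UTF-8 byte of each word, spaces ignored."""
    return [byte for word in text.split() for byte in word.encode("utf-8")]


# Kazakh sample sentence taken from the Usage section of this README.
sample = ["Алматы Қазақстанның ең үлкен қаласы"]

# Each Cyrillic letter takes 2 bytes in UTF-8: 31 letters -> 62 bytes over 5 words.
print(fertility(byte_level_tokenize, sample))  # 12.4
```

To measure a real tokenizer, pass its tokenize function instead, for example `fertility(tokenizer.tokenize, sentences)` with `tokenizer` loaded through `AutoTokenizer` and `sentences` drawn from the FLORES-200 devtest set.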

## Notable Examples

Morphologically complex Turkish words encode efficiently:

```python
# With `tokenizer` loaded as in the Usage section; comments give the reported token counts.
len(tokenizer.tokenize("Cumhurbaşkanlığı"))                  # 1 token (Presidency)
len(tokenizer.tokenize("yapamayacaklarından"))               # 3 tokens (from those they cannot do)
len(tokenizer.tokenize("sağlıklaştırılamayabileceklerden"))  # 6 tokens
```
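
Encoding followed by decoding is reported to be a perfect round-trip for all supported scripts:

```python
# With `tokenizer` loaded as in Usage and any `text` in Latin, Cyrillic, or Arabic Turkic script:
assert tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)) == text  # lossless
```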
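
One way to check the round-trip property over several scripts is sketched below. The helper `failed_round_trips` and the `samples` dictionary are hypothetical names, and the sample strings are the ones from the Usage section. To keep the sketch runnable without downloading anything, it is exercised with a plain UTF-8 byte codec as a stand-in for the real tokenizer.

```python
from typing import Callable, Dict, List, Sequence


def failed_round_trips(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    samples: Dict[str, str],
) -> List[str]:
    """Return the labels of samples whose decode(encode(text)) differs from text."""
    return [label for label, text in samples.items() if decode(encode(text)) != text]


# One sample per script, taken from the Usage section of this README.
samples = {
    "Turkish (Latin)": "Türkiye Cumhuriyeti",
    "Kazakh (Cyrillic)": "Алматы Қазақстанның ең үлкен қаласы",
    "Uyghur (Arabic)": "بىز ئۇيغۇر تىلىدە سۆزلىشىمىز",
}


# Stand-in codec (raw UTF-8 bytes); replace with the real tokenizer to test it.
def utf8_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def utf8_decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


print(failed_round_trips(utf8_encode, utf8_decode, samples))  # []
```

With the actual tokenizer, the same helper could be called as `failed_round_trips(lambda t: tokenizer.encode(t, add_special_tokens=False), tokenizer.decode, samples)`. An empty list would be consistent with the lossless claim; any returned label points at a script where normalization or decoding changed the text.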

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer")

# Turkish
tokenizer.encode("Türkiye Cumhuriyeti")

# Kazakh (Cyrillic)
tokenizer.encode("Алматы Қазақстанның ең үлкен қаласы")

# Uyghur (Arabic)
tokenizer.encode("بىز ئۇيغۇر تىلىدە سۆزلىشىمىز")
```

## Specs

| Property | Value |
|---|---|
| Type | SentencePiece BPE |
| Vocabulary size | 65,536 |
| Scripts | Latin, Cyrillic, Arabic |
| Languages trained on | 20+ Turkic languages |
| Benchmark | FLORES-200 devtest |

## Limitations

- English fertility (2.27) is higher than that of the other tokenizers in the benchmark (1.24 to 2.01), because the vocabulary is optimized for Turkic languages.