---
language:
  - tr
  - kk
  - ky
  - uz
  - ug
  - ba
  - tt
  - az
  - crh
  - tk
license: apache-2.0
tags:
  - tokenizer
  - sentencepiece
  - bpe
  - turkic
  - multilingual
library_name: transformers
---

# Pan-Turkic BPE Tokenizer

A SentencePiece BPE tokenizer with a 65,536-token vocabulary, purpose-built for the Turkic language family. It covers the Latin, Cyrillic, and Arabic scripts used across Turkic languages.

## Overview

Many widely used tokenizers perform poorly on Turkic languages other than Turkish. The gap is largest for Cyrillic-script languages such as Kazakh, Kyrgyz, Bashkir, and Tatar, where tokenizers such as GPT-2 and Kumru-2B fall back to near byte-level tokenization. This tokenizer was trained on a pan-Turkic corpus covering 20+ languages and handles the Latin, Cyrillic, and Arabic scripts natively.

**Languages with strong coverage:**

| Language | Script |
|---|---|
| Turkish | Latin |
| Kazakh | Cyrillic |
| Kyrgyz | Cyrillic |
| Uzbek | Latin |
| Uyghur | Arabic |
| Bashkir | Cyrillic |
| Tatar | Cyrillic |
| Azerbaijani | Latin |
| Crimean Tatar | Latin |
| Turkmen | Latin |

## FLORES-200 Fertility Benchmark

Fertility = average tokens per word (lower is better). Evaluated on 1,012 sentences per language from the FLORES-200 devtest set.
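
The fertility definition above can be sketched in a few lines. This is a minimal illustration and not the evaluation script used for the table: it assumes words are whitespace-delimited and that the average is taken over the whole corpus (total tokens divided by total words). The `fertility` and `toy_tokenize` names are hypothetical.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        # Assumption: a "word" is a whitespace-separated chunk.
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for illustration: 2-character chunks per word."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


sample = ["Алматы үлкен қала", "Türkiye Cumhuriyeti"]
print(fertility(sample, toy_tokenize))
```

To measure a real tokenizer, pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` as `tokenize`, so that special tokens do not inflate the count.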

| Language | **Ours** | Kumru-2B | GPT-2 | mT5 | NLLB-200 | XLM-R |
|---|---|---|---|---|---|---|
| Turkish | 1.78 | **1.59** | 3.79 | 2.16 | 2.00 | 1.83 |
| Kazakh (Cyrl) | **1.79** | 10.96 | 9.25 | 2.35 | 2.09 | 2.04 |
| Kyrgyz (Cyrl) | **1.73** | 11.18 | 8.95 | 2.57 | 2.21 | 2.07 |
| Uzbek (Latn) | **1.96** | 3.40 | 3.44 | 2.57 | 2.24 | 2.26 |
| Uyghur (Arab) | **1.72** | 9.42 | 10.92 | 4.91 | 2.45 | 2.46 |
| Bashkir (Cyrl) | **1.92** | 10.93 | 9.07 | 3.01 | 2.10 | 3.52 |
| Tatar (Cyrl) | **1.88** | 10.74 | 8.72 | 2.63 | 2.06 | 3.07 |
| Azerbaijani (Latn) | **1.72** | 3.34 | 4.92 | 2.40 | 2.16 | 1.86 |
| Crimean Tatar (Latn) | 2.19 | 2.61 | 3.75 | 2.49 | **2.14** | 2.36 |
| Turkmen (Latn) | 2.48 | 3.56 | 4.27 | 3.18 | **2.33** | 3.05 |
| **Turkic Avg (10 langs)** | **1.92** | 6.77 | 6.71 | 2.83 | 2.18 | 2.45 |
| English | 2.27 | 2.01 | **1.24** | 1.55 | 1.41 | 1.41 |

Vocab sizes: Ours 65,536 · Kumru-2B 50,176 · GPT-2 50,257 · mT5 250,100 · NLLB-200 256,204 · XLM-R 250,002

**Key result:** Lowest fertility on 7 of 10 Turkic languages. Average Turkic fertility is 1.92, compared with 2.18 for NLLB-200, whose vocabulary (256,204) is roughly 4× larger.

For Cyrillic-script Turkic languages (Kazakh, Kyrgyz, Bashkir, Tatar), Kumru-2B and GPT-2 degrade to near byte-level encoding (roughly 9–11 tokens/word). This tokenizer stays at 1.73–1.92 tokens/word on the same languages.
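
The summary figures can be rechecked directly from the table. The snippet below is only a sanity check on the published numbers: it copies the ten Turkic rows, recomputes the per-tokenizer averages, counts the languages where this tokenizer has the lowest fertility, and computes the vocabulary ratio against NLLB-200.

```python
# Fertility values copied from the benchmark table (10 Turkic languages).
COLUMNS = ["Ours", "Kumru-2B", "GPT-2", "mT5", "NLLB-200", "XLM-R"]
ROWS = {
    "Turkish":       [1.78, 1.59, 3.79, 2.16, 2.00, 1.83],
    "Kazakh":        [1.79, 10.96, 9.25, 2.35, 2.09, 2.04],
    "Kyrgyz":        [1.73, 11.18, 8.95, 2.57, 2.21, 2.07],
    "Uzbek":         [1.96, 3.40, 3.44, 2.57, 2.24, 2.26],
    "Uyghur":        [1.72, 9.42, 10.92, 4.91, 2.45, 2.46],
    "Bashkir":       [1.92, 10.93, 9.07, 3.01, 2.10, 3.52],
    "Tatar":         [1.88, 10.74, 8.72, 2.63, 2.06, 3.07],
    "Azerbaijani":   [1.72, 3.34, 4.92, 2.40, 2.16, 1.86],
    "Crimean Tatar": [2.19, 2.61, 3.75, 2.49, 2.14, 2.36],
    "Turkmen":       [2.48, 3.56, 4.27, 3.18, 2.33, 3.05],
}

# Unweighted mean over the 10 languages, rounded to two decimals like the table.
averages = {
    name: round(sum(row[i] for row in ROWS.values()) / len(ROWS), 2)
    for i, name in enumerate(COLUMNS)
}

# Number of languages where the first column ("Ours") is the row minimum.
wins = sum(1 for row in ROWS.values() if min(row) == row[0])

# Vocabulary sizes from the line under the table: NLLB-200 vs. this tokenizer.
vocab_ratio = 256_204 / 65_536

print(averages["Ours"], averages["NLLB-200"])  # 1.92 2.18
print(wins)                                    # 7
print(round(vocab_ratio, 2))                   # 3.91
```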

## Notable Examples

Morphologically complex Turkish words encode efficiently:

```python
# Reported token counts for sample words
# "Cumhurbaşkanlığı"                 -> 1 token   (Presidency)
# "yapamayacaklarından"              -> 3 tokens  (from those they cannot do)
# "sağlıklaştırılamayabileceklerden" -> 6 tokens
```
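
To check token counts like these yourself, it helps to look at the individual pieces. The helper below is a small sketch: `show_pieces` is a hypothetical name, and it relies only on the standard `encode` and `convert_ids_to_tokens` methods of a `transformers` tokenizer.

```python
from typing import List, Tuple


def show_pieces(tokenizer, text: str) -> List[Tuple[int, str]]:
    """Return (id, piece) pairs for `text`, excluding special tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    return list(zip(ids, pieces))


# Example use (downloads the tokenizer, so it is left commented out here):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer")
# for word in ["Cumhurbaşkanlığı", "yapamayacaklarından"]:
#     pairs = show_pieces(tokenizer, word)
#     print(word, len(pairs), [piece for _, piece in pairs])
```

Counts obtained this way may differ slightly from the figures above if the tokenizer adds a word-boundary marker as a separate piece.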

Perfect round-trip for all supported scripts:
```python
# encode -> decode is lossless for Latin, Cyrillic, and Arabic Turkic scripts
```
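
A round-trip claim like this is easy to spot-check. The sketch below is illustrative rather than the author's test: `find_round_trip_failures` is a hypothetical helper that takes any `encode`/`decode` pair and returns the inputs that do not survive. It is demonstrated with a toy UTF-8 byte codec so that it runs without downloading anything.

```python
from typing import Callable, Iterable, List


def find_round_trip_failures(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> List[str]:
    """Return every text for which decode(encode(text)) differs from text."""
    return [t for t in texts if decode(encode(t)) != t]


samples = [
    "Türkiye Cumhuriyeti",                    # Latin
    "Алматы Қазақстанның ең үлкен қаласы",    # Cyrillic
    "بىز ئۇيغۇر تىلىدە سۆزلىشىمىز",           # Arabic
]

# Toy codec for illustration: UTF-8 bytes, lossless by construction.
failures = find_round_trip_failures(
    samples,
    encode=lambda s: list(s.encode("utf-8")),
    decode=lambda ids: bytes(ids).decode("utf-8"),
)
print(failures)  # [] for the toy codec
```

With the real tokenizer, pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.decode`. If any failures appear, check whether they come from text normalization (Unicode normalization or whitespace handling) before treating them as data loss.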

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ArinUmut/pan-turkic-tokenizer")

# Turkish
tokenizer.encode("Türkiye Cumhuriyeti")

# Kazakh (Cyrillic)
tokenizer.encode("Алматы Қазақстанның ең үлкен қаласы")

# Uyghur (Arabic)
tokenizer.encode("بىز ئۇيغۇر تىلىدە سۆزلىشىمىز")
```

## Specs

| Property | Value |
|---|---|
| Type | SentencePiece BPE |
| Vocabulary size | 65,536 |
| Scripts | Latin, Cyrillic, Arabic |
| Languages trained on | 20+ Turkic languages |
| Benchmark | FLORES-200 devtest |

## Limitations

- English fertility (2.27) is the highest of all tokenizers in the benchmark (GPT-2 reaches 1.24), because the vocabulary is optimized for Turkic languages.