# Multrenizer

Multrenizer is a bilingual English-Turkish Unigram tokenizer built from scratch for Turkish morphology, Turkish-aware casing, and mixed TR-EN text.

## Links

- Repository: github.com/fzengin19/multrenizer
- Hugging Face: huggingface.co/fzengin18/multrenizer

## Why Multrenizer?
Standard multilingual tokenizers routinely break Turkish at poor boundaries, waste context on agglutinative suffixes, and mishandle the Turkish dotted/dotless I/i rule. Multrenizer is designed to fix those failure modes without discarding punctuation and chat-critical symbols.
Core design goals:

- Turkish-aware normalization: hardcoded `İ -> i` and `I -> ı` applied before Unicode normalization
- Apostrophe preservation: forms like `feature'ı`, `merge'lemek`, `İstanbul'da`, and `can't` keep `'` as a real token
- Compact vocabulary budget: `~26K` target vocab for a Turkish-first bilingual tokenizer
- Fixed utility budget: dedicated punctuation, emoji, math, currency, and chat symbols
- Code-switching support: trained on mixed TR-EN text instead of treating it as noise
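The first two bullets can be illustrated with a minimal sketch in plain Python (an illustration of the casing rule only, not the shipped normalizer):

```python
def turkish_fold(text: str) -> str:
    """Apply the hardcoded Turkish casing rule before generic lowercasing."""
    # İ -> i and I -> ı must happen BEFORE .lower(); Unicode's
    # locale-insensitive casing would otherwise produce the wrong letters.
    return text.replace("İ", "i").replace("I", "ı").lower()

print(turkish_fold("İstanbul"))  # istanbul
print(turkish_fold("IŞIK"))     # ışık
```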
## Benchmark Results

Evaluated on 5,000 Turkish sentences, 5,000 English sentences, and 500 code-switching sentences from the prepared corpus against 5 reference tokenizers.

Notes:

- Multrenizer's shipped local artifact is auto-read from `multrenizer-tokenizer/tokenizer.json`; the current released artifact is `25,917` tokens.
- Example token strings for byte-level models are shown as raw tokenizer pieces. Metrics are based on exact token counts, not prettified decoding.

### Compared Tokenizers
| Tokenizer | Source | Vocab Size | Algorithm | Type |
|---|---|---|---|---|
| Multrenizer | This project | 25,917 | Unigram | Bilingual EN-TR, purpose-built |
| Kumru-2B | vngrs-ai/Kumru-2B | 50,176 | BPE | Turkish LLM (VNGRS, Sep 2025, Mistral-based) |
| Turkcell-7B | TURKCELL/Turkcell-LLM-7b-v1 | 48,351 | BPE | Turkish LLM (Turkcell, Apr 2024, Mistral-based) |
| GPT-2 | openai-community/gpt2 | 50,257 | BPE | English-centric baseline (OpenAI, 2019) |
| Qwen-3 | Qwen/Qwen3-0.6B | 151,643 | BPE | Multilingual (Alibaba, 2025) |
| Mistral-3.1 | mistralai/Mistral-Small-3.1-24B-Base-2503 | 131,072 | BPE/SP | Multilingual (Mistral AI, Mar 2025) |
### Fertility, Compression, and Token Count
Lower fertility means fewer tokens per word. Higher compression means more characters carried per token.
| Metric | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|---|---|---|---|---|---|
| Vocab Size | 25,917 | 50,176 | 48,351 | 50,257 | 151,643 | 131,072 |
| TR Fertility | 1.627 | 1.649 | 1.917 | 3.785 | 2.616 | 2.384 |
| EN Fertility | 1.525 | 2.151 | 1.555 | 1.314 | 1.372 | 1.381 |
| CS Fertility | 1.756 | 1.923 | 1.832 | 3.475 | 2.445 | 2.479 |
| TR Compression | 4.783 | 4.719 | 4.060 | 2.056 | 2.976 | 3.265 |
| EN Compression | 4.148 | 2.942 | 4.068 | 4.816 | 4.610 | 4.580 |
| TR Total Tokens (5K) | 130,844 | 132,637 | 154,166 | 304,345 | 210,334 | 191,682 |
| EN Total Tokens (5K) | 157,027 | 221,420 | 160,121 | 135,235 | 141,275 | 142,196 |
| CS Total Tokens (500) | 5,525 | 6,050 | 5,762 | 10,933 | 7,693 | 7,799 |
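The two headline metrics reduce to simple ratios; a sketch (treating whitespace-separated words as the word count is an assumption about how benchmark.py splits words):

```python
def fertility(total_tokens: int, total_words: int) -> float:
    # Average tokens emitted per word; lower is better.
    return total_tokens / total_words

def compression(total_chars: int, total_tokens: int) -> float:
    # Average characters carried per token; higher is better.
    return total_chars / total_tokens

# Toy example: 5 words, 22 characters, tokenized into 8 pieces.
print(fertility(8, 5))     # 1.6
print(compression(22, 8))  # 2.75
```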
Current position:

- Best Turkish efficiency in this comparison set: TR fertility, TR compression, TR total tokens
- Best code-switching efficiency in this comparison set: CS fertility and CS total tokens
- Competitive English coverage for a Turkish-first tokenizer, but not better than English-native GPT-2 on EN-only token count
- Only tokenizer here that passes Turkish `I/i` normalization correctly

### Morphological Splitting
Total tokens needed to represent 10 difficult Turkish words:
| Tokenizer | Vocab Size | Total Tokens | Avg per Word |
|---|---|---|---|
| Multrenizer | 25,917 | 32 | 3.2 |
| Kumru-2B | 50,176 | 35 | 3.5 |
| Turkcell-7B | 48,351 | 38 | 3.8 |
| Mistral-3.1 | 131,072 | 71 | 7.1 |
| Qwen-3 | 151,643 | 73 | 7.3 |
| GPT-2 | 50,257 | 105 | 10.5 |
Selected examples:

```text
güzelleştirilmiş
  Multrenizer: güzel + leştirilmiş [2 tokens]
  Kumru-2B:    gÃ¼zel + leÅŸtirilmiÅŸ [2 tokens]
  Turkcell-7B: güzel + leştirilmiş [2 tokens]
  Qwen-3:      g + Ã¼z + elle + ÅŸtir + ilmiÅŸ [5 tokens]
  Mistral-3.1: g + Ã¼z + elle + ÅŸtir + ilmiÅŸ [5 tokens]
  GPT-2:       g + Ã¼ + z + elle + ÅŸ + t + ir + il + mi + ÅŸ [10 tokens]

İstanbul'da
  Multrenizer: istanbul + ' + da [3 tokens]
  Kumru-2B:    Ä°stanbul + ' + da [3 tokens]
  Turkcell-7B: İstanbul + ' + da [3 tokens]
  Qwen-3:      Ä° + stanbul + 'd + a [4 tokens]
  Mistral-3.1: Ä° + stanbul + 'd + a [4 tokens]
  GPT-2:       Ä + ° + stanbul + 'd + a [5 tokens]

Afyonkarahisarlılaştıramadıklarımızdan
  Multrenizer: afyonkarahisar + lı + laştı + r + ama + dıkları + mızda + n [8 tokens]
  Kumru-2B:    Af + yonkarahisar + lÄ± + laÅŸtÄ±r + ama + dÄ±k + larÄ±mÄ±z + dan [8 tokens]
  Turkcell-7B: Afyon + kar + ah + is + arlı + laştır + a + madık + larımızdan [9 tokens]
  Qwen-3:      Af + yon + kar + ah + is + ar + lÄ± + la + ÅŸt + Ä± + ram + ad + Ä±kl + ar + Ä±mÄ±z + dan [16 tokens]
  Mistral-3.1: Af + yon + kar + ah + is + arl + Ä± + laÅŸt + Ä± + ram + ad + Ä±klarÄ± + m + Ä± + zd + an [16 tokens]
  GPT-2:       Af + yon + kar + ah + is + arl + Ä± + la + ÅŸ + t + Ä± + ram + ad + Ä± + k + lar + Ä± + m + Ä± + z + dan [21 tokens]
```
### Turkish I/i Normalization

This is the critical locale-sensitive test: `İ` must lowercase to `i`, and `I` must lowercase to `ı`.

| Input | Expected | Multrenizer | Kumru-2B | Turkcell-7B | GPT-2 | Qwen-3 | Mistral-3.1 |
|---|---|---|---|---|---|---|---|
| İstanbul | istanbul | OK | FAIL | FAIL | FAIL | FAIL | FAIL |
| IŞIK | ışık | OK | FAIL | FAIL | FAIL | FAIL | FAIL |
| SIR | sır | OK | FAIL | FAIL | FAIL | FAIL | FAIL |
| İNSAN | insan | OK | FAIL | FAIL | FAIL | FAIL | FAIL |
| ISITMAK | ısıtmak | OK | FAIL | FAIL | FAIL | FAIL | FAIL |
| Score | | 8/8 | 0/8 | 0/8 | 0/8 | 0/8 | 0/8 |
Multrenizer is the only tokenizer in this comparison that handles Turkish casing correctly.
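The failure is easy to reproduce, because generic Unicode lowercasing is locale-insensitive; a quick demonstration in Python:

```python
# Standard lowercasing can never produce the dotless ı from I:
print("SIR".lower())       # sir  (Turkish expects 'sır')

# And İ lowercases to 'i' plus U+0307 (combining dot above),
# not to a plain single-character 'i':
print("İ".lower() == "i")  # False
print(len("İ".lower()))    # 2
```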
## Code-Switching

"Bu feature'ı implement ederken edge case'leri handle etmeyi unutmayalım."

```text
Multrenizer [15 tok] bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unutmaya | lım | .
Kumru-2B    [20 tok] Bu | fe | ature | ' | Ä± | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalÄ±m | .
Turkcell-7B [15 tok] Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .
GPT-2       [24 tok] Bu | feature | ' | Ä± | implement | ed | er | ken | edge | case | ' | ler | i | handle | et | me | yi | un | ut | may | al | Ä± | m | .
Qwen-3      [22 tok] Bu | feature | ' | Ä± | implement | ed | er | ken | edge | case | ' | leri | handle | et | m | ey | i | un | ut | may | alÄ±m | .
Mistral-3.1 [20 tok] Bu | feature | 'Ä± | implement | eder | ken | edge | case | ' | leri | handle | et | me | yi | un | ut | may | al | Ä±m | .
```

"merge'lemek istediğim branch conflict veriyor."

```text
Multrenizer [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
Kumru-2B    [14 tok] mer | ge | ' | lemek | istediÄŸim | b | ran | ch | con | f | lic | t | veriyor | .
Turkcell-7B [ 8 tok] merge | ' | lemek | istediğim | branch | conflict | veriyor | .
GPT-2       [16 tok] mer | ge | ' | lem | ek | is | ted | i | ÄŸ | im | branch | conflict | ver | iy | or | .
Qwen-3      [11 tok] merge | ' | lem | ek | istediÄŸ | im | branch | conflict | ver | iyor | .
Mistral-3.1 [13 tok] merge | ' | le | mek | ist | edi | ÄŸ | im | branch | conflict | ver | iyor | .
```
## Quick Start

### Installation

```bash
git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Use the shipped tokenizer locally

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']

print(tok.normalizer.normalize_str("IŞIK"))
# 'ışık'
```
### Load from Hugging Face

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("fzengin18/multrenizer")

encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']
```

If you use transformers, this also works:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer")
print(tok.tokenize("İstanbul'da güzel bir gün"))
```
### Train from scratch

```bash
# 1. Download and prepare corpus
python prepare_data.py --size medium

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push tokenizer files to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
    --repo-id fzengin18/multrenizer \
    --hf-token "$HF_TOKEN"
```

### Run benchmarks

```bash
python benchmark.py --tr-lines 5000 --en-lines 5000
```
## Architecture

### Pipeline

```text
Raw text
  -> Turkish I/i normalizer (Replace: İ->i, I->ı, i̇->i)
  -> Quote canonicalization (‘ ’ ʼ ＇ -> ')
  -> NFKC normalization
  -> Lowercase
  -> Strip whitespace
  -> Pre-tokenizer (whitespace + apostrophe + punctuation split)
  -> Unigram model (~26K target vocab)
  -> Post-processor (<s> ... </s>)
```
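The normalizer stages can be mimicked in plain Python to see their combined effect (a stdlib sketch of the normalization stages only; the real pipeline is defined in train_tokenizer.py and shipped inside tokenizer.json):

```python
import unicodedata

APOSTROPHES = ("\u2018", "\u2019", "\u02bc", "\uff07")  # ‘ ’ ʼ ＇

def normalize(text: str) -> str:
    # 1) Turkish I/i pre-pass, before any Unicode normalization
    text = text.replace("İ", "i").replace("I", "ı")
    # 2) Canonicalize curly/modifier/fullwidth apostrophes to '
    for q in APOSTROPHES:
        text = text.replace(q, "'")
    # 3) NFKC, then lowercase and strip
    return unicodedata.normalize("NFKC", text).lower().strip()

print(normalize("İstanbul’da  "))  # istanbul'da
```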
### Data Mix
The released artifact is trained with the default file-based interleave in train_tokenizer.py, which approximates:
| Stream | Share | Purpose |
|---|---|---|
| Turkish | ~60% | Core Turkish morphology |
| English | ~30% | English coverage |
| Code-switching | ~10% | TR-EN boundary handling |
Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.
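The interleave idea can be sketched as a weighted round-robin over the three corpora (an illustration of the 60/30/10 mix, not the exact file-based logic in train_tokenizer.py):

```python
from itertools import islice

def interleave(streams, weights):
    """Yield lines from each stream in proportion to integer weights."""
    iters = [iter(s) for s in streams]
    while True:
        emitted = False
        for it, w in zip(iters, weights):
            chunk = list(islice(it, w))
            if chunk:
                emitted = True
            yield from chunk
        if not emitted:
            return

# Weights 6/3/1 approximate the 60/30/10 TR/EN/CS shares.
mix = list(interleave([["tr"] * 12, ["en"] * 6, ["cs"] * 2], (6, 3, 1)))
print(mix[:10])  # ['tr', 'tr', 'tr', 'tr', 'tr', 'tr', 'en', 'en', 'en', 'cs']
```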
Exact source configs used during corpus preparation:

- `wikimedia/wikipedia` with `20231101.tr`
- `wikimedia/wikipedia` with `20231101.en`
- `Helsinki-NLP/opus-100` with `en-tr`

The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry.
### Vocabulary Budget

Multrenizer is designed around a 26,000 target vocabulary, with a fixed budget reserved for always-preserved tokens:

- `32` named special tokens
- `512` reserved tokens
- `292` utility tokens
- up to `25,164` learned subword tokens

Current shipped artifact: 25,917 total tokens.
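As a sanity check, the budget lines sum exactly to the 26,000 target:

```python
# Fixed budget components from the list above.
special, reserved, utility, learned = 32, 512, 292, 25_164

total = special + reserved + utility + learned
print(total)  # 26000
```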
### Special Tokens

| Category | IDs | Tokens | Purpose |
|---|---|---|---|
| Core | 0-3 | `<unk>` `<s>` `</s>` `<pad>` | Basic tokenizer operation |
| Chat | 4-8 | `<\|system\|>` `<\|user\|>` `<\|assistant\|>` `<\|end\|>` `<\|sep\|>` | Instruction tuning and chat models |
| Reasoning | 9-12 | `<think>` `</think>` `<\|step\|>` `<\|reflection\|>` | Reasoning traces and self-check markers |
| Tool Use | 13-16 | `<tool_call>` `</tool_call>` `<tool_response>` `</tool_response>` | Tool and function calling |
| Code/FIM | 17-20 | `<\|code\|>` `<\|fim_prefix\|>` `<\|fim_middle\|>` `<\|fim_suffix\|>` | Code and fill-in-middle workflows |
| Bilingual | 21-22 | `<\|tr\|>` `<\|en\|>` | Language tags |
| RAG | 23-24 | `<\|context\|>` `<\|/context\|>` | Retrieval boundaries |
| Multi-modal | 25-28 | `<\|image\|>` `<\|audio\|>` `<\|video\|>` `<\|file\|>` | Placeholder tokens |
| Structured | 29-31 | `<\|json\|>` `<\|table\|>` `<\|cite\|>` | Structured output markers |
| Reserved | 32-543 | `<\|reserved_0\|>` ... `<\|reserved_511\|>` | Future growth without retraining |
| Utility | 544+ | Punctuation, emoji, math, currency, typography | Critical text symbols kept intact |
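One way the chat tokens could be combined at inference or training time; the repository does not prescribe a specific chat template, so treat this as a hypothetical illustration:

```python
def format_chat_turn(system: str, user: str) -> str:
    # Hypothetical template built from the Chat special tokens above;
    # the actual template, if any, is up to the downstream model.
    return (
        f"<|system|>{system}<|end|>"
        f"<|user|>{user}<|end|>"
        f"<|assistant|>"
    )

print(format_chat_turn("Kısa cevap ver.", "İstanbul nerede?"))
```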
### Utility Tokens

| Category | Count | Examples |
|---|---|---|
| Punctuation | 31 | `. , ! ? ; : - ( ) [ ] { } / \ " ' ...` |
| Currency & Business | 15 | ₺ $ € £ ¥ % @ # & |
| Math & Science | 25 | ± × ÷ √ ≤ ≥ ≠ ∞ π α β γ |
| Arrows & Symbols | 15 | → ← ↑ ↓ • ✓ ★ © ® ™ |
| Typography | 10 | « » – — „ “ ‹ › † … |
| Emoji (faces) | 70 | 😀 😂 🤣 😍 😎 🤔 😭 😡 😱 🤗 |
| Emoji (hands) | 28 | 👍 👎 👏 🙌 🙏 💪 ✊ ✌️ |
| Emoji (hearts) | 18 | ❤️ 💙 💚 💛 💜 🖤 💕 |
| Emoji (symbols) | 36 | 🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🎉 |
| Emoji (objects) | 36 | 💻 📱 🎯 📚 📈 ⌚ 🔑 💰 |
| Emoji (flags) | 8 | 🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇪🇸 🇮🇹 🇯🇵 |
## Project Structure

```text
multrenizer/
├── multrenizer-tokenizer/    # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py           # Corpus download and preparation
├── train_tokenizer.py        # Tokenizer training script
├── benchmark.py              # Benchmark against 5 reference tokenizers
├── benchmark_results.json    # Full benchmark output
├── tests/                    # Regression tests for tokenizer behavior
├── requirements.txt
└── pyproject.toml
```
## References
- Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
- Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark
- Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE
- Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
## License
Apache 2.0