Multrenizer

Multrenizer is a bilingual English-Turkish Unigram tokenizer built from scratch for Turkish morphology, Turkish-aware casing, and mixed TR-EN text.

Why Multrenizer?

Standard multilingual tokenizers routinely break Turkish at poor boundaries, waste context on agglutinative suffixes, and mishandle the Turkish dotted/dotless I/i rule. Multrenizer is designed to fix those failure modes without discarding punctuation and chat-critical symbols.

Core design goals:

  • Turkish-aware normalization: hardcoded İ -> i and I -> ı before Unicode normalization
  • Apostrophe preservation: in forms like feature'ı, merge'lemek, İstanbul'da, and can't, the apostrophe ' is kept as a real token
  • Compact vocabulary budget: ~26K target vocab for a Turkish-first bilingual tokenizer
  • Fixed utility budget: dedicated punctuation, emoji, math, currency, and chat symbols
  • Code-switching support: trained on mixed TR-EN text instead of treating it as noise

Benchmark Results

Evaluated on 5,000 Turkish sentences, 5,000 English sentences, and 500 code-switching sentences from the prepared corpus against 5 reference tokenizers.

Notes:

  • Multrenizer's shipped local artifact is auto-read from multrenizer-tokenizer/tokenizer.json; the current released artifact has a vocabulary of 25,917 tokens.
  • Example token strings for byte-level models are shown as raw tokenizer pieces. Metrics are based on exact token counts, not prettified decoding.

Compared Tokenizers

Tokenizer    Source                             Vocab Size  Algorithm  Type
Multrenizer  This project                           25,917  Unigram    Bilingual EN-TR, purpose-built
Kumru-2B     vngrs-ai/Kumru-2B                      50,176  BPE        Turkish LLM (VNGRS, Sep 2025, Mistral-based)
Turkcell-7B  TURKCELL/Turkcell-LLM-7b-v1            48,351  BPE        Turkish LLM (Turkcell, Apr 2024, Mistral-based)
GPT-2        openai-community/gpt2                  50,257  BPE        English-centric baseline (OpenAI, 2019)
Qwen-3       Qwen/Qwen3-0.6B                       151,643  BPE        Multilingual (Alibaba, 2025)
Mistral-3.1  mistralai/Mistral-Small-3.1-24B-Base-2503     131,072  BPE/SP  Multilingual (Mistral AI, Mar 2025)

Fertility, Compression, and Token Count

Fertility is the average number of tokens per word (lower is better). Compression is the average number of characters carried per token (higher is better).

Metric                 Multrenizer  Kumru-2B  Turkcell-7B    GPT-2   Qwen-3  Mistral-3.1
Vocab Size                  25,917    50,176       48,351   50,257  151,643      131,072
TR Fertility                 1.627     1.649        1.917    3.785    2.616        2.384
EN Fertility                 1.525     2.151        1.555    1.314    1.372        1.381
CS Fertility                 1.756     1.923        1.832    3.475    2.445        2.479
TR Compression               4.783     4.719        4.060    2.056    2.976        3.265
EN Compression               4.148     2.942        4.068    4.816    4.610        4.580
TR Total Tokens (5K)       130,844   132,637      154,166  304,345  210,334      191,682
EN Total Tokens (5K)       157,027   221,420      160,121  135,235  141,275      142,196
CS Total Tokens (500)        5,525     6,050        5,762   10,933    7,693        7,799
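
For reference, the fertility and compression numbers above can be recomputed from the shipped artifact with a few lines of Python. This is a minimal sketch assuming a plain one-sentence-per-line evaluation file; benchmark.py remains the authoritative implementation and may differ in detail.

from tokenizers import Tokenizer

def fertility_and_compression(tokenizer_path, sentences):
    """Tokens per whitespace word (fertility) and characters per token (compression)."""
    tok = Tokenizer.from_file(tokenizer_path)
    total_tokens = total_words = total_chars = 0
    for sentence in sentences:
        # Skip <s>/</s> so only learned pieces are counted.
        ids = tok.encode(sentence, add_special_tokens=False).ids
        total_tokens += len(ids)
        total_words += len(sentence.split())
        total_chars += len(sentence)
    return total_tokens / total_words, total_chars / total_tokens

# Hypothetical evaluation file, one sentence per line:
# sentences = open("data/tr_eval.txt", encoding="utf-8").read().splitlines()
# fertility, compression = fertility_and_compression("multrenizer-tokenizer/tokenizer.json", sentences)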

Current position:

  • Best Turkish efficiency in this comparison set: TR fertility, TR compression, TR total tokens
  • Best code-switching efficiency in this comparison set: CS fertility and CS total tokens
  • Competitive English coverage for a Turkish-first tokenizer, but not better than English-native GPT-2 on EN-only token count
  • Only tokenizer here that passes Turkish I/i normalization correctly

Morphological Splitting

Total tokens needed to represent 10 difficult Turkish words:

Tokenizer    Vocab Size  Total Tokens  Avg per Word
Multrenizer      25,917            32           3.2
Kumru-2B         50,176            35           3.5
Turkcell-7B      48,351            38           3.8
Mistral-3.1     131,072            71           7.1
Qwen-3          151,643            73           7.3
GPT-2            50,257           105          10.5
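
These per-word splits are easy to inspect locally. A small sketch follows; the word list here is a hypothetical sample, not the benchmark's full list of 10 words.

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

# Hypothetical sample; the benchmark uses its own list of 10 difficult words.
words = ["güzelleştirilmiş", "İstanbul'da", "Afyonkarahisarlılaştıramadıklarımızdan"]

for word in words:
    pieces = tok.encode(word, add_special_tokens=False).tokens
    print(f"{word}: {' + '.join(pieces)}  [{len(pieces)} tokens]")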

Selected examples:

güzelleştirilmiş
  Multrenizer: güzel + leştirilmiş                                   [2 tokens]
  Kumru-2B: gÃ¼zel + leÅŸtirilmiÅŸ                                   [2 tokens]
  Turkcell-7B: güzel + leştirilmiş                                   [2 tokens]
  Qwen-3: g + Ã¼z + elle + ÅŸtir + ilmiÅŸ                            [5 tokens]
  Mistral-3.1: g + Ã¼z + elle + ÅŸtir + ilmiÅŸ                       [5 tokens]
  GPT-2: g + Ã¼ + z + elle + ÅŸ + t + ir + il + mi + ÅŸ              [10 tokens]

İstanbul'da
  Multrenizer: istanbul + ' + da                                     [3 tokens]
  Kumru-2B: Ä°stanbul + ' + da                                       [3 tokens]
  Turkcell-7B: İstanbul + ' + da                                     [3 tokens]
  Qwen-3: Ä° + stanbul + 'd + a                                      [4 tokens]
  Mistral-3.1: Ä° + stanbul + 'd + a                                 [4 tokens]
  GPT-2: Ä + ° + stanbul + 'd + a                                    [5 tokens]

Afyonkarahisarlılaştıramadıklarımızdan
  Multrenizer: afyonkarahisar + lı + laştı + r + ama + dıkları + mızda + n   [8 tokens]
  Kumru-2B: Af + yonkarahisar + lÄ± + laÅŸtÄ±r + ama + dÄ±k + larÄ±mÄ±z + dan   [8 tokens]
  Turkcell-7B: Afyon + kar + ah + is + arlı + laştır + a + madık + larımızdan   [9 tokens]
  Qwen-3: Af + yon + kar + ah + is + ar + lÄ± + la + ÅŸt + Ä± + ram + ad + Ä±kl + ar + Ä±mÄ±z + dan   [16 tokens]
  Mistral-3.1: Af + yon + kar + ah + is + arl + Ä± + laÅŸt + Ä± + ram + ad + Ä±klarÄ± + m + Ä± + zd + an   [16 tokens]
  GPT-2: Af + yon + kar + ah + is + arl + Ä± + la + ÅŸ + t + Ä± + ram + ad + Ä± + k + lar + Ä± + m + Ä± + z + dan   [21 tokens]

Turkish I/i Normalization

This is the critical locale-sensitive test:

  • İ must lowercase to i
  • I must lowercase to ı

Input     Expected  Multrenizer  Kumru-2B  Turkcell-7B  GPT-2  Qwen-3  Mistral-3.1
İstanbul  istanbul  OK           FAIL      FAIL         FAIL   FAIL    FAIL
IŞIK      ışık      OK           FAIL      FAIL         FAIL   FAIL    FAIL
SIR       sır       OK           FAIL      FAIL         FAIL   FAIL    FAIL
İNSAN     insan     OK           FAIL      FAIL         FAIL   FAIL    FAIL
ISITMAK   ısıtmak   OK           FAIL      FAIL         FAIL   FAIL    FAIL
Score               8/8          0/8       0/8          0/8    0/8     0/8

Multrenizer is the only tokenizer in this comparison that handles Turkish casing correctly.
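
The failure mode is easy to reproduce: plain Unicode lowercasing is locale-agnostic, so İ becomes 'i' plus a combining dot and I never becomes ı. A short illustration in plain Python (turkish_lower here is only an illustrative helper, not the project's implementation):

# Locale-agnostic Unicode lowercasing gets both Turkish rules wrong:
print("İSTANBUL".lower())   # 'i̇stanbul' -> İ turns into 'i' + U+0307 (combining dot above)
print("ISITMAK".lower())    # 'isitmak'  -> I turns into plain 'i', never 'ı'

# The fix is an explicit mapping applied before lowercasing:
def turkish_lower(text: str) -> str:
    return text.replace("İ", "i").replace("I", "ı").lower()

print(turkish_lower("IŞIK"))  # 'ışık'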

Code-Switching

"Bu feature'ฤฑ implement ederken edge case'leri handle etmeyi unutmayalฤฑm."

  Multrenizer  [15 tok]  bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unutmaya | lım | .
  Kumru-2B     [20 tok]  Bu | fe | ature | ' | Ä± | imp | lement | ederken | ed | ge | cas | e | ' | leri | hand | le | etmeyi | unutma | yalÄ±m | .
  Turkcell-7B  [15 tok]  Bu | feature | ' | ı | implement | ederken | edge | case | ' | leri | handle | etmeyi | unut | mayalım | .
  GPT-2        [24 tok]  Bu | feature | ' | Ä± | implement | ed | er | ken | edge | case | ' | ler | i | handle | et | me | yi | un | ut | may | al | Ä± | m | .
  Qwen-3       [22 tok]  Bu | feature | ' | Ä± | implement | ed | er | ken | edge | case | ' | leri | handle | et | m | ey | i | un | ut | may | alÄ±m | .
  Mistral-3.1  [20 tok]  Bu | feature | 'Ä± | implement | eder | ken | edge | case | ' | leri | handle | et | me | yi | un | ut | may | al | Ä±m | .

"merge'lemek istediฤŸim branch conflict veriyor."

  Multrenizer  [ 8 tok]  merge | ' | lemek | istediğim | branch | conflict | veriyor | .
  Kumru-2B     [14 tok]  mer | ge | ' | lemek | istediÄŸim | b | ran | ch | con | f | lic | t | veriyor | .
  Turkcell-7B  [ 8 tok]  merge | ' | lemek | istediğim | branch | conflict | veriyor | .
  GPT-2        [16 tok]  mer | ge | ' | lem | ek | is | ted | i | ÄŸ | im | branch | conflict | ver | iy | or | .
  Qwen-3       [11 tok]  merge | ' | lem | ek | istediÄŸ | im | branch | conflict | ver | iyor | .
  Mistral-3.1  [13 tok]  merge | ' | le | mek | ist | edi | ÄŸ | im | branch | conflict | ver | iyor | .

Quick Start

Installation

git clone https://github.com/fzengin19/multrenizer.git
cd multrenizer
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Use the shipped tokenizer locally

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']

print(tok.normalizer.normalize_str("IŞIK"))
# 'ışık'

Load from Hugging Face

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("fzengin18/multrenizer")

encoded = tok.encode("İstanbul'da güzel bir gün")
print(encoded.tokens)
# ['<s>', 'istanbul', "'", 'da', 'güzel', 'bir', 'gün', '</s>']

If you use transformers, this also works:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fzengin18/multrenizer")
print(tok.tokenize("İstanbul'da güzel bir gün"))

Train from scratch

# 1. Download and prepare corpus
python prepare_data.py --size medium

# 2. Train tokenizer
python train_tokenizer.py --data-dir data/

# 3. Optional: push tokenizer files to Hugging Face Hub
python train_tokenizer.py --data-dir data/ \
  --repo-id fzengin18/multrenizer \
  --hf-token "$HF_TOKEN"

Run benchmarks

python benchmark.py --tr-lines 5000 --en-lines 5000

Architecture

Pipeline

Raw text
  -> Turkish I/i normalizer (Replace: İ->i, I->ı, i̇->i)
  -> Quote canonicalization (’ ‘ ʼ ＇ -> ')
  -> NFKC normalization
  -> Lowercase
  -> Strip whitespace
  -> Pre-tokenizer (whitespace + apostrophe + punctuation split)
  -> Unigram model (~26K target vocab)
  -> Post-processor (<s> ... </s>)
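
As a rough sketch, the same pipeline shape can be assembled from Hugging Face tokenizers building blocks. This is illustrative only; the exact component list and options live in train_tokenizer.py and the shipped tokenizer.json, and the training file names below are hypothetical.

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

tok = Tokenizer(models.Unigram())

# Turkish casing and quote canonicalization run before NFKC, lowercasing, and stripping.
tok.normalizer = normalizers.Sequence([
    normalizers.Replace("İ", "i"),
    normalizers.Replace("I", "ı"),
    normalizers.Replace("i̇", "i"),
    normalizers.Replace("’", "'"),
    normalizers.Replace("‘", "'"),
    normalizers.Replace("ʼ", "'"),
    normalizers.Replace("＇", "'"),
    normalizers.NFKC(),
    normalizers.Lowercase(),
    normalizers.Strip(),
])

# Split on whitespace and isolate punctuation (including ') as standalone pieces.
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Whitespace(),
    pre_tokenizers.Punctuation(),
])

# Wrap every encoded sequence in <s> ... </s>.
tok.post_processor = processors.TemplateProcessing(
    single="<s> $A </s>",
    special_tokens=[("<s>", 1), ("</s>", 2)],
)

trainer = trainers.UnigramTrainer(
    vocab_size=26_000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
)
# tok.train(["data/tr.txt", "data/en.txt", "data/cs.txt"], trainer)  # hypothetical file names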

Data Mix

The released artifact is trained with the default file-based interleave in train_tokenizer.py, which approximates:

Stream          Share  Purpose
Turkish         ~60%   Core Turkish morphology
English         ~30%   English coverage
Code-switching  ~10%   TR-EN boundary handling
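
A minimal sketch of what such a file-based 60/30/10 interleave can look like (the helper and stream names are hypothetical; the released artifact uses the default logic in train_tokenizer.py):

import itertools

def interleave_streams(tr_lines, en_lines, cs_lines):
    """Yield lines in a repeating 6:3:1 pattern to approximate a 60/30/10 mix."""
    streams = {"tr": iter(tr_lines), "en": iter(en_lines), "cs": iter(cs_lines)}
    pattern = ["tr"] * 6 + ["en"] * 3 + ["cs"]
    for name in itertools.cycle(pattern):
        line = next(streams[name], None)
        if line is None:
            return  # stop once any stream is exhausted; good enough for a sketch
        yield line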

Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.

Exact source configs used during corpus preparation:

  • wikimedia/wikipedia with 20231101.tr
  • wikimedia/wikipedia with 20231101.en
  • Helsinki-NLP/opus-100 with en-tr

The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry.

Vocabulary Budget

Multrenizer is designed around a 26,000-token target vocabulary, with a fixed budget reserved for always-preserved tokens:

  • 32 named special tokens
  • 512 reserved tokens
  • 292 utility tokens
  • up to 25,164 learned subword tokens

Current shipped artifact: 25,917 total tokens.
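
The budget arithmetic, spelled out (Unigram training can converge slightly below the target, which is consistent with the shipped artifact holding 25,917 entries rather than the full 26,000):

TARGET_VOCAB = 26_000
SPECIAL, RESERVED, UTILITY = 32, 512, 292   # named specials + reserved slots + utility symbols

learned_budget = TARGET_VOCAB - (SPECIAL + RESERVED + UTILITY)
print(learned_budget)                       # 25164 -> "up to 25,164 learned subword tokens"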

Special Tokens

Category     IDs     Tokens                                                      Purpose
Core         0-3     <unk> <s> </s> <pad>                                        Basic tokenizer operation
Chat         4-8     <|system|> <|user|> <|assistant|> <|end|> <|sep|>           Instruction tuning and chat models
Reasoning    9-12    <think> </think> <|step|> <|reflection|>                    Reasoning traces and self-check markers
Tool Use     13-16   <tool_call> </tool_call> <tool_response> </tool_response>   Tool and function calling
Code/FIM     17-20   <|code|> <|fim_prefix|> <|fim_middle|> <|fim_suffix|>       Code and fill-in-middle workflows
Bilingual    21-22   <|tr|> <|en|>                                               Language tags
RAG          23-24   <|context|> <|/context|>                                    Retrieval boundaries
Multi-modal  25-28   <|image|> <|audio|> <|video|> <|file|>                      Placeholder tokens
Structured   29-31   <|json|> <|table|> <|cite|>                                 Structured output markers
Reserved     32-543  <|reserved_0|> ... <|reserved_511|>                         Future growth without retraining
Utility      544+    Punctuation, emoji, math, currency, typography              Critical text symbols kept intact
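
The IDs in this table can be checked directly against the shipped artifact, for example:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("multrenizer-tokenizer/tokenizer.json")

# Look up a few named special tokens; the printed IDs should match the table above.
for token in ["<unk>", "<s>", "</s>", "<|user|>", "<think>", "<|tr|>"]:
    print(token, "->", tok.token_to_id(token))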

Utility Tokens

Category             Count  Examples
Punctuation          31     . , ! ? ; : - ( ) [ ] { } / \ " ' ...
Currency & Business  15     ₺ $ € £ ¥ % @ # &
Math & Science       25     ± × ÷ ≠ ≤ ≥ ∞ √ π α β γ
Arrows & Symbols     15     → ← ↑ ↓ • ★ ☆ ✓ ✗ © ® ™
Typography           10     « » “ ” ‘ ’ ‹ › „ ‚
Emoji (faces)        70     😀 😂 🤣 😊 😍 🤔 😭 😡 💀 🤖
Emoji (hands)        28     👋 👍 👎 👏 🙏 💪 ✊ ✌️
Emoji (hearts)       18     ❤️ 💛 💚 💙 💜 🖤 💔
Emoji (symbols)      36     🔥 ✨ ⭐ ✅ ❌ ⚠️ 💯 🚀
Emoji (objects)      36     💻 📱 🎯 🏆 📊 ☕ 🔗 💰
Emoji (flags)        8      🇹🇷 🇺🇸 🇬🇧 🇩🇪 🇫🇷 🇪🇸 🇮🇹 🇯🇵

Project Structure

multrenizer/
├── multrenizer-tokenizer/     # Trained tokenizer artifact
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── prepare_data.py            # Corpus download and preparation
├── train_tokenizer.py         # Tokenizer training script
├── benchmark.py               # Benchmark against 5 reference tokenizers
├── benchmark_results.json     # Full benchmark output
├── tests/                     # Regression tests for tokenizer behavior
├── requirements.txt
└── pyproject.toml

License

Apache 2.0
