# vettu — Tamil Morpheme Tokenizer for HuggingFace
A linguistically-aware tokenizer for Tamil that decomposes words into their root morphemes and grammatical suffixes instead of arbitrary byte-pair subwords.
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "tamiltheorist/vettu-tokenizer",
    trust_remote_code=True,
    use_fast=False,
)

tok.tokenize("படிக்கிறான்")
# ['படி', 'கிற', 'ான்'] ← root + present-tense + 3sg.m marker

tok.tokenize("வீட்டிற்கு சென்றான்")
# ['வீடு', 'கு', 'செல்', 'ன்ற', 'ான்'] ← roots + dative + past marker
```
## Why morpheme tokenization for Tamil?
Tamil is an agglutinative language — a single word can encode root + tense + person + case + number + politeness, all fused together:
| Surface form | Gloss |
|---|---|
| படிக்கிறான் | padik (study) + kiṟāṉ (3sg.m present) |
| படித்தாள் | padik + tāḷ (3sg.f past) |
| படிப்பார்கள் | padik + pārkaḷ (3pl future) |
A BPE tokenizer sees three completely different token sequences for the same verb.
vettu always yields படி as the first token — the model learns shared semantics
across all inflections from the very first training step.
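The root-consistency property can be made concrete with a small self-contained sketch. Only the first split below is taken from this README; the other two are illustrative guesses that follow the same root + suffix pattern, not output from this tokenizer:

```python
# Hypothetical morpheme splits for three inflections of படி ("study").
splits = {
    "படிக்கிறான்":  ["படி", "கிற", "ான்"],    # present, 3sg.m (from the README)
    "படித்தாள்":    ["படி", "த்த", "ாள்"],     # past, 3sg.f (assumed split)
    "படிப்பார்கள்": ["படி", "ப்ப", "ார்கள்"],  # future, 3pl (assumed split)
}

# Root consistency: every inflection starts with the same root token,
# so the embedding for படி is shared across all surface forms.
roots = {morphs[0] for morphs in splits.values()}
print(roots)  # {'படி'}
```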
## Benchmark: NanoGPT trained on Tamil Wikipedia
Same model (4-layer GPT, 128-dim, 128 context), same corpus, same steps:
| Metric | BPE (4k vocab) | vettu |
|---|---|---|
| Vocab size | 4,000 | 8,420 |
| Best val loss | 5.70 | 4.62 |
| Perplexity | 299 | ★ 102 |
| Root consistency | 58% | 100% |
| Fertility (tokens/word) | 2.96 | 1.65 |
vettu reaches 3× lower perplexity on the same data.
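The perplexity row follows directly from the loss row, since perplexity is the exponential of the cross-entropy loss in nats. A quick sanity check of the table (not part of the benchmark code):

```python
import math

# Perplexity = exp(validation cross-entropy loss, in nats).
bpe_ppl   = math.exp(5.70)   # ≈ 299
vettu_ppl = math.exp(4.62)   # ≈ 101.5, reported as 102 in the table

print(round(bpe_ppl), round(vettu_ppl), round(bpe_ppl / vettu_ppl, 2))
```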
### Generated text comparison

**Prompt:** `தமிழ்நாட்டில்` ("in Tamil Nadu")

**BPE:** `, . ஆ ரம் , உ ற் பாலை வன த்துறை மற்றும் உ த்தி`
→ Fragmented sub-character pieces, no linguistic meaning.

**vettu:** `சென்னை இல் மாநிலம் இன் மாநில சென்னை ஐ மாவட்டம் இல் உள்ள து`
→ Proper Tamil words with case markers (இல் locative, இன் genitive, ஐ accusative).
## Installation

```shell
pip install "vettu>=1.0.4" transformers huggingface-hub
```
The vettu package is only needed for full morphological analysis of unseen words. The pre-built word cache covers the 15,000 most frequent Tamil Wikipedia words (~97% of corpus tokens); those words tokenize without vettu installed at runtime.
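The lookup order this implies can be sketched as a two-tier function: cached splits for frequent words, with an analyzer callback for everything else. This is a minimal illustration, not the repo's implementation; `analyze` stands in for vettu's analyzer, whose real API is not shown here:

```python
# Two-tier lookup: pre-built cache first, analyzer fallback second.
word_cache = {
    "படிக்கிறான்": ["படி", "கிற", "ான்"],  # split from the example above
}

def morphemes(word, cache=word_cache, analyze=None):
    if word in cache:            # ~97% of corpus tokens hit the cache
        return cache[word]
    if analyze is not None:      # full morphological analysis (vettu)
        return analyze(word)
    return [word]                # last resort: keep the word whole

print(morphemes("படிக்கிறான்"))  # cache hit
print(morphemes("newword"))      # no analyzer available: whole word
```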
## Usage

### Basic tokenization
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "tamiltheorist/vettu-tokenizer",
    trust_remote_code=True,
    use_fast=False,
)

# Tokenize
tokens = tok.tokenize("தமிழ்நாட்டில் வாழும் மக்கள்")
# ['தமிழ்நாடு', 'இல்', 'வாழ்', 'வ', 'மக்கள்']

# Encode to IDs
ids = tok.encode("படிக்கிறான்", add_special_tokens=True)
# [2, 5579, 2700, 8414, 3]  (<s> + morpheme IDs + </s>)

# Decode
tok.decode(ids)
# 'படி கிற ான்'
```
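Note that decoding returns the morphemes joined with spaces rather than the fused surface form; re-fusing them would require reversing Tamil sandhi. A plain join shows the gap:

```python
morphs = ["படி", "கிற", "ான்"]  # morphemes from the decode example above

detok = " ".join(morphs)
print(detok)                    # the space-joined form that decode returns
print(detok == "படிக்கிறான்")   # False: not the original fused surface word
```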
### With HuggingFace models
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained(
    "tamiltheorist/vettu-tokenizer",
    trust_remote_code=True,
    use_fast=False,
)
model = AutoModelForCausalLM.from_pretrained("your-tamil-gpt-model")

inputs = tok("தமிழ் இலக்கியம்", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(out[0]))
```
### Morpheme inspection
```python
tok.get_morphemes("பார்க்கிறீர்கள்")
# ['பார்', 'கிறீர்கள்']

tok.get_morphemes("நாட்டிற்காக")
# ['நாடு', 'இற்காக']
```
> **Note:** `use_fast=False` is required; there is no Rust/fast tokenizer for vettu.
## Vocabulary
- **Size:** ~8,400 tokens (7 special + morphemes)
- **Special tokens:** `<pad>` (0), `<unk>` (1), `<s>` (2), `</s>` (3), `<mask>` (4), `<sep>` (5), `<cls>` (6)
- **Coverage:** the top 15,000 most frequent Tamil Wikipedia words (~97% of corpus tokens) via the pre-built word cache; vettu handles unseen words via rule-based morphology
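Given the IDs above, the framing that `encode` applies in the Usage example is plain BOS/EOS wrapping. A sketch using the IDs quoted in this README (the morpheme IDs are specific to this repo's vocab):

```python
# Special-token framing around the morpheme IDs for "படிக்கிறான்".
BOS, EOS = 2, 3                    # <s> and </s> from the table above
morpheme_ids = [5579, 2700, 8414]  # படி, கிற, ான் (from the Usage example)

ids = [BOS] + morpheme_ids + [EOS]
print(ids)  # [2, 5579, 2700, 8414, 3]
```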
## Files in this repo
| File | Description |
|---|---|
| `vettu_tokenizer.py` | `VettuTokenizer(PreTrainedTokenizer)` implementation |
| `vocab.json` | Token → ID mapping (~8,400 entries) |
| `word_cache.json` | Pre-analysed word → morpheme list (15k words) |
| `tokenizer_config.json` | HuggingFace tokenizer config |
| `special_tokens_map.json` | Special token definitions |
## Citation
```bibtex
@software{vettu2024,
  author  = {Tamilarasan},
  title   = {vettu: Tamil Morpheme Tokenizer},
  year    = {2024},
  url     = {https://pypi.org/project/vettu/},
  version = {1.0.4}
}
```
## Related
- vettu on PyPI: https://pypi.org/project/vettu/
- Source: https://github.com/tamil-phy/tamil_tokenizer
- Benchmark code: `tamil-phy/tamil-nanogpt-bench`