Kabyle BPE Tokenizer v2
A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.
Overview
| Property | Value |
|---|---|
| Language | Kabyle / Taqbaylit (kab) |
| Script | Latin (Mammeri orthography) |
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocab size | 48,000 |
| Normalizer | NFC + Lowercase |
| Pre-tokenizer | Metaspace + Punctuation split |
| Model max length | 512 |
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")
text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)
print(tokens)
# ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
# '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']
Design Decisions
1. NFC Unicode Normalization
Kabyle special characters (ẓ, ɣ, ɛ, ḥ…) can be encoded in two ways in UTF-8: as a single precomposed codepoint, or as a base character + combining diacritic. Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization to guarantee consistency.
2. Metaspace Pre-tokenizer
Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as ▁, and all Unicode characters — including Kabyle-specific extended Latin letters — remain as single atomic units. This eliminates the byte-level garbling (áºĵ for ẓ, ÉĽ for ɛ) seen in byte-level tokenizers.
3. Hyphen Preserved as Morphological Marker
In Kabyle, the hyphen (-) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: yewwi-t-id), directional clitics (yeffeɣ-d), and possessive suffixes (tamurt-nneɣ). The punctuation splitter explicitly excludes - so these forms are pre-tokenized as single units and BPE can learn them correctly.
4. CLDR-Grounded Alphabet
The initial_alphabet is derived exclusively from the Unicode CLDR kab.xml exemplar character set:
Core: a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ
Auxiliary: o v
Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.
5. Distinct Special Tokens
Each special token has exactly one role:
| Token | Role |
|---|---|
<s> |
Beginning of sequence |
</s> |
End of sequence |
<unk> |
Unknown token |
<pad> |
Padding |
<mask> |
Masked token (for MLM) |
Benchmark
Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):
| Tokenizer | Vocab | Avg fertility ↓ | Design |
|---|---|---|---|
boffire/kabyle-gpt2-tokenizer |
50,257 | 2.058 | ByteLevel BPE, broken Kabyle chars |
Hillal-titouh/kabyle-bpe-tokenizer |
50,257 | 1.887 | No NFC, no mask token |
boffire/kabyle-bpe-tokenizer-v2 |
48,000 | 1.736 | ✅ NFC, CLDR alphabet, clean punctuation |
boffire/bpe-tokenizer-for-kabyle |
48,000 | 1.476 | Absorbs punctuation into words |
Note on BPE (boffire) fertility: Its lower fertility is structural — it absorbs punctuation marks into word tokens (e.g.
berra.as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. This tokenizer separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.
Sample tokenization
Yewwi-t-id ukan i gma-s.
→ ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .
Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
→ ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | .
Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
→ ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .
Training Data
Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.
Limitations
- Lowercase only. The normalizer lowercases all input. Kabyle capitalization is positional (sentence-initial) and carries no morphological information, so this is lossless for NLP tasks. Casing cannot be recovered from token IDs.
- Corpus size. At 20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix
-nneɣafter varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus will improve this. - No Tifinagh or Arabic script support. This tokenizer covers the Latin orthography only.
Citation
If you use this tokenizer in your work, please cite it as:
@misc{boffire2026kabyle,
author = {boffire},
title = {Kabyle BPE Tokenizer v2},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
}