metadata
language:
  - kab
  - ber
license: apache-2.0
tags:
  - tokenizer
  - bpe
  - kabyle
  - taqbaylit
  - tamazight
  - nlp
library_name: tokenizers
pipeline_tag: text-generation

Kabyle BPE Tokenizer v2

A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.

Overview

| Property | Value |
| --- | --- |
| Language | Kabyle / Taqbaylit (kab) |
| Script | Latin (Mammeri orthography) |
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocab size | 48,000 |
| Normalizer | NFC + Lowercase |
| Pre-tokenizer | Metaspace + Punctuation split |
| Model max length | 512 |

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")

text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)
print(tokens)
# ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
#  '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']

Design Decisions

1. NFC Unicode Normalization

Kabyle letters that carry a diacritic (ẓ, ḥ, ṭ, č…) can be encoded in two ways in Unicode: as a single precomposed code point, or as a base character followed by a combining diacritic. Letters such as ɣ and ɛ have only one form. Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization to guarantee consistency.
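The effect can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch of NFC itself, not the tokenizer's internal code; the helper name `normalize_kabyle` is hypothetical, and its lowercasing step simply mirrors the "NFC + Lowercase" normalizer listed in the overview.

```python
import unicodedata

def normalize_kabyle(text: str) -> str:
    """Hypothetical helper: compose to NFC, then lowercase."""
    return unicodedata.normalize("NFC", text).lower()

# Same word, two encodings of the letter z-with-dot-below.
precomposed = "tame\u1e93\u1e93yant"    # one code point per letter (U+1E93)
decomposed = "tamez\u0323z\u0323yant"   # z followed by combining dot below (U+0323)

# The strings render identically but are different code point sequences.
assert precomposed != decomposed
assert len(decomposed) == len(precomposed) + 2

# After normalization both collapse to the same sequence.
assert normalize_kabyle(precomposed) == normalize_kabyle(decomposed)
```

Without this step, the two spellings would reach the BPE model as different character sequences and could be split differently.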

2. Metaspace Pre-tokenizer

Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as ▁, and all Unicode characters, including Kabyle-specific extended Latin letters, remain single atomic units. This eliminates the byte-level garbling (áºĵ for ẓ, ÉĽ for ɛ) seen in byte-level tokenizers.
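To show where such garbled strings come from, the sketch below re-implements the byte-to-character table used by GPT-2-style byte-level tokenizers, in which every UTF-8 byte is displayed as one printable stand-in character. It is a standalone illustration and not part of this tokenizer; the function names are chosen for this example.

```python
def bytes_to_unicode() -> dict:
    """Map each byte value (0-255) to a printable character, GPT-2 style."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes are remapped to code points starting at U+0100.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

def byte_level_view(text: str) -> str:
    """Show text the way a byte-level vocabulary spells it."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))

print(byte_level_view("\u1e93"))  # ẓ (3 UTF-8 bytes) -> áºĵ
print(byte_level_view("\u025b"))  # ɛ (2 UTF-8 bytes) -> ÉĽ
```

A single Kabyle letter therefore occupies two or three byte-level symbols, whereas with Metaspace it stays one character.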

3. Hyphen Preserved as Morphological Marker

In Kabyle, the hyphen (-) is a morphological connector, not punctuation. It attaches verbal object clitics (yewwi-t-id), directional clitics (yeffeɣ-d), and possessive suffixes (tamurt-nneɣ). The punctuation splitter explicitly excludes - so that these forms are pre-tokenized as single units and BPE can learn merges that span the hyphen.
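The behaviour described above can be approximated in pure Python. The sketch below is not the tokenizer's actual configuration (which uses the Metaspace and punctuation pre-tokenizers); it is a regex-based stand-in, and the names `PUNCT_EXCEPT_HYPHEN` and `pre_tokenize` are hypothetical. It assumes the input has already been normalized and lowercased.

```python
import re

# Any character that is not a letter/digit, not whitespace, and not a hyphen
# is treated as punctuation and isolated into its own piece.
PUNCT_EXCEPT_HYPHEN = re.compile(r"([^\w\s-])")

def pre_tokenize(text: str) -> list:
    """Split on whitespace, isolate punctuation, keep hyphenated forms whole."""
    pieces = []
    for word in text.split():
        parts = [p for p in PUNCT_EXCEPT_HYPHEN.split(word) if p]
        parts[0] = "\u2581" + parts[0]  # mark the word-initial space with ▁
        pieces.extend(parts)
    return pieces

print(pre_tokenize("yewwi-t-id ukan i gma-s."))
# ['▁yewwi-t-id', '▁ukan', '▁i', '▁gma-s', '.']
```

The clitic chains reach BPE as whole units, while the final period is split off. BPE may still break a unit into several tokens afterwards, depending on the merges learned.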

4. CLDR-Grounded Alphabet

The initial_alphabet is derived exclusively from the Unicode CLDR kab.xml exemplar character set:

Core: a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ

Auxiliary: o v

Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.
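As an illustration, the list below is built directly from the two character sets quoted above. A list of this shape is what a BPE trainer's `initial_alphabet` parameter expects (for example `BpeTrainer` in the `tokenizers` library); the helper name `build_initial_alphabet` and the single-code-point check are assumptions made for this sketch, not the author's training script.

```python
import unicodedata

CORE = "a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ"
AUXILIARY = "o v"

def build_initial_alphabet(*groups: str) -> list:
    """Return the unique letters of all groups, each as one NFC code point."""
    letters = []
    for group in groups:
        for letter in group.split():
            letter = unicodedata.normalize("NFC", letter)
            # Each exemplar letter should be atomic after NFC composition.
            if len(letter) != 1:
                raise ValueError(f"not a single code point: {letter!r}")
            if letter not in letters:
                letters.append(letter)
    return letters

initial_alphabet = build_initial_alphabet(CORE, AUXILIARY)
print(len(initial_alphabet))  # 36 (34 core + 2 auxiliary)
```

Normalizing the alphabet with the same form as the tokenizer's normalizer keeps the seeded characters identical to what the model sees at inference time.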

5. Distinct Special Tokens

Each special token has exactly one role:

| Token | Role |
| --- | --- |
| `<s>` | Beginning of sequence |
| `</s>` | End of sequence |
| `<unk>` | Unknown token |
| `<pad>` | Padding |
| `<mask>` | Masked token (for MLM) |

Benchmark

Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):

| Tokenizer | Vocab | Avg fertility ↓ | Design |
| --- | --- | --- | --- |
| boffire/kabyle-gpt2-tokenizer | 50,257 | 2.058 | ByteLevel BPE, broken Kabyle chars |
| Hillal-titouh/kabyle-bpe-tokenizer | 50,257 | 1.887 | No NFC, no mask token |
| boffire/kabyle-bpe-tokenizer-v2 | 48,000 | 1.736 | ✅ NFC, CLDR alphabet, clean punctuation |
| boffire/bpe-tokenizer-for-kabyle | 48,000 | 1.476 | Absorbs punctuation into words |

Note on the fertility of boffire/bpe-tokenizer-for-kabyle: the lower figure is structural. That tokenizer absorbs punctuation marks into word tokens (e.g. berra. as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. This tokenizer separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.

Sample tokenization

Yewwi-t-id ukan i gma-s.
→ ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .

Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
→ ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | .

Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
→ ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .
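For reference, fertility can be computed from segmentations like the three above. The sketch assumes fertility means tokens per whitespace-delimited word, which this card does not state explicitly, and it covers only these three samples. Its result is therefore not expected to match the 10-sentence benchmark figure.

```python
def fertility(samples) -> float:
    """Tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = sum(len(tokens) for _, tokens in samples)
    total_words = sum(len(sentence.split()) for sentence, _ in samples)
    return total_tokens / total_words

# (sentence, tokens) pairs copied from the sample tokenization above.
samples = [
    ("Yewwi-t-id ukan i gma-s.",
     ["▁yewwi-", "t-", "id", "▁ukan", "▁i", "▁gma-", "s", "."]),
    ("Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.",
     ["▁ur", "▁iẓri", "▁ara", "▁acu", "▁i", "▁d", "-", "yu", "ɣen",
      "▁ɣer", "▁taddart-", "n", "neɣ", "."]),
    ("Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.",
     ["▁yessekcem-", "it", "▁deg", "▁taddart", "▁imi", "▁yeffeɣ-", "d",
      "▁ɣer", "▁berra", "."]),
]

print(round(fertility(samples), 3))  # 32 tokens / 19 words = 1.684
```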

Training Data

Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.

Limitations

  • Lowercase only. The normalizer lowercases all input. Kabyle capitalization is positional (sentence-initial) and carries no morphological information, so little is lost for most NLP tasks. The step is not reversible, however: casing cannot be recovered from token IDs.
  • Corpus size. At ~20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix -nneɣ after varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus would likely improve this.
  • No Tifinagh or Arabic script support. This tokenizer covers the Latin orthography only.

Citation

If you use this tokenizer in your work, please cite it as:

@misc{boffire2026kabyle,
  author    = {boffire},
  title     = {Kabyle BPE Tokenizer v2},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
}

License

Apache 2.0