Update README.md

51a4240 verified about 6 hours ago

5.63 kB

language:
  - kab
  - ber
license: apache-2.0
tags:
  - tokenizer
  - bpe
  - kabyle
  - taqbaylit
  - tamazight
  - nlp
library_name: tokenizers
pipeline_tag: text-generation

Kabyle BPE Tokenizer v2

A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.

Overview

Property	Value
Language	Kabyle / Taqbaylit (`kab`)
Script	Latin (Mammeri orthography)
Algorithm	Byte-Pair Encoding (BPE)
Vocab size	48,000
Normalizer	NFC + Lowercase
Pre-tokenizer	Metaspace + Punctuation split
Model max length	512

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")

text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)
print(tokens)
# ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
#  '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']

Design Decisions

1. NFC Unicode Normalization

Kabyle special characters (ẓ, ɣ, ɛ, ḥ…) can be encoded in two ways in UTF-8: as a single precomposed codepoint, or as a base character + combining diacritic. Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization to guarantee consistency.

2. Metaspace Pre-tokenizer

Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as ▁, and all Unicode characters — including Kabyle-specific extended Latin letters — remain as single atomic units. This eliminates the byte-level garbling (áºĵ for ẓ, ÉĽ for ɛ) seen in byte-level tokenizers.

3. Hyphen Preserved as Morphological Marker

In Kabyle, the hyphen (-) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: yewwi-t-id), directional clitics (yeffeɣ-d), and possessive suffixes (tamurt-nneɣ). The punctuation splitter explicitly excludes - so these forms are pre-tokenized as single units and BPE can learn them correctly.

4. CLDR-Grounded Alphabet

The initial_alphabet is derived exclusively from the Unicode CLDR kab.xml exemplar character set:

Core: a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ

Auxiliary: o v

Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.

5. Distinct Special Tokens

Each special token has exactly one role:

Token	Role
`<s>`	Beginning of sequence
`</s>`	End of sequence
`<unk>`	Unknown token
`<pad>`	Padding
`<mask>`	Masked token (for MLM)

Benchmark

Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):

Tokenizer	Vocab	Avg fertility ↓	Design
`boffire/kabyle-gpt2-tokenizer`	50,257	2.058	ByteLevel BPE, broken Kabyle chars
`Hillal-titouh/kabyle-bpe-tokenizer`	50,257	1.887	No NFC, no mask token
`boffire/kabyle-bpe-tokenizer-v2`	48,000	1.736	✅ NFC, CLDR alphabet, clean punctuation
`boffire/bpe-tokenizer-for-kabyle`	48,000	1.476	Absorbs punctuation into words

Note on BPE (boffire) fertility: Its lower fertility is structural — it absorbs punctuation marks into word tokens (e.g. berra. as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. This tokenizer separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.

Sample tokenization

Yewwi-t-id ukan i gma-s.
→ ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .

Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
→ ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | .

Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
→ ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .

Training Data

Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.

Limitations

Lowercase only. The normalizer lowercases all input. Kabyle capitalization is positional (sentence-initial) and carries no morphological information, so this is lossless for NLP tasks. Casing cannot be recovered from token IDs.
Corpus size. At 20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix -nneɣ after varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus will improve this.
No Tifinagh or Arabic script support. This tokenizer covers the Latin orthography only.

Citation

If you use this tokenizer in your work, please cite it as:

@misc{boffire2026kabyle,
  author    = {boffire},
  title     = {Kabyle BPE Tokenizer v2},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
}

License

Apache 2.0