| --- |
| language: |
| - kab |
| - ber |
| license: apache-2.0 |
| tags: |
| - tokenizer |
| - bpe |
| - kabyle |
| - taqbaylit |
| - tamazight |
| - nlp |
| library_name: tokenizers |
| pipeline_tag: text-generation |
| --- |
| |
| # Kabyle BPE Tokenizer v2 |
|
|
| A BPE tokenizer for the Kabyle language (Taqbaylit), trained from scratch with NFC Unicode normalization and an alphabet grounded in Unicode CLDR. |
|
|
| ## Overview |
|
|
| | Property | Value | |
| |---|---| |
| | Language | Kabyle / Taqbaylit (`kab`) | |
| | Script | Latin (Mammeri orthography) | |
| | Algorithm | Byte-Pair Encoding (BPE) | |
| | Vocab size | 48,000 | |
| | Normalizer | NFC + Lowercase | |
| | Pre-tokenizer | Metaspace + Punctuation split | |
| | Model max length | 512 | |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoTokenizer |
| |
| tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2") |
| |
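| text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel." |
| # Skip special tokens so that only the text tokens are printed. |
| ids = tok.encode(text, add_special_tokens=False) |
| tokens = tok.convert_ids_to_tokens(ids) |
| print(tokens) |
| # Expected output (the normalizer lowercases the input): |
| # ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan', |
| # '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.'] |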
| ``` |
|
|
| ## Design Decisions |
|
|
| ### 1. NFC Unicode Normalization |
| Kabyle letters that carry a diacritic (č, ḍ, ǧ, ḥ, ṛ, ṣ, ṭ, ẓ) can be represented in Unicode in two ways: as a single precomposed code point, or as a base letter followed by a combining diacritic. (ɣ and ɛ are standalone letters with a single encoding.) Without normalization, the same word can produce different token sequences depending on how the source text was encoded. NFC normalization is applied before tokenization so that both forms map to the same tokens. |
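
The effect can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch of the normalization step only, not the tokenizer's own code; the word `iẓri` is taken from the sample sentences in this card, and the variable names are illustrative.

```python
import unicodedata

# The same word written with a precomposed letter and with a combining mark.
precomposed = "i\u1e93ri"    # ẓ as U+1E93 (one code point)
decomposed = "iz\u0323ri"    # z + U+0323 COMBINING DOT BELOW (two code points)

print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 4 5

# After NFC normalization both spellings are the same string.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
print(nfc_decomposed == precomposed)        # True
```

Both strings render identically on screen, which is why the mismatch is easy to miss without normalization.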
|
|
| ### 2. Metaspace Pre-tokenizer |
| Instead of the ByteLevel pre-tokenizer used by GPT-2-style tokenizers, this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as `▁`, and every Unicode character, including the Kabyle-specific extended Latin letters, remains a single atomic unit. This avoids the unreadable byte-level token strings (`áºĵ` for `ẓ`, `ÉĽ` for `ɛ`) that byte-level tokenizers produce. |
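
To see where strings such as `áºĵ` come from, the sketch below re-creates the byte-to-character table used by GPT-2-style byte-level BPE in plain Python. It is an illustrative reimplementation, not code from this repository, and the helper names `bytes_to_unicode` and `byte_level_view` are chosen here for the example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to a code point starting at U+0100.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TABLE = bytes_to_unicode()


def byte_level_view(text):
    """Show how a byte-level tokenizer represents the UTF-8 bytes of `text`."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))


print(byte_level_view("\u1e93"))  # ẓ (3 UTF-8 bytes) -> áºĵ
print(byte_level_view("\u025b"))  # ɛ (2 UTF-8 bytes) -> ÉĽ
```

A Metaspace-based tokenizer skips this byte mapping, so `ẓ` and `ɛ` appear in the vocabulary as themselves.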
|
|
| ### 3. Hyphen Preserved as Morphological Marker |
| In Kabyle, the hyphen (`-`) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: `yewwi-t-id`), directional clitics (`yeffeɣ-d`), and possessive suffixes (`tamurt-nneɣ`). The punctuation splitter explicitly excludes `-` so these forms are pre-tokenized as single units and BPE can learn them correctly. |
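
The splitting rule can be approximated with a regular expression. The sketch below is a simplified stand-in for the Metaspace + punctuation pre-tokenizer, written with the standard `re` module; the function name `pre_tokenize` and the exact pattern are assumptions for illustration, not the configuration shipped with this tokenizer.

```python
import re

METASPACE = "\u2581"  # the "▁" word-boundary marker

# Any character that is not a letter/digit/underscore, whitespace, or hyphen
# is treated as punctuation and isolated as its own piece.
PUNCT_EXCEPT_HYPHEN = re.compile(r"([^\w\s\-])")


def pre_tokenize(text):
    """Split on whitespace, isolate punctuation, and keep hyphenated forms whole."""
    pieces = []
    for chunk in text.split():
        parts = [p for p in PUNCT_EXCEPT_HYPHEN.split(chunk) if p]
        parts[0] = METASPACE + parts[0]  # mark the start of each word
        pieces.extend(parts)
    return pieces


print(pre_tokenize("Yewwi-t-id ukan i gma-s."))
# ['▁Yewwi-t-id', '▁ukan', '▁i', '▁gma-s', '.']
```

Because `yewwi-t-id` reaches BPE as one unit, merges that include the hyphen (such as `▁yewwi-` or `t-`) can be learned from the corpus.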
|
|
| ### 4. CLDR-Grounded Alphabet |
| The `initial_alphabet` is derived exclusively from the [Unicode CLDR `kab.xml`](https://github.com/unicode-org/cldr/blob/main/common/main/kab.xml) exemplar character set: |
|
|
| **Core:** `a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ` |
|
|
| **Auxiliary:** `o v` |
|
|
| Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus. |
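
As a rough illustration, the alphabet can be built as a plain Python list and checked against a vocabulary. The helper `missing_from_vocab` and the toy vocabulary are hypothetical; with the real tokenizer, the vocabulary would come from the loaded tokenizer instead.

```python
import unicodedata

CORE = "a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ".split()
AUXILIARY = "o v".split()

# Normalize to NFC so every letter is stored as a single precomposed code point.
initial_alphabet = [unicodedata.normalize("NFC", ch) for ch in CORE + AUXILIARY]
print(len(initial_alphabet))                         # 36
print(all(len(ch) == 1 for ch in initial_alphabet))  # True


def missing_from_vocab(vocab):
    """Return the alphabet letters that are absent from `vocab` (hypothetical helper)."""
    return [ch for ch in initial_alphabet if ch not in vocab]


# Toy vocabulary lacking two letters, used only to demonstrate the check.
toy_vocab = set(initial_alphabet) - {"o", "v"}
print(missing_from_vocab(toy_vocab))                 # ['o', 'v']
```

For a tokenizer trained with this list as its `initial_alphabet`, the same check against the real vocabulary should return an empty list.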
|
|
| ### 5. Distinct Special Tokens |
| Each special token has exactly one role: |
|
|
| | Token | Role | |
| |---|---| |
| | `<s>` | Beginning of sequence | |
| | `</s>` | End of sequence | |
| | `<unk>` | Unknown token | |
| | `<pad>` | Padding | |
| | `<mask>` | Masked token (for MLM) | |
|
|
| ## Benchmark |
|
|
| Comparison against three other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast). Fertility is the average number of tokens per word; lower is better: |
|
|
| | Tokenizer | Vocab | Avg fertility ↓ | Design | |
| |---|---|---|---| |
| | `boffire/kabyle-gpt2-tokenizer` | 50,257 | 2.058 | ByteLevel BPE, broken Kabyle chars | |
| | `Hillal-titouh/kabyle-bpe-tokenizer` | 50,257 | 1.887 | No NFC, no mask token | |
| | **`boffire/Kabyle-BPE-Tokenizer-v2`** | **48,000** | **1.736** | ✅ NFC, CLDR alphabet, clean punctuation | |
| | `boffire/bpe-tokenizer-for-kabyle` | 48,000 | 1.476 | Absorbs punctuation into words | |
|
|
| > **Note on `boffire/bpe-tokenizer-for-kabyle` fertility:** Its lower fertility is structural: it absorbs punctuation marks into word tokens (e.g. `berra.` as a single token), which reduces the token count but hides punctuation boundaries from the model. This tokenizer emits punctuation as separate tokens, which costs about 10 extra tokens across the benchmark but yields linguistically cleaner segmentation. |
|
|
| ### Sample tokenization |
|
|
| ``` |
| Yewwi-t-id ukan i gma-s. |
| → ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | . |
| |
| Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ. |
| → ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | . |
| |
| Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra. |
| → ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | . |
| ``` |
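
Fertility can be computed directly from the three samples above. The sketch below assumes fertility means total tokens divided by whitespace-delimited words; this card does not state exactly how the benchmark average was computed, so the result is not expected to match the table, which covers 10 sentences.

```python
# (sentence, tokenization) pairs copied from the samples above.
samples = [
    ("Yewwi-t-id ukan i gma-s.",
     "▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | ."),
    ("Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.",
     "▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | ."),
    ("Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.",
     "▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | ."),
]


def fertility(pairs):
    """Total tokens divided by total whitespace-delimited words (assumed definition)."""
    n_tokens = sum(len(tokens.split(" | ")) for _, tokens in pairs)
    n_words = sum(len(sentence.split()) for sentence, _ in pairs)
    return n_tokens / n_words


print(round(fertility(samples), 3))  # 32 tokens / 19 words -> 1.684
```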
|
|
| ## Training Data |
|
|
| Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography. |
|
|
| ## Limitations |
|
|
| - **Lowercase only.** The normalizer lowercases all input. Kabyle capitalization carries no morphological information, so little is lost for most NLP tasks, but the original casing (sentence-initial capitals and proper nouns such as `Leqbayel`) cannot be recovered from token IDs. |
| - **Corpus size.** With only ~20 MB of training text, some low-frequency morpheme combinations (e.g. the possessive suffix `-nneɣ` after varied stems) may not have been merged and are split across 2–3 tokens. A larger corpus would likely improve this. |
| - **No Tifinagh or Arabic script support.** This tokenizer covers the Latin orthography only. |
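
The casing limitation can be illustrated with a stand-in for the NFC + Lowercase normalizer. This sketch assumes Python's `str.lower()` behaves like the tokenizer's lowercase step for Kabyle text; the function name `normalize` is chosen here for the example.

```python
import unicodedata


def normalize(text):
    """Approximate the NFC + Lowercase normalizer with the standard library."""
    return unicodedata.normalize("NFC", text).lower()


# Differently cased inputs collapse to the same normalized string, so the
# original casing cannot be reconstructed after tokenization.
variants = ["Leqbayel", "leqbayel", "LEQBAYEL"]
normalized = {normalize(v) for v in variants}
print(normalized)  # {'leqbayel'}
```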
|
|
| ## Citation |
|
|
| If you use this tokenizer in your work, please cite it as: |
|
|
| ```bibtex |
| @misc{boffire2026kabyle, |
| author = {boffire}, |
| title = {Kabyle BPE Tokenizer v2}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| url = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2} |
| } |
| ``` |
|
|
| ## License |
|
|
| [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |