Update README.md
README.md
CHANGED
|
@@ -1,3 +1,131 @@
|
|
| 1 |
---
|
| 2 |
-
| 3 |
---
| 1 |
---
|
| 2 |
+
language:
|
| 3 |
+
- kab
|
| 4 |
+
- ber
|
| 5 |
+
license: apache-2.0
|
| 6 |
+
tags:
|
| 7 |
+
- tokenizer
|
| 8 |
+
- bpe
|
| 9 |
+
- kabyle
|
| 10 |
+
- taqbaylit
|
| 11 |
+
- tamazight
|
| 12 |
+
- nlp
|
| 13 |
+
library_name: tokenizers
|
| 14 |
+
pipeline_tag: text-generation
|
| 15 |
---
|
| 16 |
+
|
| 17 |
+
# Kabyle BPE Tokenizer v2
|
| 18 |
+
|
| 19 |
+
A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.
|
| 20 |
+
|
| 21 |
+
## Overview
|
| 22 |
+
|
| 23 |
+
| Property | Value |
|
| 24 |
+
|---|---|
|
| 25 |
+
| Language | Kabyle / Taqbaylit (`kab`) |
|
| 26 |
+
| Script | Latin (Mammeri orthography) |
|
| 27 |
+
| Algorithm | Byte-Pair Encoding (BPE) |
|
| 28 |
+
| Vocab size | 48,000 |
|
| 29 |
+
| Normalizer | NFC + Lowercase |
|
| 30 |
+
| Pre-tokenizer | Metaspace + Punctuation split |
|
| 31 |
+
| Model max length | 512 |
|
| 32 |
+
|
| 33 |
+
## Usage
|
| 34 |
+
|
| 35 |
+
```python
|
| 36 |
+
from transformers import AutoTokenizer
|
| 37 |
+
|
| 38 |
+
tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")
|
| 39 |
+
|
| 40 |
+
text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
|
| 41 |
+
ids = tok.encode(text)
|
| 42 |
+
tokens = tok.convert_ids_to_tokens(ids)
|
| 43 |
+
print(tokens)
|
| 44 |
+
# ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
|
| 45 |
+
# '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
## Design Decisions
|
| 49 |
+
|
| 50 |
+
### 1. NFC Unicode Normalization
|
| 51 |
+
Kabyle letters that carry a diacritic (ẓ, ḥ, ḍ, č…) can be encoded in two ways in Unicode: as a single precomposed codepoint, or as a base character followed by a combining diacritic. (Letters such as ɣ and ɛ have only one encoding.) Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization to guarantee consistency.
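
As a small illustration of the problem, the snippet below uses only Python's standard `unicodedata` module to compare the two spellings of `ẓ`. The variable names are illustrative and not part of the tokenizer.

```python
import unicodedata

precomposed = "\u1e93"   # ẓ as a single codepoint (U+1E93)
decomposed = "z\u0323"   # z followed by COMBINING DOT BELOW (U+0323)

# The two strings render identically but compare unequal before normalization.
print(precomposed == decomposed)   # False

# After NFC both collapse to the same single codepoint.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b, len(nfc_b))  # True 1
```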
|
| 52 |
+
|
| 53 |
+
### 2. Metaspace Pre-tokenizer
|
| 54 |
+
Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as `▁`, and all Unicode characters — including Kabyle-specific extended Latin letters — remain as single atomic units. This eliminates the byte-level garbling (`áºĵ` for `ẓ`, `ÉĽ` for `ɛ`) seen in byte-level tokenizers.
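
To see where the garbled strings come from, here is a minimal sketch that rebuilds the byte-to-character table used by GPT-2-style byte-level BPE and applies it to the UTF-8 bytes of a Kabyle letter. The function names `byte_to_char_table` and `byte_level_view` are hypothetical helpers for this illustration, not part of any library.

```python
def byte_to_char_table():
    """Map each byte 0-255 to a printable character, GPT-2 style."""
    # Bytes that are already printable keep their own character.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted into the range starting at U+0100.
    n = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + n)
            n += 1
    return table


def byte_level_view(text):
    """Show how a byte-level tokenizer spells `text` internally."""
    table = byte_to_char_table()
    return "".join(table[b] for b in text.encode("utf-8"))


print(byte_level_view("\u1e93"))  # ẓ (U+1E93) becomes áºĵ
print(byte_level_view("\u025b"))  # ɛ (U+025B) becomes ÉĽ
```

Each Kabyle letter outside ASCII is spread over two or three of these byte symbols, which is what a Metaspace-based tokenizer avoids.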
|
| 55 |
+
|
| 56 |
+
### 3. Hyphen Preserved as Morphological Marker
|
| 57 |
+
In Kabyle, the hyphen (`-`) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: `yewwi-t-id`), directional clitics (`yeffeɣ-d`), and possessive suffixes (`tamurt-nneɣ`). The punctuation splitter explicitly excludes `-` so these forms are pre-tokenized as single units and BPE can learn them correctly.
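
The effect of excluding the hyphen can be sketched in plain Python. This is an approximation written for illustration, assuming whitespace splitting followed by a regex that isolates every punctuation mark except `-`; it is not the tokenizer's actual configuration, and `pre_tokenize` is a hypothetical name.

```python
import re
import unicodedata

# Any character that is not a word character, whitespace, or a hyphen.
PUNCT_EXCEPT_HYPHEN = re.compile(r"([^\w\s\-])")


def pre_tokenize(text):
    """Approximate NFC + lowercase + Metaspace + punctuation split."""
    text = unicodedata.normalize("NFC", text).lower()
    pieces = []
    for word in text.split():
        # Split off punctuation but leave hyphenated clitic chains intact.
        parts = [p for p in PUNCT_EXCEPT_HYPHEN.split(word) if p]
        parts[0] = "▁" + parts[0]  # mark the word-initial space
        pieces.extend(parts)
    return pieces


print(pre_tokenize("Yewwi-t-id ukan i gma-s."))
# ['▁yewwi-t-id', '▁ukan', '▁i', '▁gma-s', '.']
```

BPE merges then operate inside each of these units, so a form like `yewwi-t-id` can still end up as several tokens, but it is never cut at the hyphen before training begins.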
|
| 58 |
+
|
| 59 |
+
### 4. CLDR-Grounded Alphabet
|
| 60 |
+
The `initial_alphabet` is derived exclusively from the [Unicode CLDR `kab.xml`](https://github.com/unicode-org/cldr/blob/main/common/main/kab.xml) exemplar character set:
|
| 61 |
+
|
| 62 |
+
**Core:** `a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ`
|
| 63 |
+
|
| 64 |
+
**Auxiliary:** `o v`
|
| 65 |
+
|
| 66 |
+
Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.
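
As a quick sanity check, the character set above can be written out and verified to consist of single NFC codepoints. This is a sketch using only the standard library; the list it builds is the kind of value a BPE trainer's `initial_alphabet` option would receive, but how the author actually constructed it is not shown in this README.

```python
import unicodedata

CORE = "a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ".split()
AUXILIARY = "o v".split()

# Normalize defensively so each letter is one precomposed codepoint.
initial_alphabet = [unicodedata.normalize("NFC", ch) for ch in CORE + AUXILIARY]

# Every entry is a single codepoint and there are no duplicates.
print(all(len(ch) == 1 for ch in initial_alphabet))         # True
print(len(set(initial_alphabet)) == len(initial_alphabet))  # True
print(len(initial_alphabet))                                # 36
```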
|
| 67 |
+
|
| 68 |
+
### 5. Distinct Special Tokens
|
| 69 |
+
Each special token has exactly one role:
|
| 70 |
+
|
| 71 |
+
| Token | Role |
|
| 72 |
+
|---|---|
|
| 73 |
+
| `<s>` | Beginning of sequence |
|
| 74 |
+
| `</s>` | End of sequence |
|
| 75 |
+
| `<unk>` | Unknown token |
|
| 76 |
+
| `<pad>` | Padding |
|
| 77 |
+
| `<mask>` | Masked token (for MLM) |
|
| 78 |
+
|
| 79 |
+
## Benchmark
|
| 80 |
+
|
| 81 |
+
Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):
|
| 82 |
+
|
| 83 |
+
| Tokenizer | Vocab | Avg fertility ↓ | Design |
|
| 84 |
+
|---|---|---|---|
|
| 85 |
+
| `boffire/kabyle-gpt2-tokenizer` | 50,257 | 2.058 | ByteLevel BPE, broken Kabyle chars |
|
| 86 |
+
| `Hillal-titouh/kabyle-bpe-tokenizer` | 50,257 | 1.887 | No NFC, no mask token |
|
| 87 |
+
| **`boffire/kabyle-bpe-tokenizer-v2`** | **48,000** | **1.736** | ✅ NFC, CLDR alphabet, clean punctuation |
|
| 88 |
+
| `boffire/bpe-tokenizer-for-kabyle` | 48,000 | 1.476 | Absorbs punctuation into words |
|
| 89 |
+
|
| 90 |
+
> **Note on `boffire/bpe-tokenizer-for-kabyle` fertility:** The lower fertility of that tokenizer is structural: it absorbs punctuation marks into word tokens (e.g. `berra.` as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. Kabyle BPE Tokenizer v2 separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.
|
| 91 |
+
|
| 92 |
+
### Sample tokenization
|
| 93 |
+
|
| 94 |
+
```
|
| 95 |
+
Yewwi-t-id ukan i gma-s.
|
| 96 |
+
→ ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .
|
| 97 |
+
|
| 98 |
+
Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
|
| 99 |
+
→ ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | .
|
| 100 |
+
|
| 101 |
+
Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
|
| 102 |
+
→ ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .
|
| 103 |
+
```
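
The fertility figures in the benchmark table can be related to samples like these with a short calculation. The README does not spell out the exact formula, so this sketch assumes fertility means total tokens divided by total whitespace-separated words; the `fertility` function is a hypothetical helper.

```python
def fertility(token_lists, sentences):
    """Tokens per whitespace-separated word (assumed definition)."""
    total_tokens = sum(len(tokens) for tokens in token_lists)
    total_words = sum(len(sentence.split()) for sentence in sentences)
    return total_tokens / total_words


sentences = [
    "Yewwi-t-id ukan i gma-s.",
    "Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.",
]
# Token sequences copied from the sample tokenization above.
token_lists = [
    "▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .".split(" | "),
    "▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .".split(" | "),
]

score = fertility(token_lists, sentences)
print(f"{score:.3f}")
```

Because the sentence-final `.` counts as a token but not as a word, separating punctuation raises fertility slightly, which is the effect described in the note on the benchmark table.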
|
| 104 |
+
|
| 105 |
+
## Training Data
|
| 106 |
+
|
| 107 |
+
Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.
|
| 108 |
+
|
| 109 |
+
## Limitations
|
| 110 |
+
|
| 111 |
+
- **Lowercase only.** The normalizer lowercases all input. Kabyle capitalization is positional (sentence-initial) and carries no morphological information, so little is lost for most NLP tasks, but casing cannot be recovered from token IDs.
|
| 112 |
+
- **Corpus size.** At 20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix `-nneɣ` after varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus will improve this.
|
| 113 |
+
- **No Tifinagh or Arabic script support.** This tokenizer covers the Latin orthography only.
|
| 114 |
+
|
| 115 |
+
## Citation
|
| 116 |
+
|
| 117 |
+
If you use this tokenizer in your work, please cite it as:
|
| 118 |
+
|
| 119 |
+
```bibtex
|
| 120 |
+
@misc{boffire2025kabyle,
|
| 121 |
+
author = {boffire},
|
| 122 |
+
title = {Kabyle BPE Tokenizer v2},
|
| 123 |
+
year = {2025},
|
| 124 |
+
publisher = {HuggingFace},
|
| 125 |
+
url = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
|
| 126 |
+
}
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
## License
|
| 130 |
+
|
| 131 |
+
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|