| --- |
| language: |
| - kab |
| - ber |
| license: mit |
| tags: |
| - tokenizer |
| - kabyle |
| - berber |
| - bpe |
| - low-resource |
| - amazigh |
| - tamaziɣt |
| - taqbaylit |
| --- |
| |
| # BPE Tokenizer for Kabyle |
|
|
| A high-performance Byte-Pair Encoding (BPE) tokenizer optimized for the Kabyle language (Taqbaylit) and designed for low-resource NLP and LLM training. |
|
|
| ## Key Specifications |
| - **Vocabulary Size:** 48,011 tokens |
| - **Fertility:** 1.376 tokens/word |
| - **Corpus:** ~19M characters of cleaned Kabyle text |
| - **Encoding Speed:** ~34,000 sentences/second (CPU) |
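
For reference, fertility is taken here to mean the average number of tokens produced per whitespace-separated word, and encoding speed the number of sentences tokenized per second. The sketch below shows one way such figures could be measured; it is not the benchmark script behind the numbers above. The helper names `fertility` and `sentences_per_second` are hypothetical, `tokenize` stands for any callable that maps a string to a list of tokens (for example `tokenizer.tokenize`), and `toy_tokenize` is a stand-in so the example runs without downloading anything.

```python
import time
from typing import Callable, Iterable, List

Tokenize = Callable[[str], List[str]]


def fertility(tokenize: Tokenize, sentences: Iterable[str]) -> float:
    """Average tokens per whitespace-separated word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def sentences_per_second(tokenize: Tokenize, sentences: List[str]) -> float:
    """Wall-clock throughput of `tokenize` over `sentences`."""
    start = time.perf_counter()
    for sentence in sentences:
        tokenize(sentence)
    elapsed = time.perf_counter() - start
    return len(sentences) / elapsed if elapsed > 0 else float("inf")


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: whitespace split with '.' as its own token."""
    return text.replace(".", " .").split()


sample = ["aɣ-d-yini tameddurt-nneɣ deg taddart."]
print("fertility:", fertility(toy_tokenize, sample))
print("sentences/s:", sentences_per_second(toy_tokenize, sample * 1000))
```

To measure the real tokenizer, pass `tokenizer.tokenize` in place of `toy_tokenize` and use a held-out sample of Kabyle sentences. Throughput depends heavily on hardware and batching, so results will vary.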
|
|
| ## Usage |
| ```python |
| from transformers import AutoTokenizer |
| |
| # Load the tokenizer from the Hugging Face Hub |
| tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle") |
| |
| # Split a Kabyle sentence into tokens |
| tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.") |
| print(tokens) |
| # ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.'] |
| ``` |
|
|
| ## Performance Comparison |
| Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer): |
|
|
| - **Sequence Efficiency:** 13.6% lower fertility (1.376 vs. 1.593 tokens/word) |
| - **Unicode Handling:** Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback |
| - **Morphological Awareness:** High-frequency prefixes, clitics, and compounds preserved as single tokens |
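
The relative difference between two fertility values is a one-line calculation. The sketch below applies it to the two figures quoted above and shows the rough effect on sequence length; `relative_reduction` is a hypothetical helper name, not part of any library.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Percent by which `ours` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline * 100.0


# Fertility values (tokens/word) quoted in the comparison above
ours, baseline = 1.376, 1.593
reduction = relative_reduction(ours, baseline)
print(f"{reduction:.1f}% fewer tokens per word than the baseline")

# Rough effect on sequence length for a text of 1,000 words
words = 1000
ours_tokens = round(ours * words)
baseline_tokens = round(baseline * words)
print(ours_tokens, "vs", baseline_tokens, "tokens")
```

Lower fertility means shorter sequences for the same text, which lets more Kabyle text fit into a fixed context window.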
|
|
| ## Training Details |
| Trained with the Hugging Face `tokenizers` library on a deduplicated, NFC-normalized Kabyle corpus. Post-training analysis identified high-frequency words that BPE split suboptimally; these were added to the vocabulary as fixed tokens. This keeps their tokenization consistent without altering the learned BPE merges. |
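
The corpus preparation described above can be sketched with the standard library. This is a minimal illustration of NFC normalization followed by exact-duplicate removal, not the author's actual pipeline; `prepare_corpus` is a hypothetical helper and the sample lines are made up for the example.

```python
import unicodedata
from typing import Iterable, List


def prepare_corpus(lines: Iterable[str]) -> List[str]:
    """NFC-normalize each line, then drop blank lines and exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = unicodedata.normalize("NFC", line.strip())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)  # keep the first occurrence, preserve order
    return cleaned


# "z" + U+0323 (combining dot below) and precomposed U+1E93 both render as "ẓ"
decomposed = "z\u0323"
precomposed = "\u1e93"
raw = [f"a{decomposed}ar", f"a{precomposed}ar", "deg taddart", "deg taddart "]
corpus = prepare_corpus(raw)
print(corpus)
```

Normalizing before training matters because the decomposed and precomposed spellings of a letter such as ẓ would otherwise be treated as different character sequences, which splits their counts during BPE training.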
|
|
| ## Limitations |
|
|
| - Compound-token overrides are exact-match only, so unseen morphological variants may still be split. |
| - Optimized specifically for Kabyle. Performance may degrade on multilingual or code-switched text. |
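
The exact-match limitation can be illustrated with a toy model. The sketch below is a deliberate simplification, not how the tokenizer is implemented: `tokenize_word` is a hypothetical helper, `FIXED_TOKENS` is an assumed override list containing a single entry, and `fallback_split` is a stand-in for the BPE model that simply splits on hyphens.

```python
from typing import List, Set

# Assumed override list for illustration only
FIXED_TOKENS: Set[str] = {"tameddurt-nneɣ"}


def fallback_split(word: str) -> List[str]:
    """Stand-in for BPE: split a word around its hyphens."""
    return word.replace("-", " - ").split()


def tokenize_word(word: str, fixed_tokens: Set[str] = FIXED_TOKENS) -> List[str]:
    """Return the word whole if it matches an override exactly, else split it."""
    if word in fixed_tokens:
        return [word]
    return fallback_split(word)


seen_form = tokenize_word("tameddurt-nneɣ")    # exact match: kept whole
unseen_form = tokenize_word("tameddurt-nsen")  # variant: falls back to splitting
print(seen_form)
print(unseen_form)
```

In this toy model, only the exact string in the override list is kept as one token. A variant with a different clitic falls through to the fallback splitter, which mirrors the behavior described in the first limitation.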
|
|
| ## License & Citation |
|
|
| - **License:** MIT |