---
language:
- kab
- ber
license: mit
tags:
- tokenizer
- kabyle
- berber
- bpe
- low-resource
- amazigh
- tamaziɣt
- taqbaylit
---

# BPE Tokenizer for Kabyle

A high-performance Byte-Pair Encoding (BPE) tokenizer optimized for the Kabyle language (Taqbaylit). Designed for low-resource NLP and LLM training.

## Key Specifications

- **Vocabulary Size:** 48,011 tokens
- **Fertility:** 1.376 tokens/word
- **Corpus:** ~19M characters of cleaned Kabyle text
- **Encoding Speed:** ~34,000 sentences/second (CPU)

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle")

tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.")
# ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.']
```

## Performance Comparison

Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer):

- **Sequence Efficiency:** about 13.6% lower fertility (1.376 vs 1.593 tokens/word)
- **Unicode Handling:** Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback
- **Morphological Awareness:** High-frequency prefixes, clitics, and compounds preserved as single tokens

## Training Details

Trained using the Hugging Face `tokenizers` library on a deduplicated, NFC-normalized Kabyle corpus. Post-training analysis identified suboptimal BPE splits on high-frequency words and corrected them by adding those words to the vocabulary as fixed tokens. This keeps tokenization of those words consistent without corrupting the BPE merge graph.

## Limitations

- Compound token overrides are exact-match. Unseen morphological variants may still split.
- Optimized specifically for Kabyle. Performance may degrade on multilingual or code-switched text.

## License & Citation

- **License:** MIT
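
## Appendix: Measuring Fertility

The fertility figures quoted in this card are token-per-word ratios. The card does not say how words were counted, so the following is only a minimal sketch under the assumption that a "word" is a whitespace-delimited chunk. The helper names `fertility` and `relative_reduction` are hypothetical and are not part of this repository or of any library.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], sentences: Iterable[str]) -> float:
    """Average number of tokens per word over a set of sentences.

    Assumption: words are whitespace-delimited. The card does not
    specify the word segmentation used for its reported figures.
    """
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline
```

To apply this to a loaded tokenizer, pass its `tokenize` method as the first argument, for example `fertility(tokenizer.tokenize, sentences)`, where `sentences` is any held-out list of Kabyle strings. Results depend on the evaluation text and on the word-splitting assumption, so they may not reproduce the figures above exactly.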