---
language:
- kab
- ber
license: mit
tags:
- tokenizer
- kabyle
- berber
- bpe
- low-resource
- amazigh
- tamaziɣt
- taqbaylit
---
# BPE Tokenizer for Kabyle
A high-performance Byte-Pair Encoding tokenizer optimized for the Kabyle language (Taqbaylit). Designed for low-resource NLP and LLM training.
## Key Specifications
- **Vocabulary Size:** 48,011 tokens
- **Fertility:** 1.376 tokens/word
- **Corpus:** ~19M characters of cleaned Kabyle text
- **Encoding Speed:** ~34,000 sentences/second (CPU)
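
Fertility is read here as the average number of tokens produced per whitespace-separated word. The card does not spell out its exact counting rule, so the sketch below is an assumption about the metric, not the author's evaluation script. `fertility` and `toy_tokenize` are hypothetical names; the toy tokenizer stands in for `tokenizer.tokenize`.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], sentences: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(sentence: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): splits a trailing period off each word."""
    tokens: List[str] = []
    for word in sentence.split():
        if len(word) > 1 and word.endswith("."):
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


sample = ["aɣ-d-yini tameddurt-nneɣ deg taddart."]
score = fertility(toy_tokenize, sample)
print(score)  # 5 tokens over 4 words
```

To measure a real tokenizer, pass `tokenizer.tokenize` in place of `toy_tokenize` and feed it a held-out set of sentences.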
## Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle")
# Split a Kabyle sentence into tokens
tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.")
# ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.']
```
## Performance Comparison
Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer):
- **Sequence Efficiency:** About 13.6% lower fertility (1.376 vs 1.593 tokens/word)
- **Unicode Handling:** Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback
- **Morphological Awareness:** High-frequency prefixes, clitics, and compounds preserved as single tokens
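
To make the cost of byte-level fallback concrete, the sketch below counts the UTF-8 bytes behind each character listed above. A tokenizer that falls back to raw bytes could spend up to that many tokens on a single letter. This is a standard-library illustration only, not a measurement of either tokenizer.

```python
import unicodedata

# Characters listed in the card, NFC-normalized so each one is a single code point.
KABYLE_CHARS = list(unicodedata.normalize("NFC", "ɛẓṛṭṣḍǧḥɣč"))

# UTF-8 byte count per character: an upper bound on the tokens
# a pure byte fallback would spend on that letter.
byte_cost = {ch: len(ch.encode("utf-8")) for ch in KABYLE_CHARS}

for ch, n in byte_cost.items():
    print(f"{ch} U+{ord(ch):04X}: {n} bytes")
```

Every character in the list takes two or three bytes, so keeping each as a single vocabulary entry shortens sequences for ordinary Kabyle text.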
## Training Details
Trained using the Hugging Face `tokenizers` library on a deduplicated, NFC-normalized Kabyle corpus. Post-training analysis identified suboptimal BPE splits on high-frequency words; these were corrected by adding the affected words to the vocabulary as fixed tokens. This ensures consistent tokenization without corrupting the BPE merge graph.
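
The corpus preparation described above can be sketched with the standard library alone. The cleaning rules here (strip whitespace, drop empty lines, drop exact duplicates) are assumptions for illustration; the card does not document the actual pipeline, and `clean_corpus` is a hypothetical helper.

```python
import unicodedata
from typing import Iterable, List


def clean_corpus(lines: Iterable[str]) -> List[str]:
    """NFC-normalize each line, strip it, and drop empties and exact duplicates."""
    seen = set()
    cleaned: List[str] = []
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


raw = [
    "a\u1e93ar",   # precomposed ẓ (U+1E93)
    "az\u0323ar",  # z + combining dot below (U+0323): same word, different code points
    "",
    "taddart",
]
corpus = clean_corpus(raw)
print(corpus)
```

Normalizing before deduplicating matters: the first two entries differ at the code-point level but become identical under NFC, so only one survives.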
## Limitations
- Compound token overrides are exact-match. Unseen morphological variants may still split.
- Optimized specifically for Kabyle. Performance may degrade on multilingual or code-switched text.
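
The exact-match limitation can be illustrated with a toy model. The override set and the hyphen-splitting fallback below are hypothetical stand-ins: the real fixed-token list is not enumerated in this card, and real BPE output will differ from `toy_split`.

```python
from typing import List, Set

# Hypothetical override list; the real fixed tokens are not listed in the card.
FIXED_TOKENS: Set[str] = {"tameddurt-nneɣ"}


def toy_split(word: str) -> List[str]:
    """Stand-in for ordinary subword splitting (hypothetical): split at hyphens, keeping them."""
    pieces: List[str] = []
    for i, part in enumerate(word.split("-")):
        if i > 0:
            pieces.append("-")
        if part:
            pieces.append(part)
    return pieces


def tokenize_word(word: str) -> List[str]:
    """Keep the word whole only on an exact match; otherwise fall back to splitting."""
    if word in FIXED_TOKENS:
        return [word]
    return toy_split(word)


print(tokenize_word("tameddurt-nneɣ"))  # exact match: kept as one token
print(tokenize_word("tameddurt-nsen"))  # variant not in the list: split by the fallback
```

Because the lookup is an exact string match, a different clitic on the same stem gets no benefit from the override.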
## License & Citation
- **License:** MIT