boffire/bpe-tokenizer-for-kabyle


Update README.md

boffire committed 1 day ago · Commit 4f8f92e (verified) · 1 parent: 08edfb9

Files changed (1)

README.md +48 -1


@@ -1,3 +1,50 @@
 ---
-license: apache-2.0
 ---

 ---
+language: kab
+license: mit
+tags:
+- tokenizer
+- kabyle
+- berber
+- bpe
+- low-resource
+- amazigh
+- taqbaylit
 ---
+# BPE Tokenizer for Kabyle
+A Byte-Pair Encoding (BPE) tokenizer optimized for the Kabyle language (Taqbaylit), designed for low-resource NLP and LLM training.
+## Key Specifications
+- **Vocabulary Size:** 48,011 tokens
+- **Fertility:** 1.376 tokens/word
+- **Corpus:** ~19M characters of cleaned Kabyle text
+- **Encoding Speed:** ~34,000 sentences/second (CPU)
+## Usage
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle")
+tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.")
+# ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.']
+```
+## Performance Comparison
+Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer):
+- **Sequence Efficiency:** about 13.6% lower fertility (1.376 vs 1.593 tokens/word)
+- **Unicode Handling:** Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback
+- **Morphological Awareness:** High-frequency prefixes, clitics, and compounds preserved as single tokens
+## Training Details
+Trained using the Hugging Face tokenizers library on a deduplicated, NFC-normalized Kabyle corpus. Post-training analysis identified and corrected suboptimal BPE splits on high-frequency words by adding them to the vocabulary as fixed tokens. This ensures consistent tokenization without corrupting the BPE merge graph.
+## Limitations
+- Compound token overrides are exact-match. Unseen morphological variants may still split.
+- Optimized specifically for Kabyle. Performance may degrade on multilingual or code-switched text.
+## License & Citation
+- **License:** MIT
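
Note on the figures in this README: fertility and the relative comparison against a baseline can be recomputed with a few lines of Python. The sketch below is illustrative only and is not part of the commit. It assumes that "words" are whitespace-separated units and that text is NFC-normalized before counting; the README does not say how words were counted. `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and `fertility` and `relative_reduction` are names introduced here for illustration.

```python
import unicodedata
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        # NFC folds sequences such as 'z' + U+0323 into the single code point 'ẓ'.
        text = unicodedata.normalize("NFC", sentence)
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("corpus contains no words")
    return n_tokens / n_words


def relative_reduction(new: float, baseline: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per word, trailing period split off."""
    tokens: List[str] = []
    for word in text.split():
        if len(word) > 1 and word.endswith("."):
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    sample = ["aɣ-d-yini tameddurt-nneɣ deg taddart."]
    print(f"toy fertility: {fertility(sample, toy_tokenize):.3f} tokens/word")
    # Fertility values quoted in the README's comparison section.
    print(f"relative reduction: {relative_reduction(1.376, 1.593):.1%}")
```

To measure the published tokenizer instead, `tokenizer.tokenize` from the README's usage snippet can be passed as the `tokenize` argument, together with a held-out Kabyle corpus.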