boffire/bpe-tokenizer-for-kabyle

	---
	language:
	- kab
	- ber
	license: mit
	tags:
	- tokenizer
	- kabyle
	- berber
	- bpe
	- low-resource
	- amazigh
	- tamaziɣt
	- taqbaylit
	---

	# BPE Tokenizer for Kabyle

	A Byte-Pair Encoding (BPE) tokenizer optimized for the Kabyle language (Taqbaylit), designed for low-resource NLP and LLM training.

	## Key Specifications
	- Vocabulary Size: 48,011 tokens
	- Fertility: 1.376 tokens/word
	- Corpus: ~19M characters of cleaned Kabyle text
	- Encoding Speed: ~34,000 sentences/second (CPU)
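
Fertility here is read as the average number of tokens produced per word. The sketch below shows one way to compute it, assuming words are counted by whitespace splitting (the card does not state how words were counted). `toy_tokenize` is a hypothetical stand-in that cuts words into chunks of at most four characters; it is not this tokenizer.

```python
def toy_tokenize(sentence, max_len=4):
    """Hypothetical stand-in tokenizer: chunk each word into pieces of max_len characters."""
    tokens = []
    for word in sentence.split():
        tokens.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return tokens


def fertility(sentences, tokenize):
    """Average tokens per whitespace-delimited word over a list of sentences."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("no words found in the input")
    return total_tokens / total_words


# "deg" -> 1 chunk, "taddart" -> 2 chunks: 3 tokens over 2 words
score = fertility(["deg taddart"], toy_tokenize)
print(score)
```

To measure a real tokenizer, pass its own tokenize function in place of `toy_tokenize`. Lower values mean shorter token sequences for the same text.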

	## Usage
	```python
	from transformers import AutoTokenizer

	# Load the tokenizer from the Hugging Face Hub
	tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle")

	# Split a Kabyle sentence into tokens
	tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.")
	print(tokens)
	# ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.']
	```

	## Performance Comparison
	Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer):

	- Sequence Efficiency: about 13.6% lower fertility (1.376 vs 1.593 tokens/word)
	- Unicode Handling: Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback
	- Morphological Awareness: High-frequency prefixes, clitics, and compounds preserved as single tokens
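
The relative difference between the two fertility figures depends on which value is taken as the base. The short calculation below works it out both ways, using only the two numbers quoted above; the variable names are illustrative.

```python
ours = 1.376      # tokens/word, this tokenizer (from the card)
baseline = 1.593  # tokens/word, comparison tokenizer (from the card)

# Reduction measured against the baseline's fertility
reduction_vs_baseline = (baseline - ours) / baseline

# The same gap expressed as how much more the baseline emits than this tokenizer
increase_vs_ours = (baseline - ours) / ours

print(f"Reduction relative to baseline: {reduction_vs_baseline:.1%}")
print(f"Baseline excess relative to this tokenizer: {increase_vs_ours:.1%}")

# Tokens saved per 1,000 words of text at these fertility rates
saved_per_1000_words = round((baseline - ours) * 1000)
print(saved_per_1000_words)
```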

	## Training Details
	Trained with the Hugging Face tokenizers library on a deduplicated, NFC-normalized Kabyle corpus. After training, suboptimal BPE splits on high-frequency words were identified and corrected by adding those words to the vocabulary as fixed tokens. This keeps their tokenization consistent without corrupting the BPE merge graph.
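
As an illustration of the corpus preparation mentioned above, here is a minimal sketch of NFC normalization followed by exact-line deduplication, using only the Python standard library. The function name and the sample strings are hypothetical; the actual cleaning pipeline is not described in this card.

```python
import unicodedata


def normalize_and_dedupe(lines):
    """NFC-normalize each line, strip outer whitespace, and drop empty or repeated lines."""
    seen = set()
    result = []
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        if text and text not in seen:
            seen.add(text)
            result.append(text)
    return result


raw = [
    "az\u0323ru",      # 'z' followed by a combining dot below (decomposed form)
    "a\u1e93ru",       # precomposed 'ẓ' (already NFC)
    "  a\u1e93ru  ",   # same text with surrounding whitespace
    "",                # empty line
]
cleaned = normalize_and_dedupe(raw)
print(cleaned)
```

Normalizing before deduplicating matters because the decomposed and precomposed spellings look identical on screen but compare as different strings until they are brought to the same form.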

	## Limitations

	- Compound token overrides are exact-match. Unseen morphological variants may still split.
	- Optimized specifically for Kabyle. Performance may degrade on multilingual or code-switched text.
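
The exact-match limitation can be pictured with a toy lookup. In this sketch the override set and the hyphen-based fallback are both hypothetical: the fallback is a stand-in for BPE splitting, not the tokenizer's real merge rules, and the card does not list its actual fixed tokens.

```python
import re

# Hypothetical override list for illustration only
FIXED_TOKENS = {"tameddurt-nneɣ"}


def fallback_split(word):
    """Stand-in for BPE: split on hyphens, keeping each hyphen as its own token."""
    return re.findall(r"[^-]+|-", word)


def tokenize_word(word, fixed_tokens=FIXED_TOKENS):
    """Return the word whole if it exactly matches an override, else fall back to splitting."""
    if word in fixed_tokens:
        return [word]
    return fallback_split(word)


seen_form = tokenize_word("tameddurt-nneɣ")    # exact match: kept as one token
unseen_form = tokenize_word("tameddurt-nsen")  # variant not in the set: split
print(seen_form)
print(unseen_form)
```

Because the lookup compares whole strings, a variant that differs by a single affix misses the override and goes through the ordinary splitting path.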

	## License & Citation

	- License: MIT