BPE Tokenizer for Kabyle

A high-performance Byte-Pair Encoding tokenizer optimized for the Kabyle language (Taqbaylit). Designed for low-resource NLP and LLM training.

Key Specifications

  • Vocabulary Size: 48,011 tokens
  • Fertility: 1.376 tokens/word
  • Corpus: ~19M characters of cleaned Kabyle text
  • Encoding Speed: ~34,000 sentences/second (CPU)
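Fertility here means the average number of tokens produced per word. The sketch below shows one way to compute it; it assumes words are whitespace-separated, which may differ from how the figure above was measured. `fertility` and `toy_tokenize` are illustrative names, not part of this repository.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): cuts each word after 3 characters."""
    pieces = []
    for word in text.split():
        pieces.extend(p for p in (word[:3], word[3:]) if p)
    return pieces


# "deg" stays whole, "taddart" becomes two pieces: 3 tokens over 2 words.
print(fertility(["deg taddart"], toy_tokenize))
```

To measure the real tokenizer, pass `tokenizer.tokenize` in place of `toy_tokenize` and run it over a held-out sample of Kabyle sentences.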

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle")
tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.")
# ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.']

Performance Comparison

Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer):

  • Sequence Efficiency: 13.6% lower fertility (1.376 vs 1.593 tokens/word)
  • Unicode Handling: Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback
  • Morphological Awareness: High-frequency prefixes, clitics, and compounds preserved as single tokens
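One way to spot-check the Unicode claim is to confirm that each Kabyle character appears in the vocabulary as its own token. This is a minimal sketch: `chars_missing_from_vocab` is a hypothetical helper, and it assumes the vocabulary stores characters literally in NFC form (a byte-level vocabulary would not).

```python
import unicodedata
from typing import Dict, Iterable, List

# Characters listed above, normalized to NFC so each is a single code point.
KABYLE_CHARS = unicodedata.normalize("NFC", "ɛẓṛṭṣḍǧḥɣč")


def chars_missing_from_vocab(
    vocab: Dict[str, int], chars: Iterable[str] = KABYLE_CHARS
) -> List[str]:
    """Return the characters that have no single-character token in vocab."""
    return [c for c in chars if c not in vocab]


# Toy vocabulary for illustration only.
toy_vocab = {"ɣ": 0, "deg": 1}
print(chars_missing_from_vocab(toy_vocab, "ɣč"))
```

With the real tokenizer, `tokenizer.get_vocab()` returns the token-to-id mapping to pass as `vocab`; an empty result would be consistent with the claim above.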

Training Details

Trained with the Hugging Face tokenizers library on a deduplicated, NFC-normalized Kabyle corpus. Post-training analysis identified suboptimal BPE splits on high-frequency words; these were corrected by adding the affected words to the vocabulary as fixed tokens. This keeps their tokenization consistent without corrupting the BPE merge graph.
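The corpus preparation step can be sketched with the standard library alone. The actual cleaning pipeline is not described here, so this assumes line-level exact deduplication after NFC normalization; `clean_corpus` is an illustrative name.

```python
import unicodedata
from typing import Iterable, List


def clean_corpus(lines: Iterable[str]) -> List[str]:
    """NFC-normalize each line, drop blanks, and keep the first copy of duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# "z" + combining dot below (U+0323) and precomposed U+1E93 become the same
# string after NFC, so the second line is treated as a duplicate of the first.
raw = ["a" + "z\u0323", "a\u1e93", "   ", "deg"]
print(clean_corpus(raw))
```

Normalizing before deduplicating matters: without it, visually identical lines that differ only in how diacritics are encoded would both survive and skew merge frequencies.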

Limitations

  • Compound token overrides are exact-match. Unseen morphological variants may still split.
  • Optimized specifically for Kabyle. Performance may degrade on multilingual or code-switched text.
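The first limitation can be illustrated with a toy model of exact-match overrides. This is not the tokenizer's implementation: it assumes whole-word matching, and `split_on_hyphen` is a placeholder for the learned BPE merges. The variant `tameddurt-nsen` is used only as an example of a form missing from the override list.

```python
import re
from typing import Callable, List, Set


def split_on_hyphen(word: str) -> List[str]:
    """Placeholder fallback (assumption): split around hyphens, keeping them."""
    return [piece for piece in re.split(r"(-)", word) if piece]


def tokenize_with_overrides(
    text: str, overrides: Set[str], fallback: Callable[[str], List[str]]
) -> List[str]:
    """Keep a word whole only if it exactly matches an override entry."""
    tokens = []
    for word in text.split():
        if word in overrides:
            tokens.append(word)
        else:
            tokens.extend(fallback(word))
    return tokens


overrides = {"tameddurt-nneɣ"}
print(tokenize_with_overrides("tameddurt-nneɣ", overrides, split_on_hyphen))
print(tokenize_with_overrides("tameddurt-nsen", overrides, split_on_hyphen))
```

The first call keeps the word as one token; the second falls through to the fallback and splits, because only the exact surface form was registered.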

License & Citation

  • License: MIT