boffire committed on
Commit
4f8f92e
·
verified ·
1 Parent(s): 08edfb9

Update README.md

  1. README.md +48 -1
README.md CHANGED
@@ -1,3 +1,50 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ language: kab
3
+ license: mit
4
+ tags:
5
+ - tokenizer
6
+ - kabyle
7
+ - berber
8
+ - bpe
9
+ - low-resource
10
+ - amazigh
11
+ - taqbaylit
12
  ---
13
+
14
+ # BPE Tokenizer for Kabyle
15
+
16
+ A Byte-Pair Encoding (BPE) tokenizer optimized for the Kabyle language (Taqbaylit), designed for low-resource NLP and LLM training.
17
+
18
+ ## Key Specifications
19
+ - **Vocabulary Size:** 48,011 tokens
20
+ - **Fertility:** 1.376 tokens/word (average number of tokens produced per word; lower means shorter sequences)
21
+ - **Corpus:** ~19M characters of cleaned Kabyle text
22
+ - **Encoding Speed:** ~34,000 sentences/second (CPU)
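Fertility can be measured as total tokens divided by total words. The sketch below is a minimal illustration, not the script used to produce the figure above: it assumes words are whitespace-delimited, and both `fertility` and `toy_tokenize` are hypothetical names introduced here. The stand-in tokenizer exists only so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for illustration: splits each word into chunks of up to 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["deg taddart", "tameddurt"]
print(round(fertility(sample, toy_tokenize), 3))
```

To measure a real tokenizer, pass its `tokenize` method in place of `toy_tokenize` and use a held-out sample of Kabyle text.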
23
+
24
+ ## Usage
25
+ ```python
26
+ from transformers import AutoTokenizer
27
+
28
+ tokenizer = AutoTokenizer.from_pretrained("boffire/bpe-tokenizer-for-kabyle")
29
+ tokens = tokenizer.tokenize("aɣ-d-yini tameddurt-nneɣ deg taddart.")
30
+ print(tokens)  # ['aɣ-d-yini', 'tameddurt-nneɣ', 'deg', 'taddart', '.']
31
+ ```
32
+
33
+ ## Performance Comparison
34
+ Benchmarked against the primary public Kabyle tokenizer (Hillal-titouh/kabyle-bpe-tokenizer):
35
+
36
+ - **Sequence Efficiency:** about 13.6% lower fertility (1.376 vs 1.593 tokens/word)
37
+ - **Unicode Handling:** Native support for Kabyle characters (ɛ, ẓ, ṛ, ṭ, ṣ, ḍ, ǧ, ḥ, ɣ, č) without byte-level fallback
38
+ - **Morphological Awareness:** High-frequency prefixes, clitics, and compounds preserved as single tokens
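The relative difference in fertility can be recomputed directly from the two tokens/word figures quoted in the list above. This is a small arithmetic sketch; `relative_reduction` is a hypothetical helper name, not part of any library.

```python
def relative_reduction(new: float, baseline: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100


# Fertility figures (tokens/word) taken from the comparison above.
this_tokenizer, reference_tokenizer = 1.376, 1.593
print(f"{relative_reduction(this_tokenizer, reference_tokenizer):.1f}% lower fertility")
```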
39
+
40
+ ## Training Details
41
+ Trained using the Hugging Face tokenizers library on a deduplicated, NFC-normalized Kabyle corpus. Post-training analysis identified suboptimal BPE splits on high-frequency words; these were corrected by adding the affected words to the vocabulary as fixed tokens. This keeps tokenization consistent without corrupting the BPE merge graph.
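The general pattern described above (train BPE on NFC-normalized text, then register selected words as fixed tokens) might look roughly like the following with the `tokenizers` library. This is a sketch under stated assumptions, not the actual training script: the three-sentence corpus, the `Whitespace` pre-tokenizer, the vocabulary size, and the `[UNK]` token are all placeholders chosen for illustration.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Toy in-memory corpus (assumption: stands in for the real Kabyle corpus).
corpus = [
    "aɣ-d-yini tameddurt-nneɣ deg taddart.",
    "tameddurt-nneɣ deg taddart.",
    "deg taddart.",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()  # NFC normalization, as described above
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption: actual pre-tokenizer is not documented

trainer = trainers.BpeTrainer(
    vocab_size=100,  # placeholder; far smaller than the real vocabulary
    special_tokens=["[UNK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

text = "tameddurt-nneɣ deg taddart."
before = tokenizer.encode(text).tokens

# Register the compound as a fixed token; the learned merges are left untouched.
added = tokenizer.add_tokens(["tameddurt-nneɣ"])
after = tokenizer.encode(text).tokens

print(before)  # compound split at the hyphens by the pre-tokenizer
print(after)   # compound kept as a single token
```

Because added tokens are matched as exact strings, a variant that was not registered is still tokenized by the ordinary BPE path.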
42
+
43
+ ## Limitations
44
+
45
+ - Compound token overrides are exact-match, so unseen morphological variants may still be split.
46
+ - Optimized specifically for Kabyle; performance may degrade on multilingual or code-switched text.
47
+
48
+ ## License & Citation
49
+
50
+ - **License:** MIT