boffire committed on
Commit
a4d6921
·
verified ·
1 Parent(s): 4941498

Update README.md

Files changed (1)
  1. README.md +129 -1
README.md CHANGED
@@ -1,3 +1,131 @@
1
  ---
2
- license: mit
3
  ---
1
  ---
2
+ language:
3
+ - kab
4
+ - ber
5
+ license: apache-2.0
6
+ tags:
7
+ - tokenizer
8
+ - bpe
9
+ - kabyle
10
+ - taqbaylit
11
+ - tamazight
12
+ - nlp
13
+ library_name: tokenizers
14
+ pipeline_tag: text-generation
15
  ---
16
+
17
+ # Kabyle BPE Tokenizer v2
18
+
19
+ A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.
20
+
21
+ ## Overview
22
+
23
+ | Property | Value |
24
+ |---|---|
25
+ | Language | Kabyle / Taqbaylit (`kab`) |
26
+ | Script | Latin (Mammeri orthography) |
27
+ | Algorithm | Byte-Pair Encoding (BPE) |
28
+ | Vocab size | 48,000 |
29
+ | Normalizer | NFC + Lowercase |
30
+ | Pre-tokenizer | Metaspace + Punctuation split |
31
+ | Model max length | 512 |
32
+
33
+ ## Usage
34
+
35
+ ```python
36
+ from transformers import AutoTokenizer
37
+
38
+ tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")
39
+
40
+ text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
41
+ ids = tok.encode(text)
42
+ tokens = tok.convert_ids_to_tokens(ids)
43
+ print(tokens)
44
+ # ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
45
+ # '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']
46
+ ```
47
+
48
+ ## Design Decisions
49
+
50
+ ### 1. NFC Unicode Normalization
51
+ Kabyle letters that carry a diacritic (ẓ, ḥ, č, ǧ…) can be represented in Unicode in two ways: as a single precomposed code point, or as a base letter followed by a combining diacritic. (Letters such as ɣ and ɛ have only one form.) Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization so that both encodings map to the same tokens.
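
To make the two encodings concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The helper name `nfc_lower` is hypothetical; it mirrors the "NFC + Lowercase" normalizer listed in the overview but is not the tokenizer's own code.

```python
import unicodedata

precomposed = "\u1e93"   # ẓ as one code point (LATIN SMALL LETTER Z WITH DOT BELOW)
decomposed = "z\u0323"   # z followed by COMBINING DOT BELOW

# The two strings render identically but are different code point sequences.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2


def nfc_lower(text: str) -> str:
    """Hypothetical stand-in for an 'NFC + Lowercase' normalizer."""
    return unicodedata.normalize("NFC", text).lower()


word_a = "Tame\u1e93\u1e93yant"      # precomposed ẓ
word_b = "Tamez\u0323z\u0323yant"    # decomposed ẓ

# After normalization both spellings are the same string.
print(nfc_lower(word_a) == nfc_lower(word_b))  # True
```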
52
+
53
+ ### 2. Metaspace Pre-tokenizer
54
+ Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as `▁`, and every Unicode character, including the Kabyle-specific extended Latin letters, remains a single atomic unit. This avoids the byte-level token strings (`áºĵ` for `ẓ`, `ÉĽ` for `ɛ`) seen in byte-level tokenizers, where one Kabyle letter is stored as two or three byte symbols.
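
The strings `áºĵ` and `ÉĽ` come from the byte-to-character table that GPT-2-style byte-level tokenizers use to display raw UTF-8 bytes. The sketch below re-creates that table in plain Python for illustration only; the function names are hypothetical and this is not code from either tokenizer.

```python
import unicodedata


def bytes_to_unicode() -> dict:
    """Re-create the GPT-2 style byte -> printable character table."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to an unused code point starting at 256.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TABLE = bytes_to_unicode()


def byte_level_view(text: str) -> str:
    """Show how a byte-level tokenizer would spell `text` in its vocabulary."""
    data = unicodedata.normalize("NFC", text).encode("utf-8")
    return "".join(BYTE_TABLE[b] for b in data)


print(byte_level_view("\u1e93"))  # ẓ (3 UTF-8 bytes) -> áºĵ
print(byte_level_view("\u025b"))  # ɛ (2 UTF-8 bytes) -> ÉĽ
```

With a Metaspace pre-tokenizer the same letters stay as `ẓ` and `ɛ`, so vocabulary entries remain readable.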
55
+
56
+ ### 3. Hyphen Preserved as Morphological Marker
57
+ In Kabyle, the hyphen (`-`) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: `yewwi-t-id`), directional clitics (`yeffeɣ-d`), and possessive suffixes (`tamurt-nneɣ`). The punctuation splitter explicitly excludes `-` so these forms are pre-tokenized as single units and BPE can learn them correctly.
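
The effect of excluding `-` from the punctuation split can be sketched with a regular expression. This is an approximation under stated assumptions: the pattern `PIECE` and the function `pre_tokenize` are hypothetical, the `▁` space marker is omitted, and the tokenizer's real pre-tokenizer configuration is not reproduced here.

```python
import re
import unicodedata

# Hypothetical pattern: either a run of word characters and hyphens
# (kept together), or a single other non-space symbol (isolated).
PIECE = re.compile(r"[\w-]+|[^\w\s]")


def pre_tokenize(text: str) -> list:
    """Split into word-level units, keeping hyphenated clitic chains whole."""
    text = unicodedata.normalize("NFC", text).lower()
    return PIECE.findall(text)


print(pre_tokenize("Yewwi-t-id ukan i gma-s."))
# ['yewwi-t-id', 'ukan', 'i', 'gma-s', '.']
```

Because `yewwi-t-id` reaches the BPE stage as one unit, merges can form across the hyphen, while the final `.` is still isolated.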
58
+
59
+ ### 4. CLDR-Grounded Alphabet
60
+ The `initial_alphabet` is derived exclusively from the [Unicode CLDR `kab.xml`](https://github.com/unicode-org/cldr/blob/main/common/main/kab.xml) exemplar character set:
61
+
62
+ **Core:** `a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ`
63
+
64
+ **Auxiliary:** `o v`
65
+
66
+ Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.
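
As a small sanity check, the alphabet listed above can be verified to consist of single code points once NFC is applied, which is what allows each letter to be one atomic token. The variable names are illustrative; the actual training script is not shown in this README. A list like this could be passed to the `initial_alphabet` parameter of the `tokenizers` library's `BpeTrainer`.

```python
import unicodedata

CORE = "a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ".split()
AUXILIARY = "o v".split()

# Normalize every letter so that it matches what the normalizer will emit.
initial_alphabet = [unicodedata.normalize("NFC", ch) for ch in CORE + AUXILIARY]

# Any entry longer than one code point could not be a single base symbol.
multi_codepoint = [ch for ch in initial_alphabet if len(ch) != 1]

print(len(initial_alphabet))  # 36
print(multi_codepoint)        # []
```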
67
+
68
+ ### 5. Distinct Special Tokens
69
+ Each special token has exactly one role:
70
+
71
+ | Token | Role |
72
+ |---|---|
73
+ | `<s>` | Beginning of sequence |
74
+ | `</s>` | End of sequence |
75
+ | `<unk>` | Unknown token |
76
+ | `<pad>` | Padding |
77
+ | `<mask>` | Masked token (for MLM) |
78
+
79
+ ## Benchmark
80
+
81
+ Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):
82
+
83
+ | Tokenizer | Vocab | Avg fertility ↓ | Design |
84
+ |---|---|---|---|
85
+ | `boffire/kabyle-gpt2-tokenizer` | 50,257 | 2.058 | ByteLevel BPE, broken Kabyle chars |
86
+ | `Hillal-titouh/kabyle-bpe-tokenizer` | 50,257 | 1.887 | No NFC, no mask token |
87
+ | **`boffire/Kabyle-BPE-Tokenizer-v2`** | **48,000** | **1.736** | ✅ NFC, CLDR alphabet, clean punctuation |
88
+ | `boffire/bpe-tokenizer-for-kabyle` | 48,000 | 1.476 | Absorbs punctuation into words |
89
+
90
+ > **Note on `boffire/bpe-tokenizer-for-kabyle` fertility:** Its lower fertility is structural: it absorbs punctuation marks into word tokens (e.g. `berra.` as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. The v2 tokenizer described here separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.
91
+
92
+ ### Sample tokenization
93
+
94
+ ```
95
+ Yewwi-t-id ukan i gma-s.
96
+ → ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .
97
+
98
+ Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
99
+ → ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | .
100
+
101
+ Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
102
+ → ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .
103
+ ```
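
Fertility is commonly computed as the number of tokens divided by the number of whitespace-delimited words. The sketch below applies that definition to the three sample sentences above. The exact counting convention used for the benchmark table is an assumption here, and three sentences will not reproduce the 10-sentence average.

```python
def fertility(samples) -> float:
    """Average tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = sum(len(tokens) for _, tokens in samples)
    total_words = sum(len(text.split()) for text, _ in samples)
    return total_tokens / total_words


# (sentence, tokens) pairs copied from the sample tokenization above.
samples = [
    ("Yewwi-t-id ukan i gma-s.",
     ["▁yewwi-", "t-", "id", "▁ukan", "▁i", "▁gma-", "s", "."]),
    ("Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.",
     ["▁ur", "▁iẓri", "▁ara", "▁acu", "▁i", "▁d", "-", "yu", "ɣen",
      "▁ɣer", "▁taddart-", "n", "neɣ", "."]),
    ("Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.",
     ["▁yessekcem-", "it", "▁deg", "▁taddart", "▁imi", "▁yeffeɣ-", "d",
      "▁ɣer", "▁berra", "."]),
]

# 32 tokens over 19 words for these three sentences.
print(round(fertility(samples), 3))  # 1.684
```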
104
+
105
+ ## Training Data
106
+
107
+ Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.
108
+
109
+ ## Limitations
110
+
111
+ - **Lowercase only.** The normalizer lowercases all input. Kabyle capitalization carries no morphological information, so the loss is minor for most NLP tasks, but the original casing (sentence-initial capitals and proper nouns such as `Leqbayel`) cannot be recovered from token IDs.
112
+ - **Corpus size.** At ~20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix `-nneɣ` after varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus should reduce this.
113
+ - **No Tifinagh or Arabic script support.** This tokenizer covers the Latin orthography only.
114
+
115
+ ## Citation
116
+
117
+ If you use this tokenizer in your work, please cite it as:
118
+
119
+ ```bibtex
120
+ @misc{boffire2025kabyle,
121
+ author = {boffire},
122
+ title = {Kabyle BPE Tokenizer v2},
123
+ year = {2025},
124
+ publisher = {Hugging Face},
125
+ url = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
126
+ }
127
+ ```
128
+
129
+ ## License
130
+
131
+ [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)