---
language:
- kab
- ber
license: apache-2.0
tags:
- tokenizer
- bpe
- kabyle
- taqbaylit
- tamazight
- nlp
library_name: tokenizers
pipeline_tag: text-generation
---

# Kabyle BPE Tokenizer v2

A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.

## Overview

| Property | Value |
|---|---|
| Language | Kabyle / Taqbaylit (`kab`) |
| Script | Latin (Mammeri orthography) |
| Algorithm | Byte-Pair Encoding (BPE) |
| Vocab size | 48,000 |
| Normalizer | NFC + Lowercase |
| Pre-tokenizer | Metaspace + Punctuation split |
| Model max length | 512 |

## Usage

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")

text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
ids = tok.encode(text)
tokens = tok.convert_ids_to_tokens(ids)
print(tokens)
# ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
#  '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']
```

## Design Decisions

### 1. NFC Unicode Normalization
Kabyle letters that carry a diacritic (ẓ, ḥ, ḍ, č…) can be represented in Unicode in two ways: as a single precomposed codepoint, or as a base character followed by a combining diacritic. (Letters such as ɣ and ɛ have only one form.) Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization so that both forms map to the same precomposed codepoints.
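
As a minimal illustration using only Python's standard library (this is not the tokenizer's own code, and the `normalize` helper is a hypothetical name), the snippet below shows two spellings of the same word that differ at the codepoint level and become identical after NFC plus lowercasing, the order listed in the overview table:

```python
import unicodedata

# "ẓ" written two ways: precomposed U+1E93, or "z" + combining dot below (U+0323).
precomposed = "\u1e93"
decomposed = "z\u0323"

print(precomposed == decomposed)          # False: different codepoint sequences
print(len(precomposed), len(decomposed))  # 1 2


def normalize(text: str) -> str:
    """Hypothetical helper: NFC first, then lowercase."""
    return unicodedata.normalize("NFC", text).lower()


word_a = "Tame\u1e93\u1e93yant"      # precomposed ẓ
word_b = "Tamez\u0323z\u0323yant"    # decomposed ẓ
print(word_a == word_b)                        # False
print(normalize(word_a) == normalize(word_b))  # True
```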

### 2. Metaspace Pre-tokenizer
Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as `▁`, and all Unicode characters, including Kabyle-specific extended Latin letters, remain single atomic units. This avoids the byte-level garbling (`áºĵ` for `ẓ`, `ÉĽ` for `ɛ`) that appears in the token strings of byte-level tokenizers, where each UTF-8 byte is mapped to its own placeholder character.
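
To see where those garbled strings come from, here is a small sketch that rebuilds a byte-to-character table in the style used by GPT-2 byte-level BPE. The function names are illustrative, and the table is a reimplementation for demonstration rather than code taken from any of the tokenizers mentioned here:

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = (
        list(range(0x21, 0x7F))    # "!" .. "~"
        + list(range(0xA1, 0xAD))  # "¡" .. "¬"
        + list(range(0xAE, 0x100)) # "®" .. "ÿ"
    )
    mapping = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in mapping:
            # Non-printable bytes are shifted to codepoints starting at U+0100.
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TABLE = bytes_to_unicode()


def byte_level_view(text: str) -> str:
    """Show how a byte-level tokenizer would display the UTF-8 bytes of `text`."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))


print(byte_level_view("\u1e93"))  # ẓ (3 UTF-8 bytes) -> áºĵ
print(byte_level_view("\u025b"))  # ɛ (2 UTF-8 bytes) -> ÉĽ
```

With Metaspace, by contrast, `ẓ` and `ɛ` stay as themselves in the token strings.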

### 3. Hyphen Preserved as Morphological Marker
In Kabyle, the hyphen (`-`) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: `yewwi-t-id`), directional clitics (`yeffeɣ-d`), and possessive suffixes (`tamurt-nneɣ`). The punctuation splitter explicitly excludes `-` so these forms are pre-tokenized as single units and BPE can learn them correctly.
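
The effect can be approximated with a standard-library sketch. The punctuation set, the `PUNCT` pattern, and the `pre_tokenize` helper are assumptions made for illustration; the actual pre-tokenizer configuration shipped with this tokenizer may differ in detail:

```python
import re

# Punctuation to isolate. The hyphen is deliberately absent from this class,
# so clitic chains such as "yewwi-t-id" are never broken at this stage.
PUNCT = re.compile(r'([.,;:!?"()\[\]«»…])')


def pre_tokenize(text: str) -> list[str]:
    """Lowercase, mark word starts with U+2581, and isolate punctuation."""
    pieces: list[str] = []
    for word in text.lower().split():
        marked = "\u2581" + word
        # The capturing group keeps each punctuation mark as its own piece.
        pieces.extend(p for p in PUNCT.split(marked) if p)
    return pieces


print(pre_tokenize("Yewwi-t-id ukan i gma-s."))
# ['▁yewwi-t-id', '▁ukan', '▁i', '▁gma-s', '.']
```

BPE merges then operate inside each of these pieces, which is why a hyphenated form can still end up as several tokens if its merges were not learned.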

### 4. CLDR-Grounded Alphabet
The `initial_alphabet` is derived exclusively from the [Unicode CLDR `kab.xml`](https://github.com/unicode-org/cldr/blob/main/common/main/kab.xml) exemplar character set:

**Core:** `a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ`

**Auxiliary:** `o v`

Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.
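
A short sketch of how such an alphabet list could be assembled and sanity-checked before training. The variable names are illustrative; the resulting list is the kind of value one would pass as the `initial_alphabet` argument of `tokenizers.trainers.BpeTrainer`, though the training script itself is not part of this card:

```python
import unicodedata

CORE = "a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ".split()
AUXILIARY = "o v".split()

# Normalize to NFC so every letter is a single precomposed codepoint,
# matching what the normalizer produces at tokenization time.
initial_alphabet = [unicodedata.normalize("NFC", ch) for ch in CORE + AUXILIARY]

assert all(len(ch) == 1 for ch in initial_alphabet)       # one codepoint each
assert len(set(initial_alphabet)) == len(initial_alphabet)  # no duplicates
print(len(initial_alphabet))  # 36
```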

### 5. Distinct Special Tokens
Each special token has exactly one role:

| Token | Role |
|---|---|
| `<s>` | Beginning of sequence |
| `</s>` | End of sequence |
| `<unk>` | Unknown token |
| `<pad>` | Padding |
| `<mask>` | Masked token (for MLM) |

## Benchmark

Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):

| Tokenizer | Vocab | Avg fertility ↓ | Design |
|---|---|---|---|
| `boffire/kabyle-gpt2-tokenizer` | 50,257 | 2.058 | ByteLevel BPE, broken Kabyle chars |
| `Hillal-titouh/kabyle-bpe-tokenizer` | 50,257 | 1.887 | No NFC, no mask token |
| **`boffire/Kabyle-BPE-Tokenizer-v2`** | **48,000** | **1.736** | ✅ NFC, CLDR alphabet, clean punctuation |
| `boffire/bpe-tokenizer-for-kabyle` | 48,000 | 1.476 | Absorbs punctuation into words |

> **Note on `boffire/bpe-tokenizer-for-kabyle` fertility:** Its lower fertility is structural: it absorbs punctuation marks into word tokens (e.g. `berra.` as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. The v2 tokenizer separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.

### Sample tokenization

```
Yewwi-t-id ukan i gma-s.
→ ▁yewwi- | t- | id | ▁ukan | ▁i | ▁gma- | s | .

Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
→ ▁ur | ▁iẓri | ▁ara | ▁acu | ▁i | ▁d | - | yu | ɣen | ▁ɣer | ▁taddart- | n | neɣ | .

Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
→ ▁yessekcem- | it | ▁deg | ▁taddart | ▁imi | ▁yeffeɣ- | d | ▁ɣer | ▁berra | .
```
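
For reference, fertility can be computed from samples like these. The sketch below assumes fertility means total tokens divided by total whitespace-separated words, pooled over all sentences; the card does not spell out the exact definition used for the benchmark, and these three sentences are only a subset of the ten benchmark sentences, so the result is not expected to match the table:

```python
# (sentence, tokens) pairs copied from the sample tokenization above.
SAMPLES = [
    ("Yewwi-t-id ukan i gma-s.",
     ["▁yewwi-", "t-", "id", "▁ukan", "▁i", "▁gma-", "s", "."]),
    ("Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.",
     ["▁ur", "▁iẓri", "▁ara", "▁acu", "▁i", "▁d", "-", "yu", "ɣen",
      "▁ɣer", "▁taddart-", "n", "neɣ", "."]),
    ("Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.",
     ["▁yessekcem-", "it", "▁deg", "▁taddart", "▁imi", "▁yeffeɣ-", "d",
      "▁ɣer", "▁berra", "."]),
]


def fertility(samples: list[tuple[str, list[str]]]) -> float:
    """Tokens per whitespace-separated word, pooled over all samples."""
    total_tokens = sum(len(tokens) for _, tokens in samples)
    total_words = sum(len(sentence.split()) for sentence, _ in samples)
    return total_tokens / total_words


print(round(fertility(SAMPLES), 3))  # 32 tokens / 19 words -> 1.684
```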

## Training Data

Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.

## Limitations

- **Lowercase only.** The normalizer lowercases all input. Kabyle capitalization is positional (sentence-initial) and carries no morphological information, so lowercasing loses nothing morphological. It is still irreversible: the original casing cannot be recovered from token IDs.
- **Corpus size.** At ~20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix `-nneɣ` after varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus should improve this.
- **No Tifinagh or Arabic script support.** This tokenizer covers the Latin orthography only.

## Citation

If you use this tokenizer in your work, please cite it as:

```bibtex
@misc{boffire2026kabyle,
  author    = {boffire},
  title     = {Kabyle BPE Tokenizer v2},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
}
```

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)