Update README.md

51a4240 verified about 7 hours ago

5.63 kB

	---
	language:
	- kab
	- ber
	license: apache-2.0
	tags:
	- tokenizer
	- bpe
	- kabyle
	- taqbaylit
	- tamazight
	- nlp
	library_name: tokenizers
	pipeline_tag: text-generation
	---

	# Kabyle BPE Tokenizer v2

	A clean, production-ready BPE tokenizer for the Kabyle language (Taqbaylit), built from scratch with proper Unicode handling and a CLDR-grounded alphabet.

	## Overview

	\| Property \| Value \|
	\|---\|---\|
	\| Language \| Kabyle / Taqbaylit (`kab`) \|
	\| Script \| Latin (Mammeri orthography) \|
	\| Algorithm \| Byte-Pair Encoding (BPE) \|
	\| Vocab size \| 48,000 \|
	\| Normalizer \| NFC + Lowercase \|
	\| Pre-tokenizer \| Metaspace + Punctuation split \|
	\| Model max length \| 512 \|

	## Usage

	```python
	from transformers import AutoTokenizer

	tok = AutoTokenizer.from_pretrained("boffire/Kabyle-BPE-Tokenizer-v2")

	text = "Taqbaylit d tutlayt tameẓẓyant yellan deg tmurt n Leqbayel."
	ids = tok.encode(text)
	tokens = tok.convert_ids_to_tokens(ids)
	print(tokens)
	# ['▁taqbaylit', '▁d', '▁tutlayt', '▁tameẓẓyant', '▁yellan',
	# '▁deg', '▁tmurt', '▁n', '▁leqbayel', '.']
	```

	## Design Decisions

	### 1. NFC Unicode Normalization
	Kabyle special characters (ẓ, ɣ, ɛ, ḥ…) can be encoded in two ways in UTF-8: as a single precomposed codepoint, or as a base character + combining diacritic. Without normalization, the same word can produce different token sequences depending on how the source file was saved. NFC normalization is applied before tokenization to guarantee consistency.

	### 2. Metaspace Pre-tokenizer
	Instead of ByteLevel BPE (used by GPT-2-based tokenizers), this tokenizer uses a Metaspace pre-tokenizer. Word-initial spaces are marked as `▁`, and all Unicode characters — including Kabyle-specific extended Latin letters — remain as single atomic units. This eliminates the byte-level garbling (`áºĵ` for `ẓ`, `ÉĽ` for `ɛ`) seen in byte-level tokenizers.

	### 3. Hyphen Preserved as Morphological Marker
	In Kabyle, the hyphen (`-`) is a morphological connector, not punctuation. It marks clitic attachment (verbal object clitics: `yewwi-t-id`), directional clitics (`yeffeɣ-d`), and possessive suffixes (`tamurt-nneɣ`). The punctuation splitter explicitly excludes `-` so these forms are pre-tokenized as single units and BPE can learn them correctly.

	### 4. CLDR-Grounded Alphabet
	The `initial_alphabet` is derived exclusively from the [Unicode CLDR `kab.xml`](https://github.com/unicode-org/cldr/blob/main/common/main/kab.xml) exemplar character set:

	Core: `a b c č d ḍ e ɛ f g ǧ ɣ h ḥ i j k l m n p q r ṛ s ṣ t ṭ u w x y z ẓ`

	Auxiliary: `o v`

	Every character in this set is guaranteed to exist as an atomic token in the vocabulary, regardless of its frequency in the training corpus.

	### 5. Distinct Special Tokens
	Each special token has exactly one role:

	\| Token \| Role \|
	\|---\|---\|
	\| `<s>` \| Beginning of sequence \|
	\| `</s>` \| End of sequence \|
	\| `<unk>` \| Unknown token \|
	\| `<pad>` \| Padding \|
	\| `<mask>` \| Masked token (for MLM) \|

	## Benchmark

	Comparison against 3 other publicly available Kabyle tokenizers on 10 morphologically complex Kabyle sentences (verbal clitics, negation, construct state, pronominal suffixes, dense special characters, aspectual contrast):

	\| Tokenizer \| Vocab \| Avg fertility ↓ \| Design \|
	\|---\|---\|---\|---\|
	\| `boffire/kabyle-gpt2-tokenizer` \| 50,257 \| 2.058 \| ByteLevel BPE, broken Kabyle chars \|
	\| `Hillal-titouh/kabyle-bpe-tokenizer` \| 50,257 \| 1.887 \| No NFC, no mask token \|
	\| `boffire/kabyle-bpe-tokenizer-v2` \| 48,000 \| 1.736 \| ✅ NFC, CLDR alphabet, clean punctuation \|
	\| `boffire/bpe-tokenizer-for-kabyle` \| 48,000 \| 1.476 \| Absorbs punctuation into words \|

	> Note on BPE (boffire) fertility: Its lower fertility is structural — it absorbs punctuation marks into word tokens (e.g. `berra.` as a single token), which reduces the token count but makes punctuation boundaries invisible to the model. This tokenizer separates punctuation cleanly, which costs ~10 tokens across the benchmark but produces linguistically correct segmentation.

	### Sample tokenization

	```
	Yewwi-t-id ukan i gma-s.
	→ ▁yewwi- \| t- \| id \| ▁ukan \| ▁i \| ▁gma- \| s \| .

	Ur iẓri ara acu i d-yuɣen ɣer taddart-nneɣ.
	→ ▁ur \| ▁iẓri \| ▁ara \| ▁acu \| ▁i \| ▁d \| - \| yu \| ɣen \| ▁ɣer \| ▁taddart- \| n \| neɣ \| .

	Yessekcem-it deg taddart imi yeffeɣ-d ɣer berra.
	→ ▁yessekcem- \| it \| ▁deg \| ▁taddart \| ▁imi \| ▁yeffeɣ- \| d \| ▁ɣer \| ▁berra \| .
	```

	## Training Data

	Trained on a curated Kabyle text corpus (~20 MB, ~212,000 word-level units after pre-tokenization). The corpus covers written Kabyle in the standard Mammeri Latin orthography.

	## Limitations

	- Lowercase only. The normalizer lowercases all input. Kabyle capitalization is positional (sentence-initial) and carries no morphological information, so this is lossless for NLP tasks. Casing cannot be recovered from token IDs.
	- Corpus size. At 20 MB, some low-frequency morpheme combinations (e.g. the possessive suffix `-nneɣ` after varied stems) may not have been merged and will be split across 2–3 tokens. A larger corpus will improve this.
	- No Tifinagh or Arabic script support. This tokenizer covers the Latin orthography only.

	## Citation

	If you use this tokenizer in your work, please cite it as:

	```bibtex
	@misc{boffire2026kabyle,
	author = {boffire},
	title = {Kabyle BPE Tokenizer v2},
	year = {2026},
	publisher = {HuggingFace},
	url = {https://huggingface.co/boffire/Kabyle-BPE-Tokenizer-v2}
	}
	```

	## License

	[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)