	---
	language:
	- kab
	tags:
	- gpt2
	- causal-lm
	- custom-vocab
	license: mit
	datasets:
	- custom-kabyle-corpus
	metrics:
	- perplexity
	---

	# Kabyle GPT-2 Base Model (Optimized BPE)

	This is a lightweight GPT-2 style causal language model built from scratch for the Kabyle (Taqbaylit) language. It uses a custom byte-aware BPE subword tokenizer that keeps Latin-script Tamazight characters such as `ɛ` and `ẓ` intact in its token strings instead of showing them as raw byte symbols.

	## Model Highlights
	* Architecture: Custom 8-layer Transformer with 8 attention heads and a hidden size of 512, built from scratch.
	* Context Window: 256 tokens.
	* Vocabulary Size: 50,257 tokens.
	* Tokenizer Efficiency: 97.95% of the vocabulary is used on native Kabyle corpora, so nearly every embedding row receives training signal. This leaves few of the dead parameters that are common with large multilingual tokenizers.
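
To put the figures above in context, the following back-of-the-envelope sketch estimates the parameter count and the share taken by the token embedding table. It assumes a stock GPT-2 block layout, a 4x MLP expansion, learned position embeddings, and tied input/output embeddings; none of these are stated in this card, and the helper name `gpt2_param_counts` is illustrative only.

```python
# Figures taken from the Model Highlights list.
N_LAYER, N_HEAD, D_MODEL = 8, 8, 512
N_CTX, VOCAB = 256, 50_257
UTILIZATION = 0.9795

assert D_MODEL % N_HEAD == 0  # each head gets D_MODEL // N_HEAD = 64 dimensions


def gpt2_param_counts(n_layer, d_model, n_ctx, vocab, mlp_ratio=4):
    """Estimate parameter counts, ASSUMING the standard GPT-2 layout."""
    token_emb = vocab * d_model          # one row per vocabulary entry
    pos_emb = n_ctx * d_model            # learned position embeddings
    # Attention: fused QKV projection plus output projection, with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection and down-projection, with biases.
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    layer_norms = 2 * (2 * d_model)      # two LayerNorms per block (weight + bias)
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d_model
    total = token_emb + pos_emb + n_layer * per_layer + final_ln  # tied LM head
    return {"token_embedding": token_emb, "per_layer": per_layer, "total": total}


counts = gpt2_param_counts(N_LAYER, D_MODEL, N_CTX, VOCAB)
embedding_share = counts["token_embedding"] / counts["total"]
used_rows = round(UTILIZATION * VOCAB)

print(f"total parameters:      {counts['total']:,}")            # 51,082,752
print(f"token embedding share: {embedding_share:.1%}")          # 50.4%
print(f"embedding rows in use: {used_rows:,} of {VOCAB:,}")     # 49,227 of 50,257
```

Under these assumptions, roughly half of the parameters sit in the embedding table, which is why the utilization rate matters for a model of this size.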

	## Tokenizer Performance

	Our custom Byte-Pair Encoding (BPE) pipeline splits words along Kabyle affix boundaries. Standard byte-level tokenizers show non-ASCII letters as raw byte symbols in their token strings (e.g., `ÉĽ` for `ɛ`, `áºĵ` for `ẓ`), whereas this pipeline keeps those characters intact during inference:

	| Input Text Fragment | Standard Decoders (Noisy) | Our Native Pipeline (Clean) |
	| :--- | :--- | :--- |
	| ... yettɛawad ... | `['yett', 'ÉĽawad']` | `['yett', 'ɛawad']` |
	| ... iẓerfan ... | `['Ġiáºĵer', 'fan']` | `['Ġiẓer', 'fan']` |
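
The symbols in the middle column match what GPT-2 style byte-level BPE shows when token strings are printed without mapping bytes back to UTF-8. The sketch below rebuilds that byte-to-character table in plain Python and reproduces the strings from the table. It assumes the noise comes from this mapping; the function names `to_byte_symbols` and `from_byte_symbols` are illustrative and not part of this repository.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table used by GPT-2 style BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are shifted to code points starting at 256.
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text):
    """Show text the way a byte-level vocabulary stores it internally."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Map byte symbols back to readable UTF-8 text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


print(to_byte_symbols("ɛawad"))     # ÉĽawad
print(to_byte_symbols(" iẓer"))     # Ġiáºĵer  (Ġ stands for the leading space)
print(from_byte_symbols("ÉĽawad"))  # ɛawad
```

The mapping is lossless in both directions, so the noisy and clean columns describe the same underlying bytes; the difference is in how the token strings are presented.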

	## Quickstart Usage

	You can load this model and its accompanying optimized tokenizer directly into your PyTorch environment:

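	```python
	from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel

	# Load the custom tokenizer from this repository
	tokenizer = PreTrainedTokenizerFast.from_pretrained("boffire/kabyle-gpt2-tokenizer")
	# Placeholder model ID: replace it with the repository that hosts the model weights
	model = GPT2LMHeadModel.from_pretrained("your-username/kabyle-llm-base")

	# Quick inference test
	text = "Wa d amcic-is aberkan,"
	# GPT-2 does not need token_type_ids; this flag is harmless if the tokenizer already omits them
	inputs = tokenizer(text, return_tensors="pt", return_token_type_ids=False)
	# max_length counts the prompt tokens too; keep it within the 256-token context window
	outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=50)

	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```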

	## Training Data & Methodology

	The model was pre-trained on a cleaned and normalized Kabyle text corpus (~20 MB, 5.01M tokens in total).

	### Optimization Settings
	* Training Duration: 3 Epochs
	* Optimizer: AdamW
	* Learning Rate: `5e-4`
	* Batch Strategy: Dynamic batch padding to maximize hardware and VRAM efficiency.
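
As an illustration of the batch strategy, here is a minimal dependency-free sketch of dynamic padding: each batch is padded to its own longest sequence rather than to the full 256-token context. The training code is not published in this card, so `pad_batch` and the pad token ID `0` are hypothetical stand-ins, and a library collator may well have been used in practice.

```python
from typing import Dict, List

MAX_LEN = 256  # context window stated in the Model Highlights
PAD_ID = 0     # hypothetical pad token ID, not taken from the tokenizer


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_ID, max_len: int = MAX_LEN
) -> Dict[str, List[List[int]]]:
    """Pad a non-empty batch of token ID lists to the longest sequence in the batch."""
    truncated = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        gap = width - len(seq)
        input_ids.append(seq + [pad_id] * gap)
        # 1 marks real tokens, 0 marks padding to be ignored by attention.
        attention_mask.append([1] * len(seq) + [0] * gap)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9], [10]])
print(batch["input_ids"])       # [[5, 6, 7], [8, 9, 0], [10, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```

This toy batch occupies 9 positions instead of the 768 that padding every sequence to 256 tokens would need, which is where the memory saving comes from.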