---
language:
  - kab
tags:
  - gpt2
  - causal-lm
  - custom-vocab
license: mit
datasets:
  - custom-kabyle-corpus
metrics:
  - perplexity
---

# Kabyle GPT-2 Base Model (Optimized BPE)

This is a lightweight GPT-2 style causal language model trained from scratch for the Kabyle (Taqbaylit) language. It uses a custom subword tokenizer, trained with byte-aware rules, that preserves Latin-script Tamazight characters and affix boundaries without raw-byte display artifacts.

## Model Highlights

* **Architecture:** Custom Transformer with 8 layers, 8 attention heads, and a hidden size of 512, trained from scratch.
* **Context Window:** 256 tokens.
* **Vocabulary Size:** 50,257 tokens.
* **Tokenizer Efficiency:** 97.95% of the vocabulary is used on native Kabyle corpora, so almost every embedding row receives training signal. This avoids the unused ("dead") embedding rows that are common when a large multilingual tokenizer is applied to a single language.
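
The figures above can be sanity-checked with a little arithmetic. The sketch below is not taken from the training code: it assumes the standard GPT-2 layout (feed-forward width of 4x the hidden size, biases on all linear layers, input and output embeddings tied), and it assumes that "vocabulary utilization" means the share of vocabulary IDs that occur at least once in the tokenized corpus. The function names `gpt2_param_count` and `vocab_utilization` are hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, d_model, n_layer, mlp_ratio=4):
    """Parameter breakdown for a GPT-2 style decoder with tied embeddings.

    Assumption: standard GPT-2 layout, which the model card does not confirm.
    The number of attention heads does not change the count.
    """
    token_emb = vocab_size * d_model
    pos_emb = n_positions * d_model
    # Fused QKV projection plus output projection, each with a bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    d_ff = mlp_ratio * d_model
    # Two feed-forward projections, each with a bias.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * 2 * d_model
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d_model
    total = token_emb + pos_emb + n_layer * per_layer + final_ln
    return {
        "token_embeddings": token_emb,
        "per_layer": per_layer,
        "total": total,
    }


def vocab_utilization(token_id_sequences, vocab_size):
    """Fraction of vocabulary IDs that appear at least once in the corpus."""
    used = set()
    for seq in token_id_sequences:
        used.update(seq)
    return len(used) / vocab_size


counts = gpt2_param_count(vocab_size=50257, n_positions=256, d_model=512, n_layer=8)
print(counts)
print(f"embedding share: {counts['token_embeddings'] / counts['total']:.1%}")
```

Under these assumptions the token embedding table accounts for roughly half of all parameters, which is why the utilization figure matters for a model of this size: any vocabulary row that never appears in the corpus is a row that never receives a gradient update.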

## Tokenizer Performance

Our custom Byte-Pair Encoding (BPE) pipeline is designed to split words along Kabyle affix boundaries. Standard byte-level tokenizers display multi-byte characters as raw byte symbols (e.g., ÉĽ, áºĵ) in their token strings, whereas this pipeline keeps character boundaries intact during inference:

| Input Text Fragment | Standard Decoders (Noisy) | Our Native Pipeline (Clean) |
| --- | --- | --- |
| ... yettɛawad ... | `['yett', 'ÉĽawad']` | `['yett', 'ɛawad']` |
| ... iẓerfan ... | `['Ġiáºĵer', 'fan']` | `['Ġiẓer', 'fan']` |
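
For context on where the symbols in the "Noisy" column come from, the sketch below reproduces the byte-to-character table used by GPT-2 style byte-level BPE, in which every UTF-8 byte is shown as one printable character. It is an illustration of that standard mapping only, not the code of this model's tokenizer; the helper names `to_byte_symbols` and `from_byte_symbols` are hypothetical.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Printable bytes are displayed as themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every other byte is shifted to an unused code point starting at U+0100.
    n = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + n)
            n += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text):
    """Show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Invert the mapping and recover the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


print(to_byte_symbols("ɛ"))           # ÉĽ (two UTF-8 bytes, two symbols)
print(to_byte_symbols("ẓ"))           # áºĵ (three UTF-8 bytes, three symbols)
print(from_byte_symbols("Ġiáºĵer"))   # " iẓer" (Ġ stands for the space byte)
```

The mapping is reversible, so the byte symbols affect how individual tokens read when inspected rather than the text recovered by decoding a full sequence.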

## Quickstart Usage

You can load this model and its accompanying optimized tokenizer directly into your PyTorch environment:

```python
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel

# Load the custom assets
tokenizer = PreTrainedTokenizerFast.from_pretrained("boffire/kabyle-gpt2-tokenizer")
model = GPT2LMHeadModel.from_pretrained("your-username/kabyle-llm-base")

# Quick inference test
text = "Wa d amcic-is aberkan,"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Data & Methodology

The model was pre-trained on a cleaned and normalized **Kabyle text corpus** (~20 MB, 5.01M tokens in total).

### Optimization Settings
* **Training Duration:** 3 Epochs
* **Optimizer:** AdamW
* **Learning Rate:** `5e-4`
* **Batch Strategy:** Dynamic padding (each batch is padded only to the length of its longest sequence) to reduce wasted compute and VRAM.
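
As an illustration of the batch strategy listed above, here is a minimal sketch of dynamic padding in plain Python. It is not the training code: the `pad_batch` helper and the `pad_id` value are hypothetical, and a real pipeline would typically return tensors rather than lists.

```python
def pad_batch(sequences, pad_id):
    """Pad each sequence to the longest one in this batch, not to a fixed length.

    Returns (input_ids, attention_mask) as lists of lists. The mask is 1 for
    real tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs; pad_id=0 is an assumption for this example.
batch = [[11, 12, 13], [21], [31, 32]]
ids, mask = pad_batch(batch, pad_id=0)
print(ids)   # [[11, 12, 13], [21, 0, 0], [31, 32, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

With a 256-token context window, padding this batch to a fixed length would produce 768 positions, while padding to the longest sequence produces 9. The saving on real data depends on how sequence lengths are distributed.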