---
language:
- kab
tags:
- gpt2
- causal-lm
- custom-vocab
license: mit
datasets:
- custom-kabyle-corpus
metrics:
- perplexity
---

# Kabyle GPT-2 Base Model (Optimized BPE)

This is a lightweight GPT-2-style causal language model trained from scratch for the **Kabyle (Taqbaylit)** language. It uses a custom BPE subword tokenizer, tuned to Kabyle morphology and trained with byte-aware rules, that keeps Latin-script Tamazight characters such as `ɛ` and `ẓ` readable in its tokens instead of rendering them as raw byte symbols.

## Model Highlights
* **Architecture:** 8-layer Transformer with 8 attention heads and a hidden size of 512, trained from scratch.
* **Context Window:** 256 tokens.
* **Vocabulary Size:** 50,257 tokens.
* **Tokenizer Efficiency:** **97.95% vocabulary utilization** on native Kabyle corpora, so almost every embedding row is used and the dead parameters common with large multilingual tokenizers are largely avoided.

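One way to read the utilization figure is as the share of vocabulary IDs that occur at least once when the corpus is tokenized. The sketch below assumes that definition, since this card does not spell out how the 97.95% was measured. `vocab_utilization` is a hypothetical helper, and the closing lines only redo the arithmetic implied by the stated vocabulary size and hidden size.

```python
def vocab_utilization(token_ids, vocab_size):
    """Return the fraction of vocabulary IDs that appear at least once.

    Assumption: "utilization" means distinct IDs seen / vocabulary size.
    """
    seen = {i for i in token_ids if 0 <= i < vocab_size}
    return len(seen) / vocab_size


# Toy check with a 10-entry vocabulary: only IDs 0, 1, 2 and 7 occur.
toy_ratio = vocab_utilization([0, 1, 1, 2, 7, 7, 7], vocab_size=10)

# Arithmetic from the figures stated in this card.
VOCAB_SIZE = 50_257
HIDDEN_SIZE = 512
used_rows = round(0.9795 * VOCAB_SIZE)       # embedding rows whose token occurs
unused_rows = VOCAB_SIZE - used_rows         # rows whose token never occurs
unused_params = unused_rows * HIDDEN_SIZE    # embedding weights in those rows

print(toy_ratio, used_rows, unused_rows, unused_params)
```

Under these assumptions, roughly a thousand embedding rows go unused, a small fraction of the full embedding table.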
## Tokenizer Performance

Our custom Byte-Pair Encoding (BPE) pipeline segments Kabyle affixes cleanly. Standard byte-level tokenizers display multi-byte characters as raw byte symbols (e.g., `ÉĽ` for `ɛ`, `áºĵ` for `ẓ`); this tokenizer keeps those characters intact in its tokens:

| Input Text Fragment | Standard Decoders (Noisy) | Our Native Pipeline (Clean) |
| :--- | :--- | :--- |
| **... yettɛawad ...** | `['yett', 'ÉĽawad']` | `['yett', 'ɛawad']` |
| **... iẓerfan ...** | `['Ġiáºĵer', 'fan']` | `['Ġiẓer', 'fan']` |

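The noisy strings in the middle column are not corruption: they are how GPT-2-style byte-level BPE displays the UTF-8 bytes of `ɛ` and `ẓ`. The sketch below reimplements that byte-to-symbol table in plain Python, following the scheme used by GPT-2's byte-level tokenizer, and shows the round trip. The helper names are ours, and the code illustrates only the display convention, not this model's tokenizer. Characters are written as escapes so the example does not depend on how this page is encoded.

```python
def bytes_to_unicode():
    """Map each byte value 0-255 to a printable symbol, GPT-2 style."""
    # Bytes that already print cleanly keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    # Remaining bytes (controls, space, 0x7F-0xA0, 0xAD) are shifted to 256+.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_SYMBOL = bytes_to_unicode()
SYMBOL_TO_BYTE = {s: b for b, s in BYTE_TO_SYMBOL.items()}


def to_byte_symbols(text):
    """Show text the way a byte-level BPE vocabulary stores it."""
    return "".join(BYTE_TO_SYMBOL[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Invert the mapping and decode the bytes back to readable text."""
    return bytes(SYMBOL_TO_BYTE[s] for s in symbols).decode("utf-8")


noisy = to_byte_symbols(" i\u1e93erfan")          # input is " iẓerfan"
clean = from_byte_symbols("\u00c9\u013dawad")     # input is "ÉĽawad"
print(noisy)
print(clean)
```

The leading space turns into `Ġ`, which is why that symbol appears at the start of word-initial tokens in both columns of the table.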
## Quickstart Usage

Load the model and its tokenizer with the Transformers library (PyTorch backend):

```python
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel

# Load the custom tokenizer and the model weights.
# Replace the placeholder "your-username/kabyle-llm-base" with the actual model repo ID.
tokenizer = PreTrainedTokenizerFast.from_pretrained("boffire/kabyle-gpt2-tokenizer")
model = GPT2LMHeadModel.from_pretrained("your-username/kabyle-llm-base")

# Quick inference test: sample a continuation of a Kabyle prompt.
text = "Wa d amcic-is aberkan,"
inputs = tokenizer(text, return_tensors="pt")
# max_length counts the prompt tokens plus the generated tokens.
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Data & Methodology

The model was pre-trained on a cleaned and normalized **Kabyle text corpus** (~20 MB, 5.01M tokens).

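As a rough sanity check on those figures, the arithmetic below derives the average bytes per token and a lower bound on the number of training sequences per epoch. It assumes that "20 MB" means 20 × 10^6 bytes and that the corpus is cut into full, non-overlapping 256-token windows. Neither detail is stated in this card, and variable-length sequences would give a higher count.

```python
CORPUS_BYTES = 20_000_000    # "~20 MB", assumed to be decimal megabytes
CORPUS_TOKENS = 5_010_000    # 5.01M tokens, as stated in this card
CONTEXT_WINDOW = 256         # maximum tokens per training sequence
EPOCHS = 3

bytes_per_token = CORPUS_BYTES / CORPUS_TOKENS
# Assumes packed, non-overlapping windows; shorter sequences mean more of them.
windows_per_epoch = CORPUS_TOKENS // CONTEXT_WINDOW
tokens_seen = CORPUS_TOKENS * EPOCHS

print(f"{bytes_per_token:.2f} bytes per token")
print(f"{windows_per_epoch:,} full windows per epoch")
print(f"{tokens_seen:,} tokens seen over {EPOCHS} epochs")
```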
### Optimization Settings
* **Training Duration:** 3 epochs
* **Optimizer:** AdamW
* **Learning Rate:** `5e-4`
* **Batch Strategy:** Dynamic padding (each batch is padded only to its longest sequence) to reduce wasted compute and VRAM.
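
Dynamic padding pads each batch only to the length of its longest sequence rather than to the full 256-token window. The plain-Python sketch below illustrates the idea. `pad_batch` and `PAD_ID` are hypothetical names, and the actual training code, which this card does not show, may well have used a library collator instead.

```python
PAD_ID = 0          # hypothetical padding token ID
MAX_LENGTH = 256    # context window stated in this card


def pad_batch(sequences, pad_id=PAD_ID, max_length=MAX_LENGTH):
    """Pad token-ID lists to the longest sequence in this batch.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in truncated
    ]
    return input_ids, attention_mask


batch = [[11, 12, 13], [21, 22, 23, 24, 25], [31]]
input_ids, attention_mask = pad_batch(batch)
# Every row is now 5 tokens wide (the longest sequence), not 256.
print(input_ids)
print(attention_mask)
```

With the Transformers library, `DataCollatorForLanguageModeling(tokenizer, mlm=False)` gives similar per-batch padding, provided the tokenizer defines a padding token.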