---
language:
- kab
tags:
- gpt2
- causal-lm
- custom-vocab
license: mit
datasets:
- custom-kabyle-corpus
metrics:
- perplexity
---
# Kabyle GPT-2 Base Model (Optimized BPE)
This is a custom, lightweight GPT-2 style causal language model built from scratch for the **Kabyle (Taqbaylit)** language. It uses a morphology-aware subword tokenizer trained with byte-aware rules, which preserves Latin-script Tamazight text without raw-byte display artifacts.
## Model Highlights
* **Architecture:** Custom Transformer with 8 layers, 8 attention heads, and a hidden size of 512, built from scratch.
* **Context Window:** 256 tokens.
* **Vocabulary Size:** 50,257 tokens.
* **Tokenizer Efficiency:** Achieves a **97.95% vocabulary utilization rate** on native Kabyle corpora, so nearly every embedding row is actually used. This avoids the large share of dead parameters common with massive multilingual tokenizers.
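
The figures above can be turned into rough numbers. The sketch below is not taken from the training code: it assumes a standard GPT-2 block layout (4x MLP expansion, biases, two LayerNorms per block, output head tied to the token embedding), which a "custom" architecture may not follow exactly, and it defines vocabulary utilization as the share of vocabulary IDs that occur at least once in a tokenized corpus. The function names are hypothetical.

```python
def gpt2_param_count(vocab_size, n_ctx, d_model, n_layer, mlp_ratio=4, tied_lm_head=True):
    """Parameter count of a standard GPT-2 style decoder (assumed layout)."""
    d_ff = mlp_ratio * d_model
    token_emb = vocab_size * d_model          # one embedding row per token
    pos_emb = n_ctx * d_model                 # learned position embeddings
    # Attention: fused QKV projection plus output projection, with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection and down-projection, with biases.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)           # two LayerNorms (weight + bias) per block
    final_ln = 2 * d_model
    lm_head = 0 if tied_lm_head else vocab_size * d_model
    return token_emb + pos_emb + n_layer * (attn + mlp + layer_norms) + final_ln + lm_head


def vocab_utilization(token_ids, vocab_size):
    """Fraction of vocabulary IDs that appear at least once in token_ids."""
    return len(set(token_ids)) / vocab_size


if __name__ == "__main__":
    total = gpt2_param_count(vocab_size=50257, n_ctx=256, d_model=512, n_layer=8)
    print(total)                    # 51082752 under the assumptions above
    print(50257 * 512 / total)      # share of parameters in the token embedding
    print(round(0.9795 * 50257))    # 49227 vocabulary rows in use at 97.95%
```

Under these assumptions the token embedding accounts for roughly half of all parameters, which is why the utilization rate matters for a model of this size. The number of attention heads does not change the count.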
## Tokenizer Performance
Our custom Byte-Pair Encoding (BPE) pipeline maps linguistic affixes accurately. Standard byte-level tokenizers render non-ASCII characters as raw byte symbols (e.g., `ÉĽ` for `ɛ`, `áºĵ` for `ẓ`), whereas this tokenizer keeps character boundaries intact during inference:
| Input Text Fragment | Standard Decoders (Noisy) | Our Native Pipeline (Clean) |
| :--- | :--- | :--- |
| **... yettɛawad ...** | `['yett', 'ÉĽawad']` | `['yett', 'ɛawad']` |
| **... iẓerfan ...** | `['Ġiáºĵer', 'fan']` | `['Ġiẓer', 'fan']` |
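
The noisy strings in the middle column are not random corruption: they are what GPT-2 style byte-level BPE produces when each UTF-8 byte is mapped to a printable stand-in character. The sketch below reimplements that byte-to-character table in plain Python to show where `ÉĽ` and `áºĵ` come from. It illustrates the standard GPT-2 mapping only; it is not this model's tokenizer, and the helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    # Printable bytes map to themselves.
    mapping = {b: chr(b) for b in printable}
    # Every other byte is assigned the next free code point from 256 upward.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Invert the mapping and recover the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


if __name__ == "__main__":
    print(to_byte_level("ɛawad"))       # ÉĽawad
    print(to_byte_level(" iẓer"))       # Ġiáºĵer
    print(from_byte_level("ÉĽawad"))    # ɛawad
```

Because the mapping is reversible, the byte symbols can always be converted back to readable text, as long as a token does not split a multi-byte character in the middle.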
## Quickstart Usage
You can load this model and its accompanying tokenizer directly into your PyTorch environment. The model ID `your-username/kabyle-llm-base` is a placeholder; replace it with the actual repository ID:
```python
from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel
# Load the custom assets
tokenizer = PreTrainedTokenizerFast.from_pretrained("boffire/kabyle-gpt2-tokenizer")
model = GPT2LMHeadModel.from_pretrained("your-username/kabyle-llm-base")
# Quick inference test
text = "Wa d amcic-is aberkan,"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Data & Methodology
The model was pre-trained on a cleaned and normalized **Kabyle text corpus** (~20 MB, 5.01M tokens in total).
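
For a sense of scale, the corpus figures can be combined with the 256-token context window. This is a back-of-the-envelope sketch, not a description of the actual data pipeline: it assumes "~20 MB" means 20 million bytes and that the token stream is cut into non-overlapping 256-token windows, neither of which is stated above. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(corpus_bytes, total_tokens, context, epochs):
    """Rough corpus statistics under the assumptions stated in the text."""
    full_windows = total_tokens // context
    return {
        "bytes_per_token": corpus_bytes / total_tokens,
        "full_windows": full_windows,
        "leftover_tokens": total_tokens - full_windows * context,
        "tokens_seen": total_tokens * epochs,
    }


if __name__ == "__main__":
    stats = corpus_stats(
        corpus_bytes=20 * 10**6,   # assumption: "~20 MB" read as 20 million bytes
        total_tokens=5_010_000,    # 5.01M tokens, from the model card
        context=256,               # context window, from the model card
        epochs=3,                  # training duration, from the model card
    )
    for name, value in stats.items():
        print(name, value)
```

With these inputs the corpus averages about 4 bytes per token and yields 19,570 full context windows per epoch.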
### Optimization Settings
* **Training Duration:** 3 Epochs
* **Optimizer:** AdamW
* **Learning Rate:** `5e-4`
* **Batch Strategy:** Dynamic batch padding to maximize hardware and VRAM efficiency.
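
Dynamic batch padding means each batch is padded only to the length of its own longest sequence, rather than to the full 256-token context. The sketch below shows the idea in plain Python; it is not the training code used for this model, and `pad_batch` and the `pad_id=0` default are hypothetical. In a `transformers` training loop a data collator would typically handle this step.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def padding_tokens(sequences, target_len):
    """Number of padding tokens needed to bring every sequence to target_len."""
    return sum(target_len - len(seq) for seq in sequences)


if __name__ == "__main__":
    batch = [[5, 6, 7], [8], [9, 10]]
    padded = pad_batch(batch)
    print(padded["input_ids"])        # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
    print(padded["attention_mask"])   # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
    print(padding_tokens(batch, 3))   # 3 padding tokens with dynamic padding
    print(padding_tokens(batch, 256)) # 762 padding tokens with fixed-length padding
```

The saving grows with the gap between typical sentence length and the context window, which is where the memory benefit comes from.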