Update README.md
README.md
@@ -1,3 +1,60 @@
---
language:
- kab
tags:
- gpt2
- causal-lm
- custom-vocab
license: mit
datasets:
- custom-kabyle-corpus
metrics:
- perplexity
---

# Kabyle GPT-2 Base Model (Optimized BPE)

This is a lightweight GPT-2 style causal language model trained from scratch for the **Kabyle (Taqbaylit)** language. It uses a custom subword tokenizer, tuned for Kabyle morphology and trained with byte-aware rules, so that Kabyle text in the Latin-based Tamazight script is segmented along character boundaries without the byte-level artifacts that appear in raw token strings.

## Model Highlights

* **Architecture:** 8-layer Transformer with 8 attention heads and a hidden size of 512, trained from scratch.
* **Context Window:** 256 tokens.
* **Vocabulary Size:** 50,257 tokens.
* **Tokenizer Efficiency:** **97.95% vocabulary utilization** on native Kabyle corpora, so nearly every embedding row is exercised and very few parameters sit unused, a common problem with large multilingual tokenizers.
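
For a sense of scale, the figures above imply the parameter budget sketched below. This is back-of-the-envelope arithmetic under assumptions the card does not state: the standard GPT-2 block layout, a feed-forward width of 4 × 512, and input and output embeddings that share weights. The helper name `gpt2_param_count` is illustrative.

```python
def gpt2_param_count(n_layer=8, n_embd=512, n_positions=256, vocab_size=50257):
    """Count parameters of a GPT-2 style model (assumes 4x MLP width, tied embeddings)."""
    token_emb = vocab_size * n_embd
    pos_emb = n_positions * n_embd
    # Attention: fused QKV projection plus output projection, each with a bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back, each with a bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * 2 * n_embd
    blocks = n_layer * (attn + mlp + norms)
    final_norm = 2 * n_embd
    return {
        "token_embedding": token_emb,
        "position_embedding": pos_emb,
        "transformer_blocks": blocks,
        "final_layer_norm": final_norm,
        "total": token_emb + pos_emb + blocks + final_norm,
    }


budget = gpt2_param_count()
share = budget["token_embedding"] / budget["total"]
print(f"total parameters: {budget['total']:,}")
print(f"token embedding share: {share:.1%}")
```

Under these assumptions the model has about 51 million parameters, and roughly half of them sit in the token embedding matrix. At this size, unused vocabulary rows are a meaningful fraction of the model.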
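
The card does not define how the vocabulary utilization rate is computed. One plausible reading, sketched here as an assumption, is the fraction of vocabulary IDs that occur at least once when the corpus is tokenized. The function name `vocab_utilization` and the toy inputs are illustrative and are not the author's evaluation script.

```python
from typing import Iterable


def vocab_utilization(token_id_sequences: Iterable[Iterable[int]], vocab_size: int) -> float:
    """Return the fraction of vocabulary IDs seen at least once in the corpus."""
    seen = set()
    for sequence in token_id_sequences:
        seen.update(sequence)
    # Ignore any out-of-range IDs so the ratio cannot exceed 1.0.
    used = sum(1 for token_id in seen if 0 <= token_id < vocab_size)
    return used / vocab_size


# Toy corpus: three tokenized sentences over a 10-entry vocabulary.
toy_corpus = [[0, 1, 2, 2], [2, 3, 7], [7, 8, 0]]
rate = vocab_utilization(toy_corpus, vocab_size=10)
print(f"{rate:.2%}")  # 6 of the 10 IDs are used
```

With a real tokenizer, the sequences would come from `tokenizer(line)["input_ids"]` for each line of the corpus. Under this reading, 97.95% of 50,257 entries corresponds to roughly 49,200 IDs in use.
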
## Tokenizer Performance

The custom Byte-Pair Encoding (BPE) pipeline segments Kabyle affixes cleanly. Where standard byte-level tokenizers display multi-byte letters as raw byte symbols (e.g., `ÉĽ`, `áºĵ`), this tokenizer keeps character boundaries intact in its output tokens:

| Input Text Fragment | Standard Decoders (Noisy) | Our Native Pipeline (Clean) |
| :--- | :--- | :--- |
| **... yettɛawad ...** | `['yett', 'ÉĽawad']` | `['yett', 'ɛawad']` |
| **... iẓerfan ...** | `['Ġiáºĵer', 'fan']` | `['Ġiẓer', 'fan']` |

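
The noisy strings in the table are consistent with how GPT-2 style byte-level BPE displays raw tokens: each UTF-8 byte is mapped to a printable stand-in character, so a two-byte letter such as `ɛ` shows up as two unrelated symbols. The sketch below reimplements that well-known byte-to-character table to illustrate the effect. Whether the "Standard Decoders" column was produced this way is an assumption, and the helper names are illustrative.

```python
def bytes_to_unicode() -> dict:
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that already have a printable form map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes are shifted to code points starting at 256.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_raw_token(text: str) -> str:
    """Show text the way a raw byte-level token string displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_raw_token(token: str) -> str:
    """Invert the mapping to recover the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


print(to_raw_token("\u025bawad"))    # the fragment "ɛawad" as a raw token
print(to_raw_token(" i\u1e93er"))    # the fragment " iẓer" as a raw token
print(from_raw_token(to_raw_token("yett\u025bawad")))
```

The first two calls reproduce the `ÉĽawad` and `Ġiáºĵer` strings from the table, and the third recovers the original word. In tokenizers configured with a byte-level decoder, decoding applies this inverse mapping, so the stand-in characters appear only when raw token strings are printed.
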
## Quickstart Usage

Load the model and its tokenizer with the Transformers library (PyTorch backend):

```python
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load the custom tokenizer and the model weights.
tokenizer = PreTrainedTokenizerFast.from_pretrained("boffire/kabyle-gpt2-tokenizer")
# "your-username/kabyle-llm-base" is a placeholder; replace it with this model's repository ID.
model = GPT2LMHeadModel.from_pretrained("your-username/kabyle-llm-base")

# Quick inference test: sample a continuation of a Kabyle prompt.
text = "Wa d amcic-is aberkan,"
inputs = tokenizer(text, return_tensors="pt")
# max_length counts the prompt tokens plus the generated tokens.
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Data & Methodology

The model was pre-trained on a cleaned and normalized **Kabyle text corpus** of about 20 MB (5.01M tokens in total).

### Optimization Settings
* **Training Duration:** 3 epochs
* **Optimizer:** AdamW
* **Learning Rate:** `5e-4`
* **Batch Strategy:** Dynamic padding, where each batch is padded only to the length of its longest sequence, to reduce wasted computation and VRAM.
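
Dynamic padding can be illustrated with a small collate function that pads each batch only to the length of its longest sequence instead of the full 256-token context. This is a sketch in plain Python: the function name `pad_batch` and the pad ID of 0 are assumptions, and the actual training script is not shown in this card.

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0, max_len: int = 256
) -> Dict[str, List[List[int]]]:
    """Pad token ID lists to the longest sequence in this batch (capped at max_len)."""
    truncated = [seq[:max_len] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = target - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch["input_ids"])       # every row has length 3, not 256
print(batch["attention_mask"])
```

In a Transformers training loop, `DataCollatorForLanguageModeling(tokenizer, mlm=False)` plays a similar role for causal language modeling, provided the tokenizer defines a padding token.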