Commit d39b893 by boffire · verified · 1 Parent(s): 306507f

Update README.md

  1. README.md +57 -0
README.md CHANGED
@@ -1,3 +1,60 @@
  ---
+ language:
+ - kab
+ tags:
+ - gpt2
+ - causal-lm
+ - custom-vocab
  license: mit
+ datasets:
+ - custom-kabyle-corpus
+ metrics:
+ - perplexity
  ---
+
+ # Kabyle GPT-2 Base Model (Optimized BPE)
+
+ This is a lightweight GPT-2 style causal language model trained from scratch for the **Kabyle (Taqbaylit)** language. It uses a custom subword tokenizer, trained with byte-aware rules, whose tokens keep Latin-script Tamazight characters such as `ɛ` and `ẓ` readable instead of rendering them as raw byte symbols.
+
+ ## Model Highlights
+ * **Architecture:** Custom Transformer with 8 layers, 8 attention heads, and a hidden size of 512, trained from scratch.
+ * **Context Window:** 256 tokens.
+ * **Vocabulary Size:** 50,257 tokens.
+ * **Tokenizer Efficiency:** Reaches a **97.95% vocabulary utilization rate** on native Kabyle corpora, so nearly every embedding row is exercised and the model largely avoids the unused ("dead") embedding parameters common with massive multilingual tokenizers.
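
The README does not say how the utilization rate was measured. One plausible reading, sketched below as an assumption, is the share of vocabulary ids that occur at least once when the corpus is tokenized. The function name `vocab_utilization` and the toy corpus are hypothetical; only the vocabulary size and hidden size come from the highlights above.

```python
from typing import Iterable, List

VOCAB_SIZE = 50_257  # from the model highlights
HIDDEN_SIZE = 512    # from the model highlights


def vocab_utilization(token_id_sequences: Iterable[List[int]], vocab_size: int) -> float:
    """Return the fraction of vocabulary ids that appear at least once."""
    seen = set()
    for ids in token_id_sequences:
        # Ignore ids outside the vocabulary range.
        seen.update(i for i in ids if 0 <= i < vocab_size)
    return len(seen) / vocab_size


# Toy example with a hypothetical 8-entry vocabulary, not the real corpus.
toy_corpus = [[0, 1, 2, 2], [2, 3, 5], [5, 6, 0]]
rate = vocab_utilization(toy_corpus, vocab_size=8)
print(f"{rate:.2%}")  # 6 of 8 ids occur, so 75.00%

# Size of the token embedding table implied by the stated dimensions.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
print(embedding_params)  # 25731584
```

At this model size the token embedding table alone holds about 25.7M weights, which is why rows that never receive a gradient would be a noticeable waste.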
+
+ ## Tokenizer Performance
+
+ Our custom Byte-Pair Encoding (BPE) pipeline segments words along their affixes. Standard byte-level tokenizers display multi-byte characters as raw byte symbols (e.g., `ÉĽ` for `ɛ`, `áºĵ` for `ẓ`), whereas this pipeline keeps each character intact in its tokens:
+
+ | Input Text Fragment | Standard Decoders (Noisy) | Our Native Pipeline (Clean) |
+ | :--- | :--- | :--- |
+ | **... yettɛawad ...** | `['yett', 'ÉĽawad']` | `['yett', 'ɛawad']` |
+ | **... iẓerfan ...** | `['Ġiáºĵer', 'fan']` | `['Ġiẓer', 'fan']` |
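
The noisy column can be reproduced from first principles, assuming the "standard" tokenizers in the table are GPT-2 style byte-level BPE. That scheme maps every UTF-8 byte to a printable stand-in character, so a two-byte letter like `ɛ` shows up as two unrelated symbols. The sketch below rebuilds that byte table; the helper names `to_byte_symbols` and `from_byte_symbols` are hypothetical, and the README does not describe how its own pipeline produces the clean tokens.

```python
def bytes_to_unicode() -> dict:
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in mapping:
            # Bytes without a printable form are shifted to code points >= 256.
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text: str) -> str:
    """Show text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols: str) -> str:
    """Invert the mapping to recover the readable text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


# The ẓ below is the precomposed character U+1E93.
print(to_byte_symbols("ɛawad"))     # ÉĽawad
print(to_byte_symbols(" iẓer"))     # Ġiáºĵer
print(from_byte_symbols("ÉĽawad"))  # ɛawad
```

The mapping is lossless, so the noise is purely a display issue: it appears whenever raw token strings are inspected without converting the stand-in characters back to bytes.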
+
+ ## Quickstart Usage
+
+ You can load the model and its tokenizer with the `transformers` library (PyTorch backend):
+
+ ```python
+ from transformers import PreTrainedTokenizerFast, GPT2LMHeadModel
+
+ # Load the custom assets
+ tokenizer = PreTrainedTokenizerFast.from_pretrained("boffire/kabyle-gpt2-tokenizer")
+ # "your-username/kabyle-llm-base" is a placeholder; substitute the actual model repo id
+ model = GPT2LMHeadModel.from_pretrained("your-username/kabyle-llm-base")
+
+ # Quick inference test: sample up to 40 tokens (prompt included) with top-k sampling
+ text = "Wa d amcic-is aberkan,"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model.generate(**inputs, max_length=40, do_sample=True, top_k=50)
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Training Data & Methodology
+
+ The model was pre-trained on a cleaned and normalized **Kabyle text corpus** (~20 MB, 5.01M tokens in total).
+
+ ### Optimization Settings
+ * **Training Duration:** 3 epochs
+ * **Optimizer:** AdamW
+ * **Learning Rate:** `5e-4`
+ * **Batch Strategy:** Dynamic padding (each batch is padded only to the length of its longest sequence) to reduce wasted compute and VRAM.
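
The training script is not part of this commit, so the following is only a minimal sketch of what dynamic padding usually means: each batch is padded to the length of its own longest sequence rather than to the full 256-token context. The function `pad_batch` and the pad id `0` are hypothetical; a library collator such as `DataCollatorForLanguageModeling` with `mlm=False` is a common way to get similar behavior.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical pad token id; the README does not name one


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

With short sentences, a batch like this occupies 3 positions per row instead of 256, which is where the compute and memory savings come from.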