boffire committed on
Commit 3012359 · verified · 1 Parent(s): 7ff1915

Update README.md

Files changed (1)
  1. README.md +103 -0
README.md CHANGED
@@ -1,3 +1,106 @@
  ---
+ language:
+ - en
+ - kab
  license: apache-2.0
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - kabyle
+ - taqbaylit
+ - tamazight
+ - berber
+ - embeddings
+ - cross-lingual
+ - african-languages
+ - nlp
+ datasets:
+ - Imsidag-community/nllb_en_kab
+ - Imsidag-community/english-kabyle-parallel
+ - Imsidag-community/libretranslate-suggestions
+ - ayymen/Weblate-Translations
+ pipeline_tag: sentence-similarity
  ---
+
+ # Kabyle Sentence Transformer (MPNet)
+
+ A sentence embedding model fine-tuned for cross-lingual semantic similarity between **Kabyle (Taqbaylit)** and **English**.
+
+ ## Model Details
+
+ | Attribute | Value |
+ |-----------|-------|
+ | Base model | `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` |
+ | Fine-tuning data | ~2.5M unique EN–KAB parallel sentence pairs |
+ | Embedding dimension | 768 |
+ | Training framework | SentenceTransformers |
+ | Training time | ~1h 16min (1 epoch, 15,593 steps) |
+ | Final loss | 0.043 (started at 0.278) |
+
+ ## Training Data
+
+ | Source | Pairs | Description |
+ |--------|-------|-------------|
+ | NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
+ | Tatoeba + CS | ~202K | Community translations + software localization |
+ | Weblate | ~9K | FLOSS UI strings |
+ | LibreTranslate | ~449 | User-reviewed translations |
+
+ ## Performance
+
+ Compared to the base `paraphrase-multilingual-mpnet-base-v2` (before fine-tuning):
+
+ | Metric | Base | This Model | Gain |
+ |--------|------|------------|------|
+ | Avg. cosine similarity (EN<->KAB) | 0.278 | **0.857** | **+0.579** |
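
The metric in the table is an average of per-pair cosine similarities over aligned EN/KAB sentences. The following is a minimal sketch of how such a score might be computed, not the evaluation code used for this model: `cosine` and `avg_pair_cosine` are hypothetical helper names, and the two-dimensional toy vectors stand in for the 768-dimensional outputs of `model.encode`.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def avg_pair_cosine(src_vecs, tgt_vecs) -> float:
    """Mean cosine similarity over aligned pairs (row i of src matches row i of tgt)."""
    if len(src_vecs) != len(tgt_vecs) or len(src_vecs) == 0:
        raise ValueError("expected two non-empty, equally long lists of vectors")
    total = sum(cosine(u, v) for u, v in zip(src_vecs, tgt_vecs))
    return total / len(src_vecs)


# Toy stand-ins for English and Kabyle embeddings of two translation pairs.
en_vecs = [[1.0, 0.0], [0.0, 2.0]]
kab_vecs = [[1.0, 0.0], [1.0, 1.0]]
score = avg_pair_cosine(en_vecs, kab_vecs)
print(f"average pair cosine: {score:.3f}")
```

Running the same function on embeddings from the base model and from the fine-tuned model, over the same held-out pairs, would give the two columns being compared.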
+
+ ## Usage
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")
+
+ # Embed an English sentence and its Kabyle counterpart
+ sentences = ["Hello!", "Azul!"]
+ embeddings = model.encode(sentences)  # array of shape (2, 768)
+
+ # Cross-lingual similarity; cosine_similarity expects 2D inputs
+ sim = cosine_similarity([embeddings[0]], [embeddings[1]])
+ print(sim)  # 1x1 array holding the EN-KAB cosine similarity
+ ```
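
For retrieval-style use (finding the Kabyle sentence closest to an English query), the same embeddings can be ranked by cosine similarity. This is a rough sketch under stated assumptions: `rank_by_cosine` is a hypothetical helper, not part of the model or of SentenceTransformers, and the toy vectors stand in for rows returned by `model.encode`.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_by_cosine(query, candidates, top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (candidate index, cosine score) pairs, best match first."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins: one English query embedding and three Kabyle candidates.
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranking = rank_by_cosine(query_vec, candidate_vecs, top_k=2)
for idx, score in ranking:
    print(idx, round(score, 3))
```

Normalizing first keeps the scores independent of vector length, which is why the candidate `[2.0, 0.0]` still scores as a perfect match for the query.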
+
+ ## Limitations
+
+ - Trained primarily on parallel data; monolingual Kabyle similarity was not explicitly optimized
+ - Best suited to EN<->KAB cross-lingual tasks; Kabyle<->Kabyle may work but is untested
+ - Religious text is overrepresented in the NLLB portion, so the model may underperform on highly technical or modern domains
+ - The evaluator used constant labels (all 1.0) because every pair was positive, so correlation metrics were undefined
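
The constant-label limitation follows from how correlation is defined: Pearson's r divides by the spread of each variable, and a label vector of all 1.0 has zero variance. The snippet below illustrates this from the textbook formula. It is not taken from the training code, and the function name and sample scores are made up for the example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        # Division by zero in the formula: the correlation is undefined.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


predicted = [0.91, 0.78, 0.85, 0.66]    # hypothetical model similarity scores
constant_labels = [1.0, 1.0, 1.0, 1.0]  # every pair labelled as a positive
graded_labels = [0.9, 0.7, 0.8, 0.5]    # hypothetical graded gold scores

undefined_r = pearson(predicted, constant_labels)
defined_r = pearson(predicted, graded_labels)
print(undefined_r, round(defined_r, 3))
```

With graded labels the correlation is well defined, which is why an evaluation set needs varied gold scores (or a metric that does not rely on correlation) to be informative.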
+
+ ## Future Work
+
+ - Train v2 with a `Davlan/afro-xlmr-large` backbone for African-specific pretraining
+ - Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
+ - Fix the evaluator to use `AvgCosineEvaluator` instead of correlation-based metrics
+ - Evaluate against LASER on a proper benchmark
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{kabyle-st-mpnet,
+   title={Kabyle Sentence Transformer},
+   author={boffire},
+   year={2026},
+   howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - Imsidag-community for the cleaned parallel corpora
+ - Tatoeba contributors for community translations
+ - Meta AI for LASER and NLLB datasets
+ - boffire community for Kabyle NLP tooling