metadata
language:
- en
- kab
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- kabyle
- taqbaylit
- tamazight
- berber
- embeddings
- cross-lingual
- african-languages
- nlp
datasets:
- Imsidag-community/nllb_en_kab
- Imsidag-community/english-kabyle-parallel
- Imsidag-community/libretranslate-suggestions
- ayymen/Weblate-Translations
pipeline_tag: sentence-similarity
Kabyle Sentence Transformer (MPNet)
A sentence embedding model fine-tuned for cross-lingual semantic similarity between Kabyle (Taqbaylit) and English.
Model Details
| Attribute | Value |
|---|---|
| Base model | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
| Fine-tuning data | ~2.5M unique EN–KAB parallel sentences |
| Embedding dimension | 768 |
| Training framework | SentenceTransformers |
| Training time | ~1h 16min (1 epoch, 15,593 steps) |
| Final loss | 0.043 (started at 0.278) |
Training Data
| Source | Pairs | Description |
|---|---|---|
| NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
| Tatoeba + CS | ~202K | Community translations + software localization |
| Weblate | ~9K | FLOSS UI strings |
| LibreTranslate | ~449 | User-reviewed translations |
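To make the balance between sources concrete, the sketch below turns the approximate pair counts from the table into shares of the combined total. The counts are the rounded figures listed above, so the resulting percentages are estimates rather than exact corpus statistics.

```python
# Approximate pair counts copied from the Training Data table (rounded values).
pair_counts = {
    "NLLB (cleaned)": 2_350_000,
    "Tatoeba + CS": 202_000,
    "Weblate": 9_000,
    "LibreTranslate": 449,
}

# Combined size of all sources and each source's share of it.
total = sum(pair_counts.values())
shares = {name: count / total for name, count in pair_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
print(f"total: {total:,}")
```

Under these rounded counts the NLLB portion makes up a little under 92% of all pairs, which is why its domain mix dominates what the model sees during fine-tuning.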
Performance
Compared to the base model, paraphrase-multilingual-mpnet-base-v2, before fine-tuning on EN–KAB data:
| Metric | Base | This Model | Gain |
|---|---|---|---|
| Avg. cosine similarity (EN<->KAB) | 0.278 | 0.857 | +0.579 (about 58 points) |
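The metric in the table can be read as the mean cosine similarity between each English sentence and its Kabyle translation. The following is a minimal sketch of that computation, assuming the two embedding lists are row-aligned. The helper names `cosine` and `avg_pairwise_cosine` and the toy vectors are illustrative and are not taken from the evaluation script used for this model.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_pairwise_cosine(en_emb: Sequence[Vector], kab_emb: Sequence[Vector]) -> float:
    """Mean cosine similarity over aligned pairs.

    Item i of `en_emb` and item i of `kab_emb` are assumed to encode a
    sentence and its translation.
    """
    if len(en_emb) != len(kab_emb):
        raise ValueError("both inputs must contain the same number of embeddings")
    sims = [cosine(u, v) for u, v in zip(en_emb, kab_emb)]
    return sum(sims) / len(sims)


# Toy 2-D vectors standing in for the output of model.encode(...).
en = [[1.0, 0.0], [0.0, 2.0]]
kab = [[1.0, 0.0], [1.0, 1.0]]
score = avg_pairwise_cosine(en, kab)
print(round(score, 4))  # 0.8536
```

The same function accepts the rows of the arrays returned by `model.encode`, since it only iterates over each embedding.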
Usage
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")

# Embed an English sentence and a Kabyle sentence
sentences = ["Hello!", "Azul!"]
embeddings = model.encode(sentences)

# Cross-lingual similarity: cosine_similarity returns a 1x1 matrix here
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim[0][0])
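A common use of cross-lingual embeddings is retrieval: given an English query, rank Kabyle candidates by cosine similarity. The sketch below shows only the ranking step on toy vectors, so it runs without downloading the model. The function name `rank_candidates` and the example vectors are hypothetical; in practice the inputs would be embeddings produced by `model.encode`.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(
    query_emb: Vector, candidate_embs: Sequence[Vector], top_k: int = 3
) -> List[Tuple[int, float]]:
    """Return (candidate index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_emb, c)) for i, c in enumerate(candidate_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors: one "English query" and three "Kabyle candidates".
query = [0.9, 0.1, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_candidates(query, candidates, top_k=2)
print(ranking)
```

Here candidate 1 ranks first and candidate 2 second, because they point in the directions closest to the query.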
Limitations
- Trained primarily on parallel data; monolingual Kabyle similarity not explicitly optimized
- Best for EN<->KAB cross-lingual tasks; Kabyle<->Kabyle may work but is untested
- Religious text is overrepresented in the NLLB portion, so the model may underperform on highly technical or modern domains
- The evaluator used constant labels (all 1.0) because every evaluation pair was positive; correlation against a constant label has zero variance, so the correlation metrics were undefined
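The last limitation follows directly from how correlation is defined. The sketch below, with hypothetical scores and labels, computes Pearson correlation by hand and returns `None` when either input has zero variance, which is exactly the situation when every gold label is 1.0.

```python
import math
from typing import Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either input has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        # The denominator would be zero, so the correlation is undefined.
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model cosine scores for four evaluation pairs.
predicted = [0.91, 0.84, 0.79, 0.88]

# All-positive labels, as in this model's evaluation set.
constant_labels = [1.0, 1.0, 1.0, 1.0]

# Hypothetical graded labels, shown only for contrast.
graded_labels = [1.0, 0.8, 0.2, 0.9]

print(pearson(predicted, constant_labels))  # None
print(pearson(predicted, graded_labels))
```

With graded labels the correlation is well defined, which is why a benchmark with varied similarity scores, or a metric that does not rely on correlation, is needed to evaluate a model trained only on positive pairs.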
Future Work
- Train v2 with a Davlan/afro-xlmr-large backbone for African-specific pretraining
- Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
- Fix evaluator to use AvgCosineEvaluator instead of correlation-based metrics
- Evaluate against LASER on a proper benchmark
Citation
If you use this model, please cite:
@misc{kabyle-st-mpnet,
title={Kabyle Sentence Transformer},
author={boffire},
year={2026},
howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
}
Acknowledgments
- Imsidag-community for the cleaned parallel corpora
- Tatoeba contributors for community translations
- Meta AI for LASER and NLLB datasets
- boffire community for Kabyle NLP tooling