---
language:
- en
- kab
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- kabyle
- taqbaylit
- tamazight
- berber
- embeddings
- cross-lingual
- african-languages
- nlp
datasets:
- Imsidag-community/nllb_en_kab
- Imsidag-community/english-kabyle-parallel
- Imsidag-community/libretranslate-suggestions
- ayymen/Weblate-Translations
pipeline_tag: sentence-similarity
---
# Kabyle Sentence Transformer (MPNet)

A sentence embedding model fine-tuned for cross-lingual semantic similarity between **Kabyle (Taqbaylit)** and **English**.
## Model Details

| Attribute | Value |
|-----------|-------|
| Base model | `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` |
| Fine-tuning data | ~2.5M unique EN–KAB parallel sentences |
| Embedding dimension | 768 |
| Training framework | SentenceTransformers |
| Training time | ~1h 16min (1 epoch, 15,593 steps) |
| Final loss | 0.043 (started at 0.278) |
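
The step count and corpus size in the table give a rough sense of the training configuration. The sketch below works through that arithmetic, assuming one training example per sentence pair, a single pass over the data, and no gradient accumulation. The batch size and throughput it prints are estimates derived from the table, not values reported by the author.

```python
# Figures taken from the Model Details table (all approximate).
num_pairs = 2_500_000   # ~2.5M unique EN-KAB pairs
num_steps = 15_593      # optimizer steps in the single epoch
train_minutes = 76      # ~1h 16min

# Assumption: one example per pair and no gradient accumulation,
# so pairs / steps approximates the effective batch size.
implied_batch_size = num_pairs / num_steps
steps_per_second = num_steps / (train_minutes * 60)

print(f"implied batch size ~ {implied_batch_size:.0f}")
print(f"throughput ~ {steps_per_second:.2f} steps/s")
```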
## Training Data

| Source | Pairs | Description |
|--------|-------|-------------|
| NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
| Tatoeba + CS | ~202K | Community translations + software localization |
| Weblate | ~9K | FLOSS UI strings |
| LibreTranslate | ~449 | User-reviewed translations |
## Performance

Compared to the base `paraphrase-multilingual-mpnet-base-v2` (before fine-tuning on EN–KAB data):

| Metric | Base | This Model | Gain |
|--------|------|------------|------|
| Avg. cosine similarity (EN<->KAB) | 0.278 | **0.857** | **+0.579** (about 58 points) |
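
The card does not state how the averaged score was computed. One plausible reading, sketched here under that assumption, is the mean cosine similarity between each English sentence and its aligned Kabyle translation. The helper name `avg_pair_cosine` is hypothetical, and the toy vectors stand in for `model.encode(...)` output.

```python
import numpy as np

def avg_pair_cosine(en_emb: np.ndarray, kab_emb: np.ndarray) -> float:
    """Mean cosine similarity between row i of en_emb and row i of kab_emb."""
    en = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    kab = kab_emb / np.linalg.norm(kab_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(en * kab, axis=1)))

# Toy 2-D stand-ins for real 768-dimensional embeddings.
en = np.array([[1.0, 0.0], [0.0, 2.0]])
kab = np.array([[1.0, 0.0], [1.0, 0.0]])
score = avg_pair_cosine(en, kab)  # pair 0 is identical, pair 1 is orthogonal
print(score)

# The gain in the table is the difference between the two averages.
gain = 0.857 - 0.278
print(round(gain, 3))
```

Because only aligned pairs are scored, this kind of average says nothing about how well the model separates unrelated sentences; a fuller evaluation would also include negative pairs.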
## Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")

# Embed an English sentence and a Kabyle sentence
sentences = ["Hello!", "Azul!"]
embeddings = model.encode(sentences)

# Cross-lingual cosine similarity (cosine_similarity expects 2-D inputs)
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim[0, 0])  # scalar similarity score
```
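
A common follow-up to pairwise similarity is cross-lingual retrieval: ranking a set of Kabyle candidates against an English query. The sketch below shows only the ranking step. The helper name `rank_by_cosine` is hypothetical, and the toy vectors stand in for embeddings produced by `model.encode(...)`, so it runs without downloading the model.

```python
import numpy as np

def rank_by_cosine(query_emb: np.ndarray, cand_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted from most to least similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

# Toy vectors standing in for model.encode(...) output.
query = np.array([1.0, 0.2])
candidates = np.array([[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
order = rank_by_cosine(query, candidates)
print(order)
```

With the real model, `query` would be `model.encode("Hello!")` and `candidates` would be `model.encode([...])` over a list of Kabyle sentences.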
## Limitations

- Trained primarily on parallel data; monolingual Kabyle similarity was not explicitly optimized.
- Best suited to EN<->KAB cross-lingual tasks; Kabyle<->Kabyle similarity may work but is untested.
- Religious text is overrepresented in the NLLB portion, so the model may underperform on highly technical or modern domains.
- All evaluation pairs were positive, so the evaluator used constant labels (all 1.0) and correlation metrics were undefined.
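
The last limitation follows from how correlation is defined: with constant labels the label variance is zero, so the denominator of the Pearson coefficient vanishes. A minimal sketch with made-up scores (the `pearson` helper is written out here for illustration and is not the evaluator used in training):

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation; returns NaN when either input has zero variance."""
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")

predicted = np.array([0.91, 0.84, 0.77, 0.88])  # made-up cosine scores
constant_labels = np.ones_like(predicted)       # every pair labeled 1.0
varied_labels = np.array([1.0, 0.8, 0.5, 0.9])  # made-up graded labels

undefined = pearson(predicted, constant_labels)  # zero label variance
defined = pearson(predicted, varied_labels)
print(undefined, defined)
```

A label set with graded or negative pairs is needed before correlation-based metrics become meaningful.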
## Future Work

- Train a v2 on the `Davlan/afro-xlmr-large` backbone to benefit from its African-language pretraining
- Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
- Fix the evaluator to use `AvgCosineEvaluator` instead of correlation-based metrics
- Evaluate against LASER on a proper benchmark
## Citation

If you use this model, please cite:

```bibtex
@misc{kabyle-st-mpnet,
  title={Kabyle Sentence Transformer},
  author={boffire},
  year={2026},
  howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
}
```
## Acknowledgments

- Imsidag-community for the cleaned parallel corpora
- Tatoeba contributors for community translations
- Meta AI for LASER and NLLB datasets
- boffire community for Kabyle NLP tooling