---
language:
- en
- kab
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- kabyle
- taqbaylit
- tamazight
- berber
- embeddings
- cross-lingual
- african-languages
- nlp
datasets:
- Imsidag-community/nllb_en_kab
- Imsidag-community/english-kabyle-parallel
- Imsidag-community/libretranslate-suggestions
- ayymen/Weblate-Translations
pipeline_tag: sentence-similarity
---

# Kabyle Sentence Transformer (MPNet)

A sentence embedding model fine-tuned for **Kabyle (Taqbaylit)**–**English** cross-lingual semantic similarity.

## Model Details

| Attribute | Value |
|-----------|-------|
| Base model | `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` |
| Fine-tuning data | ~2.5M unique EN–KAB parallel sentences |
| Embedding dimension | 768 |
| Training framework | SentenceTransformers |
| Training time | ~1h 16min (1 epoch, 15,593 steps) |
| Final loss | 0.043 (started at 0.278) |

## Training Data

| Source | Pairs | Description |
|--------|-------|-------------|
| NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
| Tatoeba + CS | ~202K | Community translations + software localization |
| Weblate | ~9K | FLOSS UI strings |
| LibreTranslate | ~449 | User-reviewed translations |

## Performance

Compared with the base `paraphrase-multilingual-mpnet-base-v2` before fine-tuning:

| Metric | Base | This Model | Gain |
|--------|------|------------|------|
| Avg. cosine similarity (EN<->KAB) | 0.278 | **0.857** | **+0.579** |

## Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")

# Embed one English and one Kabyle sentence
sentences = ["Hello!", "Azul!"]
embeddings = model.encode(sentences)

# Cross-lingual similarity (cosine_similarity returns a 1x1 matrix here)
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim)
```

## Limitations

- Trained primarily on parallel data; monolingual Kabyle similarity was not explicitly optimized
- Best for EN<->KAB cross-lingual tasks; Kabyle<->Kabyle may work but is untested
- Religious text is overrepresented in the NLLB portion; the model may underperform on highly technical or modern domains
- The evaluator used constant labels (all 1.0) because every pair was positive, so correlation metrics were undefined

## Future Work

- Train v2 with a `Davlan/afro-xlmr-large` backbone for African-specific pretraining
- Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
- Fix the evaluator to use `AvgCosineEvaluator` instead of correlation-based metrics
- Evaluate against LASER on a proper benchmark

## Citation

If you use this model, please cite:

```bibtex
@misc{kabyle-st-mpnet,
  title={Kabyle Sentence Transformer},
  author={boffire},
  year={2026},
  howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
}
```

## Acknowledgments

- Imsidag-community for the cleaned parallel corpora
- Tatoeba contributors for community translations
- Meta AI for LASER and NLLB datasets
- boffire community for Kabyle NLP tooling
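
## Appendix: Why the correlation metrics were undefined

The Limitations section notes that correlation metrics were undefined because every evaluation label was 1.0. The sketch below illustrates why, and shows the average-cosine alternative mentioned under Future Work. It is a minimal illustration, not the evaluator used to train this model: it uses toy 2-d vectors in place of real `model.encode(...)` output, and the helper names (`cosine`, `mean_pair_cosine`, `pearson`) are hypothetical and not part of SentenceTransformers.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pair_cosine(src: Sequence[Vector], tgt: Sequence[Vector]) -> float:
    """Average cosine similarity over aligned (source, target) pairs."""
    if len(src) != len(tgt) or not src:
        raise ValueError("need two non-empty lists of equal length")
    return sum(cosine(u, v) for u, v in zip(src, tgt)) / len(src)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return None  # constant input: the denominator would be zero
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / math.sqrt(var_x * var_y)


# Toy 2-d vectors standing in for English and Kabyle sentence embeddings.
en_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kab_vecs = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]

scores = [cosine(u, v) for u, v in zip(en_vecs, kab_vecs)]
labels = [1.0] * len(scores)  # every pair is a positive translation pair

print(pearson(scores, labels))               # None: labels have no variance
print(mean_pair_cosine(en_vecs, kab_vecs))   # still well defined
```

With only positive pairs, the mean cosine similarity remains meaningful as a sanity check, but it cannot show whether the model separates translations from non-translations; that would need negative pairs or a retrieval-style benchmark.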