---
language:
- en
- kab
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- kabyle
- taqbaylit
- tamazight
- berber
- embeddings
- cross-lingual
- african-languages
- nlp
datasets:
- Imsidag-community/nllb_en_kab
- Imsidag-community/english-kabyle-parallel
- Imsidag-community/libretranslate-suggestions
- ayymen/Weblate-Translations
pipeline_tag: sentence-similarity
---
# Kabyle Sentence Transformer (MPNet)
A sentence-embedding model fine-tuned for cross-lingual semantic similarity between **Kabyle (Taqbaylit)** and **English**.
## Model Details
| Attribute | Value |
|-----------|-------|
| Base model | `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` |
| Fine-tuning data | ~2.5M unique EN–KAB parallel sentences |
| Embedding dimension | 768 |
| Training framework | SentenceTransformers |
| Training time | ~1h 16min (1 epoch, 15,593 steps) |
| Final loss | 0.043 (started at 0.278) |
## Training Data
| Source | Pairs | Description |
|--------|-------|-------------|
| NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
| Tatoeba + CS | ~202K | Community translations + software localization |
| Weblate | ~9K | FLOSS UI strings |
| LibreTranslate | ~449 | User-reviewed translations |
## Performance
Compared to the base `paraphrase-multilingual-mpnet-base-v2` checkpoint (pretrained, but not fine-tuned on the EN–KAB data):
| Metric | Base | This Model | Gain |
|--------|------|------------|------|
| Avg. cosine similarity (EN<->KAB) | 0.278 | **0.857** | **+0.579** |
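
The metric above can be read as the mean cosine similarity between each English sentence and its aligned Kabyle translation. The following is a minimal sketch of that computation, not the evaluation script used for this model: the function name `avg_pair_cosine` is hypothetical, and the toy 2-D vectors stand in for the output of `model.encode(...)`.

```python
import numpy as np


def avg_pair_cosine(src_emb, tgt_emb) -> float:
    """Mean cosine similarity between row i of src_emb and row i of tgt_emb."""
    src = np.asarray(src_emb, dtype=np.float64)
    tgt = np.asarray(tgt_emb, dtype=np.float64)
    if src.shape != tgt.shape:
        raise ValueError("src_emb and tgt_emb must have the same shape")
    # L2-normalise each row so the row-wise dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return float(np.mean(np.sum(src * tgt, axis=1)))


# Toy vectors (assumption): in practice these would be
# model.encode(english_sentences) and model.encode(kabyle_sentences).
en = np.array([[1.0, 0.0], [0.0, 2.0]])
kab = np.array([[1.0, 0.0], [1.0, 1.0]])
score = avg_pair_cosine(en, kab)
print(round(score, 3))
```

Because the score only looks at positive pairs, a high value does not show that unrelated sentences are pushed apart; it is best read alongside a retrieval-style metric.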
## Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")

# Embed an English sentence and a Kabyle sentence
sentences = ["Hello!", "Azul!"]
embeddings = model.encode(sentences)  # array of shape (2, 768)

# Cross-lingual cosine similarity (returned as a 1x1 array)
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim)
```
## Limitations
- Trained primarily on parallel data; monolingual Kabyle similarity not explicitly optimized
- Best for EN<->KAB cross-lingual tasks; Kabyle<->Kabyle may work but is untested
- Religious text is overrepresented in the NLLB portion, so the model may underperform on highly technical or modern domains
- The evaluator used constant labels (all 1.0) because every pair was a positive; with zero variance in the labels, correlation metrics such as Pearson and Spearman are undefined
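
To illustrate the last limitation: a correlation coefficient divides by the spread of both the predictions and the labels, so constant labels leave nothing to divide by. The sketch below uses a hand-written `pearson` helper (hypothetical, written for illustration rather than taken from the training code) and made-up similarity scores.

```python
import math

import numpy as np


def pearson(x, y) -> float:
    """Pearson correlation; returns NaN when either input has zero variance."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = math.sqrt(float(np.sum(xc ** 2)) * float(np.sum(yc ** 2)))
    if denom == 0.0:
        # Zero variance in x or y: the coefficient is mathematically undefined.
        return float("nan")
    return float(np.sum(xc * yc) / denom)


predicted = [0.91, 0.84, 0.77]      # made-up cosine scores
constant_labels = [1.0, 1.0, 1.0]   # every pair labelled as a positive
graded_labels = [1.0, 0.5, 0.0]     # labels with some spread, for contrast

undefined = pearson(predicted, constant_labels)
defined = pearson(predicted, graded_labels)
print(undefined, defined)
```

A correlation-based evaluator therefore needs graded or negative examples; with positives only, a mean-similarity or retrieval metric is the more meaningful choice.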
## Future Work
- Train v2 with `Davlan/afro-xlmr-large` backbone for African-specific pretraining
- Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
- Replace the correlation-based evaluator with one that reports average cosine similarity over positive pairs (`AvgCosineEvaluator`)
- Evaluate against LASER on a proper benchmark
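
One way a future benchmark could score parallel data without similarity labels is translation matching: for each English sentence, check whether its nearest Kabyle embedding is the aligned translation. This is a minimal sketch under that assumption, not a planned implementation; `translation_accuracy` is a hypothetical name, and the toy vectors stand in for real embeddings.

```python
import numpy as np


def translation_accuracy(src_emb, tgt_emb) -> float:
    """Fraction of source rows whose most similar target row has the same index."""
    src = np.asarray(src_emb, dtype=np.float64)
    tgt = np.asarray(tgt_emb, dtype=np.float64)
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                 # sims[i, j] = cosine(src[i], tgt[j])
    predicted = sims.argmax(axis=1)    # nearest target for each source row
    return float(np.mean(predicted == np.arange(len(src))))


# Toy vectors (assumption): row i of `kab` is the translation of row i of `en`.
en = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
kab = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]])
acc = translation_accuracy(en, kab)
print(acc)
```

Unlike an average over positive pairs, this score drops when unrelated sentences sit close together, so it can separate two models that both give high similarity to true translations.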
## Citation
If you use this model, please cite:
```bibtex
@misc{kabyle-st-mpnet,
title={Kabyle Sentence Transformer},
author={boffire},
year={2026},
howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
}
```
## Acknowledgments
- Imsidag-community for the cleaned parallel corpora
- Tatoeba contributors for community translations
- Meta AI for LASER and NLLB datasets
- boffire community for Kabyle NLP tooling