# E5 Uzbek Embedding Model (e5-uz-v3)
An Uzbek text embedding model fine-tuned from multilingual-e5-base (XLM-RoBERTa) for semantic search and retrieval in Uzbek.
## Model Details
- Architecture: XLM-RoBERTa (12 layers, 768 hidden, 12 heads)
- Output Dimensions: 768
- Max Sequence Length: 512 tokens
- Similarity Function: Cosine
- Training Data: 36,306 Uzbek query-passage triplets (anchor, positive, negative)
- Loss: CachedMultipleNegativesRankingLoss
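MultipleNegativesRankingLoss trains with in-batch negatives: each anchor is scored against every positive in the batch and must rank its own paired positive first (a cross-entropy over scaled cosine similarities). A minimal NumPy sketch of that objective on toy embeddings — it omits the gradient caching of the `Cached` variant and the explicit hard negatives from the triplets, and the batch data here is invented for illustration:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch MultipleNegativesRankingLoss on toy embeddings.

    Each anchor is scored against every positive in the batch; the
    positive at the same index is the target, all others act as
    negatives (cross-entropy over scaled cosine similarities).
    """
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # shape: (batch, batch)
    # log-softmax over each row, then pick the diagonal (matching pair)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.1 * rng.normal(size=(4, 8))  # near-duplicate pairs
print(round(float(mnrl_loss(anchors, positives)), 4))  # small: pairs match
```

Because every other positive in the batch serves as a free negative, larger batches give more (and harder) contrast, which is what the cached variant makes memory-feasible.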
## Evaluation
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.499 |
| cosine_accuracy@3 | 0.723 |
| cosine_accuracy@5 | 0.801 |
| cosine_accuracy@10 | 0.870 |
| cosine_ndcg@10 | 0.686 |
| cosine_mrr@10 | 0.627 |
| cosine_map@100 | 0.633 |
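These metrics follow the standard information-retrieval definitions: accuracy@k checks whether the relevant passage appears in the top-k results, and MRR@10 averages the reciprocal rank of the first relevant hit. A small self-contained sketch of how they are computed — the passage ids and rankings are invented toy data, not from the actual evaluation set:

```python
def accuracy_at_k(ranked_ids, relevant_id, k):
    # 1 if the relevant passage appears in the top-k, else 0
    return int(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k):
    # reciprocal of the rank of the first relevant hit (0 if absent)
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy run: three queries, each with one relevant passage id
runs = [(["p3", "p1", "p7"], "p1"),   # hit at rank 2
        (["p2", "p9", "p4"], "p2"),   # hit at rank 1
        (["p5", "p6", "p8"], "p0")]   # miss

acc1 = sum(accuracy_at_k(r, rel, 1) for r, rel in runs) / len(runs)
mrr = sum(mrr_at_k(r, rel, 10) for r, rel in runs) / len(runs)
print(round(acc1, 4), round(mrr, 4))  # → 0.3333 0.5
```

So an accuracy@1 of 0.499 means the relevant passage is ranked first for roughly half of the evaluation queries, while accuracy@10 of 0.870 means it almost always lands in the top ten.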
## Usage

```bash
pip install sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Dovud-Asadov/e5-uz-v3")

queries = ["query: O'zbekistonda eng katta shahar qaysi?"]
passages = [
    "passage: Toshkent — O'zbekistonning poytaxti va eng katta shahri.",
    "passage: Bugun ob-havo issiq bo'ladi, harorat 35 darajaga yetadi.",
]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Cosine similarity (range: -1 to 1)
cosine_scores = model.similarity(q_emb, p_emb)

# Normalized similarity (range: 0 to 1)
scores = (cosine_scores + 1) / 2

for i, passage in enumerate(passages):
    print(f"Cosine: {cosine_scores[0][i]:.4f} | Score (0-1): {scores[0][i]:.4f} | {passage}")

# Output:
# Cosine: 0.3737 | Score (0-1): 0.6869 | passage: Toshkent — O'zbekistonning poytaxti va eng katta shahri.
# Cosine: -0.0566 | Score (0-1): 0.4717 | passage: Bugun ob-havo issiq bo'ladi, harorat 35 darajaga yetadi.
```
Prefix queries with `query:` and documents with `passage:` for best results, as with other models in the E5 family.
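Once queries and passages are encoded, retrieval reduces to ranking the corpus by cosine similarity and taking the top-k. A minimal NumPy sketch with toy vectors standing in for real embeddings — the `top_k` helper is illustrative, not part of the sentence-transformers API:

```python
import numpy as np

def top_k(q_emb, p_embs, k=2):
    """Rank passages by cosine similarity to one query.

    Toy sketch: in practice q_emb and p_embs would come from
    model.encode() with the query:/passage: prefixes applied.
    """
    q = q_emb / np.linalg.norm(q_emb)
    p = p_embs / np.linalg.norm(p_embs, axis=1, keepdims=True)
    cos = p @ q                      # cosine similarity per passage
    order = np.argsort(-cos)[:k]     # highest similarity first
    # return (passage index, 0-1 score) pairs, best first
    return [(int(i), float((cos[i] + 1) / 2)) for i in order]

# Toy embeddings: passage 0 points near the query, passage 1 away from it
q = np.array([1.0, 0.2, 0.0])
p = np.array([[0.9, 0.1, 0.1],
              [-0.5, 0.8, 0.0]])
results = top_k(q, p)
print(results)  # passage 0 ranks first with the higher score
```

For corpora too large for a dense matrix product, the same ranking is typically delegated to an approximate nearest-neighbor index built over the normalized passage embeddings.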