E5 Uzbek Embedding Model (e5-uz-v3)

An Uzbek text embedding model fine-tuned from multilingual-e5-base (XLM-RoBERTa) for semantic search and retrieval in Uzbek.

Model Details

  • Architecture: XLM-RoBERTa (12 layers, 768 hidden, 12 heads)
  • Output Dimensions: 768
  • Max Sequence Length: 512 tokens
  • Similarity Function: Cosine
  • Training Data: 36,306 Uzbek query-passage triplets (anchor, positive, negative)
  • Loss: CachedMultipleNegativesRankingLoss
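MultipleNegativesRankingLoss treats, for each anchor, its positive passage as the correct class among all positives and negatives in the batch and applies cross-entropy over scaled cosine similarities; the Cached variant additionally caches embeddings/gradients so large batches fit in memory. A minimal NumPy sketch of the core objective (illustrative only; the real implementation is in sentence_transformers.losses, and the scale value here is just a common default):

```python
import numpy as np

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Multiple-negatives ranking loss over one batch.

    anchors, positives, negatives: (batch, dim) arrays of L2-normalized
    embeddings. For anchor i, positives[i] is the correct match; every
    other positive and every negative acts as an in-batch negative.
    """
    candidates = np.concatenate([positives, negatives], axis=0)  # (2*batch, dim)
    sims = scale * anchors @ candidates.T                        # (batch, 2*batch)
    # Cross-entropy with target class i for row i (the true positive)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    idx = np.arange(len(anchors))
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)
def norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

a = norm(rng.normal(size=(4, 8)))
loss_random = mnrl_loss(a, norm(rng.normal(size=(4, 8))), norm(rng.normal(size=(4, 8))))
loss_aligned = mnrl_loss(a, a, norm(rng.normal(size=(4, 8))))  # positives identical to anchors
print(loss_aligned < loss_random)  # aligned positives should yield the lower loss
```

Training pulls each anchor toward its positive and pushes it away from everything else in the batch, which is why larger (cached) batches tend to help.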

Evaluation

Metric               Value
cosine_accuracy@1    0.499
cosine_accuracy@3    0.723
cosine_accuracy@5    0.801
cosine_accuracy@10   0.870
cosine_ndcg@10       0.686
cosine_mrr@10        0.627
cosine_map@100       0.633
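These metrics follow the standard information-retrieval definitions: accuracy@k is the fraction of queries whose relevant passage appears in the top k results, and MRR@10 averages the reciprocal rank of the first relevant hit (counted as 0 if it falls outside the top 10). A small self-contained sketch with made-up ranks:

```python
def accuracy_at_k(ranks, k):
    """ranks: 1-based rank of the relevant passage per query (None = not retrieved)."""
    return sum(r is not None and r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, truncated at k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

# Hypothetical ranks for 4 queries
ranks = [1, 3, 11, 2]
print(accuracy_at_k(ranks, 1))   # 0.25
print(accuracy_at_k(ranks, 3))   # 0.75
print(mrr_at_k(ranks))           # (1 + 1/3 + 0 + 1/2) / 4 = 0.4583...
```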

Usage

pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Dovud-Asadov/e5-uz-v3")

queries = ["query: O'zbekistonda eng katta shahar qaysi?"]
passages = [
    "passage: Toshkent — O'zbekistonning poytaxti va eng katta shahri.",
    "passage: Bugun ob-havo issiq bo'ladi, harorat 35 darajaga yetadi.",
]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Cosine similarity (range: -1 to 1)
cosine_scores = model.similarity(q_emb, p_emb)

# Normalized similarity (range: 0 to 1)
scores = (cosine_scores + 1) / 2

for i, passage in enumerate(passages):
    print(f"Cosine: {cosine_scores[0][i]:.4f} | Score (0-1): {scores[0][i]:.4f} | {passage}")

# Output:
# Cosine: 0.3737 | Score (0-1): 0.6869 | passage: Toshkent — O'zbekistonning poytaxti va eng katta shahri.
# Cosine: -0.0566 | Score (0-1): 0.4717 | passage: Bugun ob-havo issiq bo'ladi, harorat 35 darajaga yetadi.

Prefix queries with "query: " and documents with "passage: " for best results; the model was fine-tuned with these E5-style prefixes.
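Once passages are embedded, retrieval is just a matrix multiply followed by a sort. A minimal NumPy sketch (random unit vectors stand in for real embeddings; in practice use the model.encode outputs shown above):

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(x):
    """L2-normalize rows so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for model.encode output: 1 query, 5 passages, 768 dims
q = normalize(rng.normal(size=(1, 768)))
p = normalize(rng.normal(size=(5, 768)))

cosine = q @ p.T                     # shape (1, 5); values in [-1, 1] for unit vectors
top_k = np.argsort(-cosine[0])[:3]   # indices of the 3 most similar passages
print(top_k, cosine[0][top_k])
```

For large passage collections the same dot-product scoring is usually delegated to a vector index (e.g. FAISS) rather than a dense matrix multiply.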

Model size: 0.3B params (F32, safetensors)