# E5 Uzbek Embedding Model (e5-uz-v3)
An Uzbek text embedding model fine-tuned from multilingual-e5-base (XLM-RoBERTa) for semantic search and retrieval in Uzbek.
## Model Details
- Architecture: XLM-RoBERTa (12 layers, 768 hidden, 12 heads)
- Output Dimensions: 768
- Max Sequence Length: 512 tokens
- Similarity Function: Cosine
- Training Data: 36,306 Uzbek query-passage triplets (anchor, positive, negative)
- Loss: CachedMultipleNegativesRankingLoss
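MultipleNegativesRankingLoss trains with in-batch negatives: each anchor is scored against every positive in the batch and must rank its own paired positive first (a cross-entropy over scaled cosine similarities). A minimal NumPy sketch of that objective on toy embeddings — it omits the gradient caching of the `Cached` variant and the explicit hard negatives from the triplets, and the batch data here is invented for illustration:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch MultipleNegativesRankingLoss on toy embeddings.

    Each anchor is scored against every positive in the batch; the
    positive at the same index is the target, all others act as
    negatives (cross-entropy over scaled cosine similarities).
    """
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # shape: (batch, batch)
    # log-softmax over each row, then pick the diagonal (matching pair)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
positives = anchors + 0.1 * rng.normal(size=(4, 8))  # near-duplicate pairs
print(round(float(mnrl_loss(anchors, positives)), 4))  # small: pairs match
```

Because every other positive in the batch serves as a free negative, larger batches give more (and harder) contrast, which is what the cached variant makes memory-feasible.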
## Evaluation
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.499 |
| cosine_accuracy@3 | 0.723 |
| cosine_accuracy@5 | 0.801 |
| cosine_accuracy@10 | 0.870 |
| cosine_ndcg@10 | 0.686 |
| cosine_mrr@10 | 0.627 |
| cosine_map@100 | 0.633 |
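These metrics follow the standard information-retrieval definitions: accuracy@k checks whether the relevant passage appears in the top-k results, and MRR@10 averages the reciprocal rank of the first relevant hit. A small self-contained sketch of how they are computed — the passage ids and rankings are invented toy data, not from the actual evaluation set:

```python
def accuracy_at_k(ranked_ids, relevant_id, k):
    # 1 if the relevant passage appears in the top-k, else 0
    return int(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k):
    # reciprocal of the rank of the first relevant hit (0 if absent)
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy run: three queries, each with one relevant passage id
runs = [(["p3", "p1", "p7"], "p1"),   # hit at rank 2
        (["p2", "p9", "p4"], "p2"),   # hit at rank 1
        (["p5", "p6", "p8"], "p0")]   # miss

acc1 = sum(accuracy_at_k(r, rel, 1) for r, rel in runs) / len(runs)
mrr = sum(mrr_at_k(r, rel, 10) for r, rel in runs) / len(runs)
print(round(acc1, 4), round(mrr, 4))  # → 0.3333 0.5
```

So an accuracy@1 of 0.499 means the relevant passage is ranked first for roughly half of the evaluation queries, while accuracy@10 of 0.870 means it almost always lands in the top ten.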
## Usage

```bash
pip install sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Dovud-Asadov/e5-uz-v3")

queries = ["query: O'zbekistonda eng katta shahar qaysi?"]
passages = [
    "passage: Toshkent — O'zbekistonning poytaxti va eng katta shahri.",
    "passage: Bugun ob-havo issiq bo'ladi, harorat 35 darajaga yetadi.",
]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

# Cosine similarity (range: -1 to 1)
cosine_scores = model.similarity(q_emb, p_emb)

# Normalized similarity (range: 0 to 1)
scores = (cosine_scores + 1) / 2

for i, passage in enumerate(passages):
    print(f"Cosine: {cosine_scores[0][i]:.4f} | Score (0-1): {scores[0][i]:.4f} | {passage}")

# Output:
# Cosine: 0.3737 | Score (0-1): 0.6869 | passage: Toshkent — O'zbekistonning poytaxti va eng katta shahri.
# Cosine: -0.0566 | Score (0-1): 0.4717 | passage: Bugun ob-havo issiq bo'ladi, harorat 35 darajaga yetadi.
```
Prefix queries with `query:` and documents with `passage:` for best results, as with other models in the E5 family.
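Once queries and passages are encoded, retrieval reduces to ranking the corpus by cosine similarity and taking the top-k. A minimal NumPy sketch with toy vectors standing in for real embeddings — the `top_k` helper is illustrative, not part of the sentence-transformers API:

```python
import numpy as np

def top_k(q_emb, p_embs, k=2):
    """Rank passages by cosine similarity to one query.

    Toy sketch: in practice q_emb and p_embs would come from
    model.encode() with the query:/passage: prefixes applied.
    """
    q = q_emb / np.linalg.norm(q_emb)
    p = p_embs / np.linalg.norm(p_embs, axis=1, keepdims=True)
    cos = p @ q                      # cosine similarity per passage
    order = np.argsort(-cos)[:k]     # highest similarity first
    # return (passage index, 0-1 score) pairs, best first
    return [(int(i), float((cos[i] + 1) / 2)) for i in order]

# Toy embeddings: passage 0 points near the query, passage 1 away from it
q = np.array([1.0, 0.2, 0.0])
p = np.array([[0.9, 0.1, 0.1],
              [-0.5, 0.8, 0.0]])
results = top_k(q, p)
print(results)  # passage 0 ranks first with the higher score
```

For corpora too large for a dense matrix product, the same ranking is typically delegated to an approximate nearest-neighbor index built over the normalized passage embeddings.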