Dataset: rasyosef/Amharic-Passage-Retrieval-Dataset-V2
This is a SPLADE Sparse Encoder model fine-tuned from rasyosef/roberta-medium-amharic using the sentence-transformers library. It maps sentences and paragraphs to a 32000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
The model was presented in the paper The Multilingual Curse at the Retrieval Layer: Evidence from Amharic.
SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 510, 'do_lower_case': False, 'architecture': 'XLMRobertaForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 32000})
)
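The two modules above can be read as "compute masked-language-model logits for every token, then pool them into one vocabulary-sized vector". The following is a minimal pure-Python sketch of that pooling step, assuming the standard SPLADE formulation in which each vocabulary weight is the maximum over tokens of log(1 + ReLU(logit)). The function name `splade_max_pool` and the toy inputs are hypothetical, and the library's actual implementation works on batched tensors and may differ in detail.

```python
import math


def splade_max_pool(token_logits, attention_mask):
    """Pool per-token MLM logits into one sparse vocabulary-sized vector.

    token_logits: list of per-token logit lists, shape (seq_len, vocab_size).
    attention_mask: list of 0/1 flags, 0 marks padding tokens to ignore.
    Assumption: weight = max over tokens of log(1 + relu(logit)).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, logit in enumerate(logits):
            weight = math.log1p(max(logit, 0.0))  # relu, then log saturation
            if weight > pooled[j]:
                pooled[j] = weight  # 'max' pooling strategy
    return pooled


# Toy example: 3 tokens (the last one is padding) over a 4-term vocabulary.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, -0.2],
    [9.0, 9.0, 9.0, 9.0],  # masked out below
]
toy_mask = [1, 1, 0]

vector = splade_max_pool(toy_logits, toy_mask)
active_dims = sum(1 for w in vector if w > 0)
print(vector, active_dims)
```

Negative logits are zeroed by the ReLU, which is what makes most of the 32000 dimensions exactly zero for any given input.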
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-amharic-medium")
# Run inference
sentences = [
'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 32000]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[44.9874, 24.7096, 0.0000],
# [24.7096, 66.3428, 2.4125],
# [ 0.0000, 2.4125, 69.0888]])
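The similarity scores above are dot products between sparse vectors, so only dimensions that are active in both vectors contribute. The sketch below illustrates this with plain Python dictionaries; the helper names `to_sparse` and `sparse_dot` and the toy vectors are hypothetical and are not part of the sentence-transformers API.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {dimension_index: weight}."""
    return {i: w for i, w in enumerate(dense) if w != 0.0}


def sparse_dot(a, b):
    """Dot product that only visits dimensions active in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[i] for i, w in a.items() if i in b)


# Toy 6-dimensional vectors standing in for 32000-dimensional embeddings.
query = to_sparse([0.0, 1.5, 0.0, 2.0, 0.0, 0.0])
doc_a = to_sparse([0.0, 1.0, 0.5, 1.0, 0.0, 0.0])  # shares dims 1 and 3
doc_b = to_sparse([3.0, 0.0, 0.0, 0.0, 0.0, 1.0])  # shares no dims

scores = {"doc_a": sparse_dot(query, doc_a), "doc_b": sparse_dot(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A score of exactly zero, like the 0.0000 entry in the matrix above, is what this computation yields when two texts have no active dimension in common.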
Information retrieval metrics, evaluated with SparseInformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| dot_accuracy@1 | 0.6286 |
| dot_accuracy@3 | 0.8108 |
| dot_accuracy@5 | 0.8581 |
| dot_accuracy@10 | 0.8956 |
| dot_precision@1 | 0.6286 |
| dot_precision@3 | 0.2703 |
| dot_precision@5 | 0.1716 |
| dot_precision@10 | 0.0896 |
| dot_recall@1 | 0.6286 |
| dot_recall@3 | 0.8108 |
| dot_recall@5 | 0.8581 |
| dot_recall@10 | 0.8956 |
| dot_ndcg@10 | 0.7694 |
| dot_mrr@10 | 0.7282 |
| dot_map@100 | 0.7314 |
| query_active_dims | 60.9588 |
| query_sparsity_ratio | 0.9981 |
| corpus_active_dims | 117.9303 |
| corpus_sparsity_ratio | 0.9963 |
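The table is internally consistent with each query having exactly one relevant passage: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k. The sketch below shows how such metrics could be computed under that assumption, which the card does not state explicitly. The function names and the toy rank list are hypothetical; this is not the evaluator's own code.

```python
import math


def accuracy_at_k(ranks, k):
    """Fraction of queries whose relevant passage appears in the top k.

    ranks: 1-based rank of the single relevant passage per query,
    or None if it was not retrieved at all.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr_at_k(ranks, k):
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)


def ndcg_at_k(ranks, k):
    """nDCG with one relevant passage: ideal DCG is 1, so gain is 1/log2(rank+1)."""
    hits = (r for r in ranks if r is not None and r <= k)
    return sum(1.0 / math.log2(r + 1) for r in hits) / len(ranks)


def sparsity_ratio(active_dims, vocab_size=32000):
    """Share of the vocabulary-sized vector that is exactly zero."""
    return 1.0 - active_dims / vocab_size


toy_ranks = [1, 2, 1, None, 4]  # hypothetical ranks for five queries
print(accuracy_at_k(toy_ranks, 1), accuracy_at_k(toy_ranks, 3))
print(mrr_at_k(toy_ranks, 10), ndcg_at_k(toy_ranks, 10))

# Cross-checks against the reported table values.
print(round(0.8108 / 3, 4))                  # precision@3 from accuracy@3
print(round(sparsity_ratio(60.9588), 4))     # query_sparsity_ratio
print(round(sparsity_ratio(117.9303), 4))    # corpus_sparsity_ratio
```

With a 32000-term vocabulary, roughly 61 active dimensions per query and 118 per passage reproduce the reported sparsity ratios of 0.9981 and 0.9963.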
Training dataset columns: anchor, positive, and negative
Loss: SpladeLoss with these parameters:
{
"loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
"document_regularizer_weight": 0.003,
"query_regularizer_weight": 0.005
}
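To make these parameters concrete, here is a pure-Python sketch of how the pieces could combine: an in-batch ranking loss over dot-product scores plus weighted sparsity regularizers on documents and queries. It assumes the regularizer is the FLOPS penalty from the SPLADE paper (the sum over dimensions of the squared mean activation across the batch); the card only reports the weights, so treat that choice, and all function names here, as assumptions rather than the library's implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(queries, documents, scale=1.0):
    """Softmax cross-entropy where documents[i] is the positive for queries[i].

    Any extra entries at the end of `documents` act as additional negatives.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in documents]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)


def flops(vectors):
    """Assumed regularizer: sum over dims of (mean activation in batch)^2."""
    n, dims = len(vectors), len(vectors[0])
    return sum((sum(v[j] for v in vectors) / n) ** 2 for j in range(dims))


def splade_loss(queries, documents, doc_weight=0.003, query_weight=0.005):
    """Ranking loss plus weighted sparsity penalties (weights from the card)."""
    return (
        ranking_loss(queries, documents)
        + doc_weight * flops(documents)
        + query_weight * flops(queries)
    )


# Toy batch: two anchors, their two positives, and one extra negative.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
documents = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]

ranking = ranking_loss(queries, documents)
total = splade_loss(queries, documents)
print(ranking, flops(documents), flops(queries), total)
```

In this sketch, larger regularizer weights push the model toward fewer active dimensions, at some cost to the ranking term.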
Training hyperparameters:
- eval_strategy: epoch
- per_device_train_batch_size: 48
- per_device_eval_batch_size: 48
- learning_rate: 6e-05
- num_train_epochs: 4
- lr_scheduler_type: cosine
- warmup_ratio: 0.05
- fp16: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates

@inproceedings{alemneh2026amharicir,
title = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
author = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
year = {2026},
}
Base model: rasyosef/roberta-medium-amharic