SPLADE-Amharic-Medium

This is a SPLADE sparse encoder fine-tuned from rasyosef/roberta-medium-amharic using the sentence-transformers library. It maps sentences and paragraphs to a 32000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

The model was presented in the paper The Multilingual Curse at the Retrieval Layer: Evidence from Amharic.

Model Details

Model Description

Model Type: SPLADE Sparse Encoder
Base model: rasyosef/roberta-medium-amharic
Maximum Sequence Length: 510 tokens
Output Dimensionality: 32000 dimensions
Similarity Function: Dot Product
Language: Amharic (am)
License: MIT

Model Sources

Repository: GitHub
Paper: The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Documentation: Sentence Transformers Documentation
Documentation: Sparse Encoder Documentation

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 510, 'do_lower_case': False, 'architecture': 'XLMRobertaForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 32000})
)
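
The two modules above can be read as a short computation: the masked-language-model head produces one logit per vocabulary entry for every token, and the pooling layer keeps, for each vocabulary entry, the largest activation found anywhere in the sequence. Below is a minimal pure-Python sketch of that pooling step, assuming the usual SPLADE formulation log(1 + ReLU(logit)) with max pooling. The function name `splade_max_pool` and the toy 4-entry vocabulary are illustrative and not part of the library.

```python
import math

def splade_max_pool(token_logits, attention_mask):
    """Collapse per-token MLM logits into one vocabulary-sized vector.

    token_logits:   list of rows, one per token, each of length vocab_size
    attention_mask: list of 0/1 flags; 0 marks padding tokens to ignore
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding contributes nothing
        for j, logit in enumerate(row):
            # ReLU zeroes negative logits; log1p dampens large ones
            weight = math.log1p(max(logit, 0.0))
            pooled[j] = max(pooled[j], weight)
    return pooled

# Toy example: 3 tokens (the last one is padding), vocabulary of 4 entries
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
    [9.0,  9.0, 9.0, 9.0],  # padding row, masked out
]
vector = splade_max_pool(logits, [1, 1, 0])
active_dims = sum(1 for w in vector if w > 0)
print(vector, active_dims)
```

Every weight is non-negative, and a dimension stays at zero unless at least one real token produces a positive logit for it, which is what keeps the output sparse.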

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-amharic-medium")
# Run inference
sentences = [
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
    'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# torch.Size([3, 32000])

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[44.9874, 24.7096,  0.0000],
#         [24.7096, 66.3428,  2.4125],
#         [ 0.0000,  2.4125, 69.0888]])
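
Because the similarity function is a dot product and SPLADE weights are non-negative, a score is zero exactly when two vectors share no active dimension, which is consistent with the 0.0000 between the first and third sentences above. The following sketch shows the idea on sparse vectors stored as `{dimension: weight}` dictionaries. The helper names `sparse_dot` and `rank` and all weights are hypothetical and only illustrate the scoring, not the library's implementation.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[d] for d, w in a.items() if d in b)

def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending score."""
    scores = [(i, sparse_dot(query, d)) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical weights; dimension ids stand in for vocabulary token ids
query = {101: 1.2, 205: 0.8}
docs = [
    {101: 1.0, 205: 0.5, 999: 2.0},  # shares both query dimensions
    {999: 1.5, 1234: 0.7},           # shares none, so it scores zero
]
ranking = rank(query, docs)
print(ranking)
```

Only the overlapping dimensions are visited, so the cost of scoring depends on the number of active dimensions rather than on the full 32000-dimensional space.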

Evaluation

Metrics

Sparse Information Retrieval

Evaluated with SparseInformationRetrievalEvaluator

Metric	Value
dot_accuracy@1	0.6286
dot_accuracy@3	0.8108
dot_accuracy@5	0.8581
dot_accuracy@10	0.8956
dot_precision@1	0.6286
dot_precision@3	0.2703
dot_precision@5	0.1716
dot_precision@10	0.0896
dot_recall@1	0.6286
dot_recall@3	0.8108
dot_recall@5	0.8581
dot_recall@10	0.8956
dot_ndcg@10	0.7694
dot_mrr@10	0.7282
dot_map@100	0.7314
query_active_dims	60.9588
query_sparsity_ratio	0.9981
corpus_active_dims	117.9303
corpus_sparsity_ratio	0.9963
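
The sparsity ratios follow directly from the active-dimension counts and the 32000-dimensional output space. The precision and recall figures are also consistent with each query having a single relevant passage, although that is an inference from the numbers and not something stated above. A short check using only values from the table:

```python
VOCAB_SIZE = 32000  # output dimensionality from the model description

def sparsity_ratio(active_dims, dims=VOCAB_SIZE):
    """Average fraction of dimensions that are zero."""
    return 1.0 - active_dims / dims

query_sparsity = round(sparsity_ratio(60.9588), 4)
corpus_sparsity = round(sparsity_ratio(117.9303), 4)
print(query_sparsity, corpus_sparsity)  # 0.9981 0.9963

# Assumption: exactly one relevant passage per query. Then precision@k
# equals accuracy@k / k, and recall@k equals accuracy@k.
accuracy_at_k = {1: 0.6286, 3: 0.8108, 5: 0.8581, 10: 0.8956}
precision_at_k = {k: round(acc / k, 4) for k, acc in accuracy_at_k.items()}
print(precision_at_k)
```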

Training Details

Training Dataset

Amharic Passage Retrieval Dataset V2

Size: 245,876 training samples
Columns: anchor, positive, and negative

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
    "document_regularizer_weight": 0.003,
    "query_regularizer_weight": 0.005
}
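
The parameters above combine a ranking objective with two sparsity penalties, one for documents and one for queries. The sketch below illustrates that combination in plain Python. It assumes the regularizer is the FLOPS penalty commonly used with SPLADE (the regularizer is not named above), and it uses in-batch positives only, ignoring the explicit negative column. All function names are illustrative and do not mirror the library's internals.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ranking_loss(queries, docs, scale=1.0):
    """Cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in docs]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)

def flops_penalty(batch):
    """Sum over dimensions of the squared mean activation in the batch."""
    n = len(batch)
    return sum((sum(col) / n) ** 2 for col in zip(*batch))

def splade_loss(queries, docs, doc_weight=0.003, query_weight=0.005):
    return (ranking_loss(queries, docs)
            + doc_weight * flops_penalty(docs)
            + query_weight * flops_penalty(queries))

# Toy batch: 2 queries and their positive documents over 3 dimensions
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
docs = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
loss = splade_loss(queries, docs)
print(round(loss, 4))
```

The penalty grows with dimensions that are active across many items in a batch, so larger weights push the encoder toward fewer active dimensions per query or document.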

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 48
per_device_eval_batch_size: 48
learning_rate: 6e-05
num_train_epochs: 4
lr_scheduler_type: cosine
warmup_ratio: 0.05
fp16: True
optim: adamw_torch_fused
batch_sampler: no_duplicates
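
For a sense of scale, the hyperparameters above imply roughly the following step counts and learning-rate curve. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that all 245,876 samples are used once per epoch; the no_duplicates batch sampler may shift the real batch count slightly. The schedule is the standard linear-warmup-then-cosine-decay shape, written out here rather than taken from the training code.

```python
import math

SAMPLES, BATCH, EPOCHS = 245_876, 48, 4
PEAK_LR, WARMUP_RATIO = 6e-05, 0.05

steps_per_epoch = math.ceil(SAMPLES / BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
print(steps_per_epoch, total_steps, warmup_steps)  # 5123 20492 1025

def lr_at(step):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < warmup_steps:
        # linear ramp from 0 up to the peak learning rate
        return PEAK_LR * step / warmup_steps
    # cosine decay from the peak down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```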

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}