SPLADE-Amharic-Medium

This is a SPLADE Sparse Encoder model fine-tuned from rasyosef/roberta-medium-amharic using the sentence-transformers library. It maps sentences and paragraphs to a 32000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

The model was presented in the paper "The Multilingual Curse at the Retrieval Layer: Evidence from Amharic".

Model Details

Model Description

  • Model Type: SPLADE Sparse Encoder
  • Base model: rasyosef/roberta-medium-amharic
  • Maximum Sequence Length: 510 tokens
  • Output Dimensionality: 32000 dimensions
  • Similarity Function: Dot Product
  • Language: Amharic (am)
  • License: MIT

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 510, 'do_lower_case': False, 'architecture': 'XLMRobertaForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 32000})
)
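
The SpladePooling settings above (max pooling with a ReLU-based activation) can be illustrated with a small, dependency-free sketch. It assumes the usual SPLADE formulation, in which each vocabulary term receives the weight max over tokens of log(1 + ReLU(logit)); the function name `splade_pool` and the toy logits are hypothetical and are not taken from the library or the model.

```python
import math

def splade_pool(token_logits, attention_mask):
    """Collapse per-token MLM logits into one sparse, vocabulary-sized vector.

    token_logits: list of per-token lists, each of length vocab_size.
    attention_mask: list of 0/1 flags, one per token (0 marks padding).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, logit in enumerate(logits):
            weight = math.log1p(max(logit, 0.0))  # log(1 + ReLU(logit))
            if weight > pooled[j]:
                pooled[j] = weight  # max pooling over the sequence
    return pooled

# Toy input (hypothetical): 3 tokens, the last one padding, vocabulary of 4 terms.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]

vector = splade_pool(toy_logits, toy_mask)
active_dims = sum(1 for w in vector if w > 0)
print(vector, active_dims)
```

Negative and zero logits map to a weight of exactly zero, which is why most of the 32000 output dimensions stay inactive for any given text.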

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-amharic-medium")
# Run inference
sentences = [
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
    'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 32000]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[44.9874, 24.7096,  0.0000],
#         [24.7096, 66.3428,  2.4125],
#         [ 0.0000,  2.4125, 69.0888]])
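
The zero entries in the similarity matrix above follow from how a dot product behaves on sparse vectors: only dimensions that are active in both vectors contribute to the score. The sketch below shows this with vectors stored as index-to-weight dictionaries. The helper `sparse_dot`, the dimension indices, and the weights are all hypothetical illustrations, not real output of this model.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[i] for i, w in a.items() if i in b)

# Hypothetical active dimensions and weights, chosen only for illustration.
coffee_title = {101: 1.2, 205: 0.8, 317: 0.5}
coffee_body = {101: 1.0, 205: 0.6, 940: 1.1}
trade_news = {512: 1.4, 940: 0.3}

title_vs_body = sparse_dot(coffee_title, coffee_body)   # two shared dimensions
title_vs_trade = sparse_dot(coffee_title, trade_news)   # no shared dimensions
body_vs_trade = sparse_dot(coffee_body, trade_news)     # one shared dimension
print(title_vs_body, title_vs_trade, body_vs_trade)
```

The diagonal of a similarity matrix built this way is each vector's squared norm, which is why the scores are not bounded by 1 as they would be with cosine similarity.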

Evaluation

Metrics

Sparse Information Retrieval

Metric                  Value
dot_accuracy@1          0.6286
dot_accuracy@3          0.8108
dot_accuracy@5          0.8581
dot_accuracy@10         0.8956
dot_precision@1         0.6286
dot_precision@3         0.2703
dot_precision@5         0.1716
dot_precision@10        0.0896
dot_recall@1            0.6286
dot_recall@3            0.8108
dot_recall@5            0.8581
dot_recall@10           0.8956
dot_ndcg@10             0.7694
dot_mrr@10              0.7282
dot_map@100             0.7314
query_active_dims       60.9588
query_sparsity_ratio    0.9981
corpus_active_dims      117.9303
corpus_sparsity_ratio   0.9963
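
Several of the reported values are related by simple arithmetic, which the sketch below checks. It rests on two assumptions that are inferred rather than stated in this card: that each query has exactly one relevant passage (suggested by accuracy@k and recall@k being identical), and that the sparsity ratio is one minus the average number of active dimensions divided by the 32000-dimensional output size. The helper names are hypothetical.

```python
VOCAB_SIZE = 32000  # output dimensionality from the model description

# accuracy@k values copied from the metrics table
accuracy_at = {1: 0.6286, 3: 0.8108, 5: 0.8581, 10: 0.8956}

def precision_from_accuracy(accuracy, k):
    """precision@k, assuming every query has exactly one relevant passage."""
    return accuracy / k

def sparsity_ratio(active_dims, vocab_size=VOCAB_SIZE):
    """Average fraction of dimensions that are zero."""
    return 1.0 - active_dims / vocab_size

derived_precision = {
    k: round(precision_from_accuracy(acc, k), 4) for k, acc in accuracy_at.items()
}
query_sparsity = round(sparsity_ratio(60.9588), 4)
corpus_sparsity = round(sparsity_ratio(117.9303), 4)

print(derived_precision)
print(query_sparsity, corpus_sparsity)
```

Under these assumptions the derived numbers agree with the reported dot_precision@k and sparsity ratios to four decimal places.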

Training Details

Training Dataset

Amharic Passage Retrieval Dataset V2

  • Size: 245,876 training samples
  • Columns: anchor, positive, and negative
  • Loss: SpladeLoss with these parameters:
    {
        "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
        "document_regularizer_weight": 0.003,
        "query_regularizer_weight": 0.005
    }
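
How the three parameters above combine can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes the regularizer is the FLOPS penalty commonly used with SPLADE (the sum over dimensions of the squared mean activation across the batch) and that the ranking term is a cross-entropy over dot-product scores, with each query's positive passage at the same index. All function names and toy vectors are hypothetical.

```python
import math

def flops_regularizer(batch):
    """Sum over dimensions of the squared mean activation across the batch."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(dims))

def ranking_loss(queries, documents, scale=1.0):
    """Cross-entropy over dot-product scores; documents[i] is the positive for queries[i].

    Any extra documents beyond len(queries) act as additional negatives.
    """
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * sum(a * b for a, b in zip(q, d)) for d in documents]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)

def splade_loss(queries, documents, doc_weight=0.003, query_weight=0.005):
    """Ranking loss plus weighted sparsity penalties on documents and queries."""
    return (
        ranking_loss(queries, documents)
        + doc_weight * flops_regularizer(documents)
        + query_weight * flops_regularizer(queries)
    )

# Toy batch (hypothetical): 2 queries, 2 positives, and 1 hard negative.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_documents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

base = ranking_loss(toy_queries, toy_documents)
total = splade_loss(toy_queries, toy_documents)
print(base, total)
```

In this reading, the larger query weight (0.005 versus 0.003) pushes queries toward sparser vectors than documents, which matches the lower query_active_dims reported in the evaluation.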
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 48
  • per_device_eval_batch_size: 48
  • learning_rate: 6e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.05
  • fp16: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
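
To give a feel for what these settings imply, the sketch below estimates the number of optimizer steps and the learning rate over training. It assumes a single device, no gradient accumulation, and no dropped batches, none of which this card states; the no_duplicates batch sampler may also change the batch count slightly. The schedule is the standard linear warmup followed by cosine decay to zero, and the constant and function names are hypothetical.

```python
import math

TRAIN_SAMPLES = 245_876
BATCH_SIZE = 48          # per_device_train_batch_size, single device assumed
EPOCHS = 4
PEAK_LR = 6e-5
WARMUP_RATIO = 0.05

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

Under these assumptions training runs for roughly 20,000 steps, with the peak learning rate reached after about 1,000 of them.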

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}

Model size: 42.2M params (Safetensors, tensor type F32)