---
base_model: rasyosef/roberta-medium-amharic
datasets:
- rasyosef/Amharic-Passage-Retrieval-Dataset-V2
language:
- am
library_name: sentence-transformers
license: mit
metrics:
- map
- mrr@10
- ndcg@10
pipeline_tag: text-ranking
tags:
- sentence-transformers
- cross-encoder
- generated_from_trainer
- dataset_size:491752
- loss:BinaryCrossEntropyLoss
model-index:
- name: roberta-amharic-reranker-medium
  results:
  - task:
      type: cross-encoder-reranking
      name: Cross Encoder Reranking
    dataset:
      name: amh passage retrieval dev
      type: amh-passage-retrieval-dev
    metrics:
    - type: mrr@10
      value: 0.805
      name: Mrr@10
    - type: ndcg@10
      value: 0.835
      name: Ndcg@10
---

# reranker-amharic-medium

This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [rasyosef/roberta-medium-amharic](https://huggingface.co/rasyosef/roberta-medium-amharic) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.

This model is part of the research presented in the paper **"The Multilingual Curse at the Retrieval Layer: Evidence from Amharic"**.

- **Paper:** [The Multilingual Curse at the Retrieval Layer: Evidence from Amharic](https://huggingface.co/papers/2605.24556)
- **Code:** [https://github.com/rasyosef/amharic-neural-ir](https://github.com/rasyosef/amharic-neural-ir)

## Model Details

### Model Description

- **Model Type:** Cross Encoder
- **Base model:** [rasyosef/roberta-medium-amharic](https://huggingface.co/rasyosef/roberta-medium-amharic)
- **Maximum Sequence Length:** 510 tokens
- **Number of Output Labels:** 1 label
- **Language:** Amharic (am)
- **License:** MIT

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("rasyosef/reranker-amharic-medium")

# Get scores for pairs of texts
pairs = [
    ['ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና', 'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።'],
    ['ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና', 'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።']
]
scores = model.predict(pairs)
print(scores.shape)
# (2,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'ለውጭ ገበያ በሚቀርበው የኢትዮጵያ ቡና ላይ የተጋረጠው ፈተና',
    [
        'የኢትዮጵያ ዋነኛ የውጭ ምንዛሬ ምንጭ የሆነው ወደ ውጭ የሚላክ ቡና ዘርፍ በአሁኑ ጊዜ ከፍተኛ ውጥረት ውስጥ ገብቷል።',
        'የቻይናው ፕሬዝዳንት ዚ ጂንፒንግ ከትራምፕ ጋር ባደረጉት ጉባኤ ትኩረታቸው በሁለቱ ሀገራት መካከል ለወራት ከተፈጠረ ውጥረት እና የንግድ ጦርነት በኋላ የተረገጋጋ ግንኙነትን ማስቀጠል ነበር።',
    ]
)
print(ranks)
# [{'corpus_id': 0, 'score': ...}, {'corpus_id': 1, 'score': ...}]
```

## Evaluation

### Metrics

#### Cross Encoder Reranking

* Dataset: `amh-passage-retrieval-dev`
* Evaluated with [CrossEncoderRerankingEvaluator](https://sbert.net/docs/package_reference/cross_encoder/evaluation.html#sentence_transformers.cross_encoder.evaluation.CrossEncoderRerankingEvaluator) with these parameters:
  ```json
  {
      "at_k": 10
  }
  ```

| Metric      | Value      |
|:------------|:-----------|
| mrr@10      | 0.805      |
| **ndcg@10** | **0.835**  |

## Training Details

### Training Dataset

#### Amharic Passage Retrieval Dataset V2

* Size: 491,752 training samples
* Columns: query, passage, and label
* Loss: [BinaryCrossEntropyLoss](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
  ```json
  {
      "activation_fn": "torch.nn.modules.linear.Identity",
      "pos_weight": 7
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 64
- `learning_rate`: 4e-05
- `num_train_epochs`: 4
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.05
- `fp16`: True
- `dataloader_num_workers`: 2
- `load_best_model_at_end`: True
- `batch_sampler`: no_duplicates

### Training Logs

| Epoch   | Step      | Training Loss | amh-passage-retrieval-dev_ndcg@10 |
|:-------:|:---------:|:-------------:|:---------------------------------:|
| 1.0     | 7684      | 0.4048        | 0.8289                            |
| 2.0     | 15368     | 0.2366        | 0.8546                            |
| 3.0     | 23052     | 0.1588        | 0.8353                            |
| **4.0** | **30736** | **0.1024**    | **0.8551**                        |

* The bold row denotes the saved checkpoint.

### Framework Versions

- Python: 3.11.13
- Sentence Transformers: 4.1.0
- Transformers: 4.52.4
- PyTorch: 2.6.0+cu124
- Accelerate: 1.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.1

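
For readers wondering what the `pos_weight` of 7 in the loss parameters does, here is a minimal pure-Python sketch of a positively weighted binary cross-entropy on raw logits. It assumes the loss follows the formulation of PyTorch's `BCEWithLogitsLoss`, which to our understanding is what `BinaryCrossEntropyLoss` wraps; the function names `softplus` and `weighted_bce_with_logits` are illustrative and are not part of the training code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, label: int, pos_weight: float = 7.0) -> float:
    """Per-example binary cross-entropy on a raw logit.

    Uses log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    Only the positive-label term is scaled by pos_weight (assumed formulation).
    """
    positive_term = pos_weight * label * softplus(-logit)
    negative_term = (1 - label) * softplus(logit)
    return positive_term + negative_term


if __name__ == "__main__":
    # With an uninformative logit of 0, a missed positive costs
    # pos_weight times as much as a missed negative.
    print(weighted_bce_with_logits(0.0, label=1))
    print(weighted_bce_with_logits(0.0, label=0))
```

A weight above 1 on the positive term is commonly used when negative pairs outnumber positive ones. This card does not state the positive-to-negative ratio of the training data, so treat that reading as an assumption.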
## Citation

```bibtex
@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}
```