How to use kiyam/EmbeddingGemma-300M-Amharic with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kiyam/EmbeddingGemma-300M-Amharic")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

Google EmbeddingGemma-300M fine-tuned on Amharic passage retrieval supervision. This model is presented in the paper:
The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke
The 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026
Code: https://github.com/rasyosef/amharic-neural-ir
| Model | R@5 | R@10 | MRR@10 | NDCG@10 |
|---|---|---|---|---|
| google/embeddinggemma-300m (zero-shot) | 0.558 | 0.621 | 0.448 | 0.489 |
| This model (fine-tuned) | 0.813 | 0.862 | 0.718 | 0.753 |
Fine-tuning yields a +60.3% relative MRR@10 gain over zero-shot.
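The relative gain quoted above follows from the MRR@10 column of the table. As a quick check, here is a minimal sketch of the arithmetic; the variable names are illustrative and not taken from the paper's code:

```python
# MRR@10 values copied from the results table
zero_shot_mrr = 0.448
fine_tuned_mrr = 0.718

# Relative gain = (new - old) / old, expressed as a percentage
relative_gain_pct = (fine_tuned_mrr - zero_shot_mrr) / zero_shot_mrr * 100
print(f"{relative_gain_pct:+.1f}%")  # +60.3%
```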
Evaluation dataset: rasyosef/Amharic-Passage-Retrieval-Dataset-V2
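For readers unfamiliar with the metric columns, the sketch below shows one common way to compute Recall@k, MRR@k, and NDCG@k for a single query. It is an illustration only: the function names are hypothetical, binary relevance (a passage is either relevant or not) is an assumption, and the paper's evaluation code (linked above) may differ in details.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, pid in enumerate(ranked_ids[:k], start=1)
              if pid in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# Toy example (made-up ids): the only relevant passage is ranked second
ranked = ["p3", "p1", "p2"]
relevant = {"p1"}
print(recall_at_k(ranked, relevant, 1))  # 0.0
print(mrr_at_k(ranked, relevant, 10))    # 0.5
```

Under these assumptions, a dataset-level score is the mean of the per-query values over all evaluation queries.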
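```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kiyam/EmbeddingGemma-300M-Amharic")

queries = ["የኢትዮጵያ ዋና ከተማ የትኛው ናት?"]    # "Which city is the capital of Ethiopia?"
passages = ["አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"]  # "Addis Ababa is the capital of Ethiopia."

# Unit-length embeddings, so the dot product below equals cosine similarity
query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)

# scores[i, j] is the similarity between query i and passage j
scores = query_embeddings @ passage_embeddings.T
```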
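In a retrieval setting there are usually many passages per query, and the score matrix is turned into a ranking. Below is a minimal sketch of that step, assuming `scores` has one row per query and one column per passage as in the snippet above. The helper `top_k` is hypothetical, and the toy scores are made up for illustration.

```python
def top_k(score_row, k):
    """Return (passage_index, score) pairs for the k highest-scoring passages."""
    order = sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)
    return [(j, float(score_row[j])) for j in order[:k]]

# Made-up similarity scores for one query against four passages
toy_scores = [[0.12, 0.87, 0.45, 0.60]]

for query_index, row in enumerate(toy_scores):
    print(query_index, top_k(row, k=2))
```

Because the helper only indexes into each row, a row of the NumPy array produced by the matrix product can be passed to it unchanged.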
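This model uses Matryoshka embeddings: you can truncate to shorter dimensions (e.g. 256) for faster retrieval at a small quality cost. Truncate queries and passages to the same dimension:

```python
query_embeddings = model.encode(queries, normalize_embeddings=True)[:, :256]
query_embeddings = query_embeddings / (query_embeddings ** 2).sum(axis=1, keepdims=True) ** 0.5  # re-normalize after truncation
```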
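One detail worth making explicit: slicing an already normalized vector leaves it shorter than unit length, so dot products between truncated vectors are no longer cosine similarities until the vectors are re-normalized. Here is a small sketch in plain Python, using a made-up 3-dimensional vector in place of a real embedding (the helper name is hypothetical):

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Made-up unit vector standing in for a full-size embedding
full = [3 / 13, 4 / 13, 12 / 13]

sliced = full[:2]  # plain slice, no re-normalization
length_after_slice = math.sqrt(sum(x * x for x in sliced))
print(round(length_after_slice, 4))      # 0.3846, not 1.0

fixed = truncate_and_normalize(full, 2)
print([round(x, 4) for x in fixed])      # [0.6, 0.8]
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`. Whether it is available depends on the installed version, so check that version's documentation before relying on it.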
Base model: google/embeddinggemma-300m (300M parameters)

Training data: yosefw/amharic-news-retrieval-dataset-v2-with-negatives-V2

```bibtex
@inproceedings{alemneh2026amharicir,
  title = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year = {2026},
}
```