EmbeddingGemma-300M-Amharic

Google's EmbeddingGemma-300M, fine-tuned for Amharic passage retrieval. The model is presented in the paper:

The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke
The 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026

Code: https://github.com/rasyosef/amharic-neural-ir

Results on Amharic Passage Retrieval Dataset V2

Model                                    R@5    R@10   MRR@10  NDCG@10
google/embeddinggemma-300m (zero-shot)   0.558  0.621  0.448   0.489
This model (fine-tuned)                  0.813  0.862  0.718   0.753

Fine-tuning raises MRR@10 from 0.448 to 0.718, a 60.3% relative gain over the zero-shot base model.
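
The relative gain is the fine-tuned score minus the zero-shot score, divided by the zero-shot score. A small sketch that recomputes it from the table values for every metric (the dictionary names are illustrative, not part of any released code):

```python
# Scores copied from the results table above.
zero_shot = {"R@5": 0.558, "R@10": 0.621, "MRR@10": 0.448, "NDCG@10": 0.489}
fine_tuned = {"R@5": 0.813, "R@10": 0.862, "MRR@10": 0.718, "NDCG@10": 0.753}

# Relative gain in percent: 100 * (new - old) / old
relative_gain = {
    metric: 100.0 * (fine_tuned[metric] - zero_shot[metric]) / zero_shot[metric]
    for metric in zero_shot
}

for metric, gain in relative_gain.items():
    print(f"{metric}: +{gain:.1f}%")
```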

Evaluation dataset: rasyosef/Amharic-Passage-Retrieval-Dataset-V2
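
The card does not spell out how the metrics are computed. As a rough guide, the sketch below uses the standard binary-relevance definitions of Recall@k, MRR@k and NDCG@k for a single query; treating these as the evaluation protocol is an assumption, and the function names and passage ids are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

# Toy example: the only relevant passage is retrieved at rank 2.
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p2"}
print(recall_at_k(ranked, relevant, 5), mrr_at_k(ranked, relevant, 10), ndcg_at_k(ranked, relevant, 10))
```

Dataset-level scores are the mean of these per-query values over all queries.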

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kiyam/EmbeddingGemma-300M-Amharic")

queries = ["የኢትዮጵያ ዋና ከተማ የትኛው ናት?"]
passages = ["አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"]

query_embeddings = model.encode(queries, normalize_embeddings=True)
passage_embeddings = model.encode(passages, normalize_embeddings=True)

scores = query_embeddings @ passage_embeddings.T
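
With more than one passage, the score matrix is usually turned into a ranked list per query. A minimal NumPy sketch follows; the helper name `top_k` and the toy 2-dimensional unit vectors are illustrative stand-ins for real model output.

```python
import numpy as np

def top_k(query_embeddings, passage_embeddings, k=10):
    """Return the indices and scores of the k best passages for each query."""
    scores = query_embeddings @ passage_embeddings.T      # (n_queries, n_passages)
    order = np.argsort(-scores, axis=1)[:, :k]            # highest score first
    return order, np.take_along_axis(scores, order, axis=1)

# Toy unit-length vectors standing in for normalized embeddings.
passage_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_embeddings = np.array([[0.0, 1.0]])

indices, top_scores = top_k(query_embeddings, passage_embeddings, k=2)
print(indices)      # passage positions, best first
print(top_scores)   # matching similarity scores
```

For large collections a full sort becomes expensive; an approximate nearest-neighbor index is the usual replacement, which is outside the scope of this sketch.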

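This model was trained with a Matryoshka objective, so embeddings can be truncated to a shorter dimension (e.g. 256) for faster retrieval at a small quality cost. Loading the model with truncate_dim truncates before normalizing, so dot products remain cosine similarities. Encode passages with the same setting:

model = SentenceTransformer("kiyam/EmbeddingGemma-300M-Amharic", truncate_dim=256)
query_embeddings = model.encode(queries, normalize_embeddings=True)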
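
Slicing a unit-length vector leaves it shorter than unit length, so dot products between sliced vectors are no longer cosine similarities. For embeddings that are already stored at full size, a minimal NumPy sketch of truncating and re-normalizing is shown below; the helper name `truncate_and_normalize` and the random stand-in vectors are illustrative.

```python
import numpy as np

def truncate_and_normalize(embeddings, dim=256):
    """Keep the first `dim` components, then rescale each row to unit length."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)   # guard against zero rows

# Random unit vectors standing in for full 768-dimensional embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_normalize(full, dim=256)
print(np.linalg.norm(full[:, :256], axis=1))   # below 1 after plain slicing
print(np.linalg.norm(small, axis=1))           # back to 1 after re-normalizing
```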

Training Details

  • Base model: google/embeddinggemma-300m (300M parameters)
  • Training data: yosefw/amharic-news-retrieval-dataset-v2-with-negatives-V2
  • Objective: MultipleNegativesRankingLoss + MatryoshkaLoss (dims: 768, 256)
  • Epochs: 6 | Batch size: 32 | Grad. accum.: 4
  • Learning rate: 4e-5, cosine schedule | Precision: BF16
  • Max sequence length: 512
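
To make the objective concrete, here is a NumPy sketch of an in-batch-negatives ranking loss summed over the two Matryoshka dimensions listed above. It is not the training code: the scale of 20 with cosine similarity and the equal weighting of dimensions mirror the sentence-transformers defaults as far as I know, and whether this model used them is an assumption. Hard negatives from the training set are left out for brevity.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates: row i should pick column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                             # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def matryoshka_mnr_loss(anchors, positives, dims=(768, 256)):
    """Sum the ranking loss over each truncated prefix (equal weights assumed)."""
    return sum(mnr_loss(anchors[:, :d], positives[:, :d]) for d in dims)

# Toy batch: each positive is a slightly noisy copy of its anchor.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))

aligned_loss = matryoshka_mnr_loss(anchors, positives)
shuffled_loss = matryoshka_mnr_loss(anchors, positives[::-1])  # mismatched pairs
print(aligned_loss, shuffled_loss)
```

The loss is near zero when each query embedding is closest to its own passage and grows when pairs are mismatched, at both the full and the truncated dimension.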

Citation

@inproceedings{alemneh2026amharicir,
  title     = {The Multilingual Curse at the Retrieval Layer: Evidence from Amharic},
  author    = {Alemneh, Yosef Worku and Mekonnen, Kidist Amde and de Rijke, Maarten},
  booktitle = {Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM), ACL 2026},
  year      = {2026},
}