sentence-embeddings-xllora-mmbert-hau

This model provides sentence embeddings for Hausa using the XL-LoRA method introduced in the paper:

Bootstrapping Embeddings for Low Resource Languages

The model is based on mmBERT and fine-tuned for sentence representation learning in a low-resource setting.

Model Details

  • Base model: mmBERT
  • Method: XL-LoRA
  • Language: Hausa
  • Task: Sentence embeddings / semantic similarity

Intended Use

This model can be used for:

  • semantic similarity
  • sentence retrieval
  • clustering
  • low resource language NLP research
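For the retrieval use case, ranking candidates against a query reduces to cosine similarity over the embedding vectors. A minimal numpy sketch, with toy vectors standing in for `model.encode(...)` outputs (the vectors and dimensions are illustrative, not model outputs):

```python
import numpy as np

# Toy embedding matrix: 3 candidate sentences, 4 dims each.
corpus = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])

# Normalize rows; cosine similarity then reduces to a dot product.
corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = corpus_n @ query_n

# Indices of candidates, best match first.
ranking = np.argsort(-scores)
print(ranking.tolist())  # [0, 2, 1]
```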

Training

The model was fine-tuned using synthetic triplet datasets generated with XL-LoRA adapters.

Full experimental details are provided in the accompanying paper.
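The exact objective and data-generation procedure are described in the paper. Purely as an illustration of how triplet supervision works, here is a minimal numpy sketch of a standard triplet margin loss; the margin value and the Euclidean distance are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: push the positive within `margin` of the anchor
    relative to the negative. Zero when the triplet is satisfied."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # close to the anchor
negative = np.array([-1.0, 0.0])  # far from the anchor
print(triplet_margin_loss(anchor, positive, negative))  # 0.0, triplet satisfied
```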

Embedding construction

Embeddings in the paper are computed using average first–last layer pooling: the first and last transformer layers are combined with fixed equal weights, followed by mean pooling over tokens.

For convenience, the model can also be used with standard mean pooling, but the results may differ slightly from those reported in the paper.
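The pooling arithmetic is simple to express directly. A minimal numpy sketch, using random arrays in place of real hidden states and an attention mask to exclude padding (shapes and mask are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 5, 8
first_layer = rng.normal(size=(seq_len, hidden))  # first transformer layer
last_layer = rng.normal(size=(seq_len, hidden))   # last transformer layer
mask = np.array([1, 1, 1, 1, 0], dtype=float)     # last token is padding

# Average first-last: equal-weight combination of the two layers...
avg_layers = 0.5 * first_layer + 0.5 * last_layer

# ...followed by masked mean pooling over tokens.
sentence_embedding = (avg_layers * mask[:, None]).sum(axis=0) / mask.sum()
print(sentence_embedding.shape)  # (8,)
```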

Usage

1. Simple usage (mean pooling)

This is the simplest way to obtain sentence embeddings.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mbasoz/sentence-embeddings-xllora-mmbert-hau")

sentences = [
    "Wannan jimla ce ta misali.",
    "Wannan samfurin yana ƙirƙirar embeddings na jimloli."
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (num_sentences, embedding_dim)

2. Reproducing the paper setup (avg first last pooling)

To reproduce the embeddings used in the paper, apply average first–last layer pooling followed by mean pooling over tokens.

import torch
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import WeightedLayerPooling, Pooling

model_name = "mbasoz/sentence-embeddings-xllora-mmbert-hau"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the transformer and expose all hidden states for layer pooling.
word_embedding_model = models.Transformer(model_name)
word_embedding_model.auto_model.config.output_hidden_states = True

num_layers = word_embedding_model.auto_model.config.num_hidden_layers
hidden_size = word_embedding_model.get_word_embedding_dimension()

# Fixed, equal weights on the first and last transformer layers; all others zero.
weights = torch.zeros(num_layers, dtype=torch.float)
weights[0] = 0.5
weights[-1] = 0.5

weighted_layer_pooling = WeightedLayerPooling(
    word_embedding_dimension=hidden_size,
    num_hidden_layers=num_layers,
    layer_start=1,  # skip the embedding-layer output
    layer_weights=weights,
)

# Keep the layer weights fixed (they are not trained).
weighted_layer_pooling.layer_weights.requires_grad = False

# Mean pooling over tokens on top of the layer-averaged token embeddings.
pooling_model = Pooling(
    word_embedding_dimension=hidden_size,
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(
    modules=[word_embedding_model, weighted_layer_pooling, pooling_model]
)

model = model.to(device)


sentences = [
    "Wannan jimla ce ta misali.",
    "Wannan samfurin yana ƙirƙirar embeddings na jimloli."
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)

Citation

If you use this model in your research, please cite our paper:

@article{basoz2026bootstrappingembeddings,
  title={Bootstrapping Embeddings for Low Resource Languages},
  author={Merve Basoz and Andrew Horne and Mattia Opper},
  year={2026},
  eprint={2603.01732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01732},
  note={Accepted to the LoResLM Workshop at EACL 2026}
}

License

This model is released under the MIT License.
