sentence-embeddings-xllora-mmbert-kor
This model provides sentence embeddings for Korean using the XL-LoRA method introduced in the paper:
Bootstrapping Embeddings for Low Resource Languages
The model is based on mmBERT and fine-tuned for sentence representation learning in a low-resource setting.
Model Details
- Base model: mmBERT
- Method: XL-LoRA
- Language: Korean
- Task: Sentence embeddings / semantic similarity
Embedding construction
Embeddings are computed with average first-last layer pooling: the first and last hidden layers are combined with fixed equal weights (0.5 each), followed by mean pooling over tokens. For convenience, the model can also be used with standard mean pooling, but the results may differ slightly from those reported in the paper.
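The pooling described above can be sketched in plain tensor arithmetic. This is a minimal illustration on dummy arrays, not the model's actual forward pass; it assumes the layer list is ordered first-to-last and that padding tokens are excluded via the attention mask (the reference setup in the Usage section below uses `WeightedLayerPooling` from sentence-transformers instead).

```python
import numpy as np

def avg_first_last_pool(hidden_states, attention_mask):
    """Average first-last pooling: mix the first and last hidden layers
    with equal (0.5 / 0.5) weights, then mean-pool over non-padding tokens.

    hidden_states: list of arrays, each (batch, seq_len, dim), one per layer
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mixed = 0.5 * hidden_states[0] + 0.5 * hidden_states[-1]  # (batch, seq, dim)
    mask = attention_mask[..., None].astype(mixed.dtype)      # (batch, seq, 1)
    summed = (mixed * mask).sum(axis=1)                       # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)                  # (batch, 1)
    return summed / counts

# Toy example: 3 "layers", batch of 2, 4 tokens, hidden dim 5
rng = np.random.default_rng(0)
layers = [rng.normal(size=(2, 4, 5)) for _ in range(3)]
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
emb = avg_first_last_pool(layers, mask)
print(emb.shape)  # (2, 5): one fixed-size vector per sentence
```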
Intended Use
This model can be used for:
- semantic similarity
- sentence retrieval
- clustering
- low-resource language NLP research
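The uses listed above all reduce to comparing embedding vectors. A minimal sketch of similarity-based retrieval with cosine similarity follows; the vectors here are placeholders standing in for `model.encode(...)` outputs, and the names are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for model.encode(...) outputs
query = np.array([0.2, 0.9, 0.1])
docs = {
    "doc_a": np.array([0.1, 0.8, 0.2]),
    "doc_b": np.array([0.9, 0.1, 0.0]),
}

# Rank documents by similarity to the query (retrieval / semantic similarity)
ranked = sorted(docs, key=lambda k: cosine_sim(query, docs[k]), reverse=True)
print(ranked[0])  # doc_a: its direction is closest to the query's
```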
Training
The model was fine-tuned using synthetic triplet datasets generated with XL-LoRA adapters.
Full experimental details are provided in the accompanying paper.
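As an illustration of how triplet data is typically used, here is a standard margin-based triplet loss on (anchor, positive, negative) embedding vectors. This is a generic sketch, not the paper's objective: the margin value, distance function, and example vectors are all assumptions; the actual training setup is described in the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss (illustrative, not the paper's exact
    objective): push the positive closer to the anchor than the negative,
    by at least `margin` in Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (hypothetical): positive near the anchor, negative far away
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0: triplet already satisfied
```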
Related Resources
- Paper: Bootstrapping Embeddings for Low Resource Languages
- Code: https://github.com/mbasoz/xllora-embedding
- Training dataset: https://huggingface.co/datasets/mbasoz/xllora-datasets
- XL-LoRA adapters: https://huggingface.co/mbasoz/lora-gemma327b-xllora-pos and https://huggingface.co/mbasoz/lora-gemma327b-xllora-neg
Usage
1. Simple usage (mean pooling)
This is the simplest way to obtain sentence embeddings.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mbasoz/sentence-embeddings-xllora-mmbert-kor")

sentences = [
    "์ด๊ฒƒ์€ ์˜ˆ์‹œ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.",
    "์ด ๋ชจ๋ธ์€ ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
```
2. Reproducing the paper setup (avg first last pooling)
To reproduce the embeddings used in the paper, apply average first-last layer pooling followed by mean pooling over tokens.
```python
import torch
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import WeightedLayerPooling, Pooling

model_name = "mbasoz/sentence-embeddings-xllora-mmbert-kor"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The Transformer module must expose all hidden states for layer pooling
word_embedding_model = models.Transformer(model_name)
word_embedding_model.auto_model.config.output_hidden_states = True

num_layers = word_embedding_model.auto_model.config.num_hidden_layers
hidden_size = word_embedding_model.get_word_embedding_dimension()

# Equal (0.5 / 0.5) weights on the first and last layers, zero elsewhere
weights = torch.zeros(num_layers, dtype=torch.float)
weights[0] = 0.5
weights[-1] = 0.5

weighted_layer_pooling = WeightedLayerPooling(
    word_embedding_dimension=hidden_size,
    num_hidden_layers=num_layers,
    layer_start=1,
    layer_weights=weights,
)
# Freeze the layer weights so they stay fixed at 0.5 / 0.5
weighted_layer_pooling.layer_weights.requires_grad = False

# Mean pooling over tokens
pooling_model = Pooling(
    word_embedding_dimension=hidden_size,
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(
    modules=[word_embedding_model, weighted_layer_pooling, pooling_model]
)
model = model.to(device)

sentences = [
    "์ด๊ฒƒ์€ ์˜ˆ์‹œ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.",
    "์ด ๋ชจ๋ธ์€ ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
```
Citation
If you use this model in your research, please cite our paper:
```bibtex
@article{basoz2026bootstrappingembeddings,
  title={Bootstrapping Embeddings for Low Resource Languages},
  author={Merve Basoz and Andrew Horne and Mattia Opper},
  year={2026},
  eprint={2603.01732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01732},
  note={Accepted to the LoResLM Workshop at EACL 2026}
}
```
License
This model is released under the MIT License.