sentence-embeddings-xllora-mmbert-kor
This model provides sentence embeddings for Korean using the XL-LoRA method introduced in the paper:
Bootstrapping Embeddings for Low Resource Languages
The model is based on mmBERT and fine-tuned for sentence representation learning in a low-resource setting.
Model Details
- Base model: mmBERT
- Method: XL-LoRA
- Language: Korean
- Task: Sentence embeddings / semantic similarity
Embedding construction
Embeddings are computed with average first-last layer pooling: the first and last hidden layers are combined with fixed equal weights (0.5 each), followed by mean pooling over tokens. For convenience, the model can also be used with standard mean pooling, but the results may differ slightly from those reported in the paper.
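The pooling described above can be sketched in plain tensor arithmetic. This is a minimal illustration on dummy arrays, not the model's actual forward pass; it assumes the layer list is ordered first-to-last and that padding tokens are excluded via the attention mask (the reference setup in the Usage section below uses `WeightedLayerPooling` from sentence-transformers instead).

```python
import numpy as np

def avg_first_last_pool(hidden_states, attention_mask):
    """Average first-last pooling: mix the first and last hidden layers
    with equal (0.5 / 0.5) weights, then mean-pool over non-padding tokens.

    hidden_states: list of arrays, each (batch, seq_len, dim), one per layer
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mixed = 0.5 * hidden_states[0] + 0.5 * hidden_states[-1]  # (batch, seq, dim)
    mask = attention_mask[..., None].astype(mixed.dtype)      # (batch, seq, 1)
    summed = (mixed * mask).sum(axis=1)                       # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)                  # (batch, 1)
    return summed / counts

# Toy example: 3 "layers", batch of 2, 4 tokens, hidden dim 5
rng = np.random.default_rng(0)
layers = [rng.normal(size=(2, 4, 5)) for _ in range(3)]
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
emb = avg_first_last_pool(layers, mask)
print(emb.shape)  # (2, 5): one fixed-size vector per sentence
```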
Intended Use
This model can be used for:
- semantic similarity
- sentence retrieval
- clustering
- low-resource language NLP research
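The uses listed above all reduce to comparing embedding vectors. A minimal sketch of similarity-based retrieval with cosine similarity follows; the vectors here are placeholders standing in for `model.encode(...)` outputs, and the names are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for model.encode(...) outputs
query = np.array([0.2, 0.9, 0.1])
docs = {
    "doc_a": np.array([0.1, 0.8, 0.2]),
    "doc_b": np.array([0.9, 0.1, 0.0]),
}

# Rank documents by similarity to the query (retrieval / semantic similarity)
ranked = sorted(docs, key=lambda k: cosine_sim(query, docs[k]), reverse=True)
print(ranked[0])  # doc_a: its direction is closest to the query's
```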
Training
The model was fine-tuned using synthetic triplet datasets generated with XL-LoRA adapters.
Full experimental details are provided in the accompanying paper.
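As an illustration of how triplet data is typically used, here is a standard margin-based triplet loss on (anchor, positive, negative) embedding vectors. This is a generic sketch, not the paper's objective: the margin value, distance function, and example vectors are all assumptions; the actual training setup is described in the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss (illustrative, not the paper's exact
    objective): push the positive closer to the anchor than the negative,
    by at least `margin` in Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (hypothetical): positive near the anchor, negative far away
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0: triplet already satisfied
```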
Related Resources
- Paper: Bootstrapping Embeddings for Low Resource Languages
- Code: https://github.com/mbasoz/xllora-embedding
- Training dataset: https://huggingface.co/datasets/mbasoz/xllora-datasets
- XL-LoRA adapters: https://huggingface.co/mbasoz/lora-gemma327b-xllora-pos and https://huggingface.co/mbasoz/lora-gemma327b-xllora-neg
Usage
1. Simple usage (mean pooling)
This is the simplest way to obtain sentence embeddings.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mbasoz/sentence-embeddings-xllora-mmbert-kor")

sentences = [
    "์ด๊ฒƒ์€ ์˜ˆ์‹œ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.",
    "์ด ๋ชจ๋ธ์€ ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
```
2. Reproducing the paper setup (avg first last pooling)
To reproduce the embeddings used in the paper, apply average first-last layer pooling followed by mean pooling over tokens.
```python
import torch
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import WeightedLayerPooling, Pooling

model_name = "mbasoz/sentence-embeddings-xllora-mmbert-kor"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The Transformer module must expose all hidden states for layer pooling
word_embedding_model = models.Transformer(model_name)
word_embedding_model.auto_model.config.output_hidden_states = True

num_layers = word_embedding_model.auto_model.config.num_hidden_layers
hidden_size = word_embedding_model.get_word_embedding_dimension()

# Equal (0.5 / 0.5) weights on the first and last layers, zero elsewhere
weights = torch.zeros(num_layers, dtype=torch.float)
weights[0] = 0.5
weights[-1] = 0.5

weighted_layer_pooling = WeightedLayerPooling(
    word_embedding_dimension=hidden_size,
    num_hidden_layers=num_layers,
    layer_start=1,
    layer_weights=weights,
)
# Freeze the layer weights so they stay fixed at 0.5 / 0.5
weighted_layer_pooling.layer_weights.requires_grad = False

# Mean pooling over tokens
pooling_model = Pooling(
    word_embedding_dimension=hidden_size,
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(
    modules=[word_embedding_model, weighted_layer_pooling, pooling_model]
)
model = model.to(device)

sentences = [
    "์ด๊ฒƒ์€ ์˜ˆ์‹œ ๋ฌธ์žฅ์ž…๋‹ˆ๋‹ค.",
    "์ด ๋ชจ๋ธ์€ ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
```
Citation
If you use this model in your research, please cite our paper:
```bibtex
@article{basoz2026bootstrappingembeddings,
  title={Bootstrapping Embeddings for Low Resource Languages},
  author={Merve Basoz and Andrew Horne and Mattia Opper},
  year={2026},
  eprint={2603.01732},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.01732},
  note={Accepted to the LoResLM Workshop at EACL 2026}
}
```
License
This model is released under the MIT License.