Robust Training for General Text Embeddings via Bagging-Based Model Merging

📖 arXiv Paper | 🤗 Model Trained on English data | 🤗 Model Trained on General data | 🛠️ GitHub

General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging (BOOM), which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, BOOM naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that BOOM consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.
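BOOM's merge step combines several subset-trained models into a single set of weights. As a minimal illustration only (the paper's actual method is Multi-SLERP, not plain averaging), a weighted parameter average over model state dicts can be sketched as follows, with per-parameter tensors represented as plain Python lists:

```python
def bagging_merge(state_dicts, weights):
    """Merge models trained on sampled data subsets by weighted
    parameter averaging. Each state dict maps a parameter name to a
    flat list of floats (a stand-in for a real tensor)."""
    total = sum(weights)
    merged = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors)) / total
            for i in range(len(tensors[0]))
        ]
    return merged


# Example: merging two subset-trained "models" with equal weight.
model_a = {"w": [1.0, 2.0]}
model_b = {"w": [3.0, 4.0]}
merged = bagging_merge([model_a, model_b], [1.0, 1.0])  # {"w": [2.0, 3.0]}
```

The resulting merged model keeps single-model inference cost, which is the efficiency argument made above.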

Training data:

Eng-Text-Data:

  • Retrieval: ELI5, HotpotQA, FEVER, MS MARCO (passage and document ranking), NQ, NLI, SQuAD, TriviaQA, and FiQA.
  • Reranking: StackOverFlowDupQuestions.
  • Classification: AmazonReviewsClassification, Banking77Classification, EmotionClassification, MTOPIntentClassification, ImdbClassification, ToxicConversationsClassification, TweetSentimentExtractionClassification, AmazonCounterfactualClassification.
  • Clustering: Arxiv/Biorxiv/Medrxiv/Reddit/StackExchangeClustering (S2S and P2P variants), TwentyNewsgroupsClustering.
  • Semantic Textual Similarity (STS): STS12, STS22, STSBenchmark.
    About 2M examples in total.

General-Text-Data:

  • Eng-Text-Data
  • DuReader, MIRACL, Mr. TyDi, and T2-Ranking
  • CoRNStack: JavaScript, Java, Python, PHP, and Ruby (500K sampled)
    About 2.8M examples in total.

Models Merged

The following models were included in the merge:

  • Data_mixing_sampled80_full
  • Data_mixingsampled100_full
  • Data_mixing_sampled60_full
  • Data_mixing_sampled20_full
  • Data_mixing_sampled40_full
```yaml
models:
  - model: Data_mixing_sampled60_full
    parameters:
      weight: 0.6
  - model: Data_mixing_sampled20_full
    parameters:
      weight: 0.2
  - model: Data_mixing_sampled40_full
    parameters:
      weight: 0.4
  - model: Data_mixing_sampled80_full
    parameters:
      weight: 0.8
  - model: Data_mixingsampled100_full
    parameters:
      weight: 1.0
merge_method: multislerp
dtype: float32
```

This model was merged using the Multi-SLERP merge method.
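Multi-SLERP generalizes pairwise spherical linear interpolation (SLERP) to more than two models. The pairwise building block can be sketched as follows; this is a minimal illustration over flat parameter vectors, not mergekit's actual multislerp implementation:

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two parameter vectors
    (lists of floats): interpolate along the great-circle arc between
    them rather than along the straight line."""
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (n0 * n1)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

For unit vectors, the midpoint `slerp(v0, v1, 0.5)` stays on the unit sphere (e.g. interpolating halfway between `[1, 0]` and `[0, 1]` gives roughly `[0.707, 0.707]`), which is the property that distinguishes spherical from linear merging.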

Citation

If you find our work helpful, please consider citing our paper:

```bibtex
@article{zhang2026bagging,
  title={Bagging-Based Model Merging for Robust General Text Embeddings},
  author={Zhang, Hengran and Bi, Keping and Guo, Jiafeng and Zhang, Jiaming and Yang, Wenbo and Shi, Daiting and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2602.05787},
  year={2026}
}
```
Model size: 4B parameters (Safetensors, F32)