# multi-sentence-BERTino (V5 - Matryoshka Enhanced)
This is a state-of-the-art sentence-transformers model for the Italian language. It maps sentences and paragraphs to a flexible dense vector space (up to 768 dimensions) and is highly optimized for semantic search, retrieval-augmented generation (RAG), and semantic textual similarity.
## What's New in V5
V5 improves upon V4 with a focus on better embedding compression: the 128-dimension truncated vectors are significantly stronger, making this version more practical for production deployments with storage or latency constraints.
Key changes from V4:

- `CoSENTLoss` replaces `CosineSimilarityLoss` for the STS task, yielding better ranking-aware similarity training.
- Asymmetric Matryoshka weights revised from `[1.0, 0.3, 0.15, 0.1]` to `[1.0, 0.4, 0.2, 0.2]`, placing more training pressure on the 128d subspace.
- Cosine LR scheduler with `weight_decay=0.01` for more stable optimization.
- STS dataset balanced to match the retrieval dataset size, preventing disproportionate STS gradient updates.
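To make the weight revision concrete, the following short calculation (not part of the training code, just arithmetic on the values listed above) compares the fraction of total Matryoshka loss weight assigned to the 128d subspace in V4 versus V5:

```python
# Matryoshka weights for dims [768, 512, 256, 128], as listed above.
v4 = [1.0, 0.3, 0.15, 0.1]
v5 = [1.0, 0.4, 0.2, 0.2]

# Fraction of the total loss weight that the 128d subspace receives.
share_v4 = v4[-1] / sum(v4)  # 0.1 / 1.55 ~= 6.5%
share_v5 = v5[-1] / sum(v5)  # 0.2 / 1.8  ~= 11.1%

print(f"128d weight share: V4 {share_v4:.1%} -> V5 {share_v5:.1%}")
```

The 128d subspace's share of the total loss weight roughly doubles, which is consistent with the stronger 128d retrieval numbers reported below.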
## Model Highlights: Matryoshka Representation Learning
This model was fine-tuned using Matryoshka Representation Learning (MRL). The model has learned to hierarchically compress its semantic knowledge into the earliest dimensions of the vector. You can safely truncate the output embeddings to 512, 256, or 128 dimensions with minimal degradation in retrieval metrics.
Truncating to 128 dimensions allows you to save up to 83% of storage costs in vector databases (like Pinecone, Qdrant, or Milvus) and drastically speed up similarity searches, while still outperforming standard 128d baselines.
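The 83% figure follows directly from the dimension ratio. A quick back-of-the-envelope check, assuming float32 vectors (4 bytes per dimension):

```python
DIM_FULL, DIM_TRUNC = 768, 128
BYTES_PER_FLOAT32 = 4

full_bytes = DIM_FULL * BYTES_PER_FLOAT32    # 3072 bytes per vector
trunc_bytes = DIM_TRUNC * BYTES_PER_FLOAT32  # 512 bytes per vector
saving = 1 - DIM_TRUNC / DIM_FULL            # ~0.833

print(f"{full_bytes} B -> {trunc_bytes} B per vector ({saving:.0%} saved)")
```

The same ratio applies to index size and, approximately, to brute-force similarity search time.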
The model was trained exclusively on Semantic Hard Negatives (mined via dense bi-encoder self-retrieval) to prevent the "false-negative" traps commonly caused by traditional BM25 lexical mining.
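The core of dense hard-negative mining is self-retrieval: score the corpus against each query with the bi-encoder and keep the top-ranked documents that are not the labelled positive. The sketch below illustrates only this selection logic, with toy normalized vectors standing in for real bi-encoder embeddings; the function name and setup are illustrative, not the actual mining pipeline:

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_embs, positive_idx, k=2):
    """Pick the k highest-scoring corpus docs that are NOT the labelled
    positive: these are dense 'hard negatives' for contrastive training."""
    # Cosine similarity; embeddings are assumed L2-normalized.
    scores = corpus_embs @ query_emb
    ranked = np.argsort(-scores)  # best-scoring first
    return [int(i) for i in ranked if i != positive_idx][:k]

# Toy normalized embeddings standing in for real bi-encoder outputs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10, 8))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[3] + 0.1 * rng.normal(size=8)  # query close to doc 3
query /= np.linalg.norm(query)

negatives = mine_hard_negatives(query, corpus, positive_idx=3)
print(negatives)  # indices of the two hardest negatives
```

Because the negatives come from the model's own dense ranking rather than BM25 term overlap, documents that are lexically similar but genuinely relevant (the "false-negative" trap) are less likely to be mislabelled as negatives.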
## Usage

### Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
**Standard Usage (Full 768 Dimensions):**

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nickprock/multi-sentence-BERTino")

sentences = [
    "Chi ha dipinto la Gioconda?",
    "Leonardo da Vinci è l'autore della Gioconda, opera conservata al Louvre.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 768)
```
**Optimized Usage (Truncated to 128 Dimensions):**

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nickprock/multi-sentence-BERTino", truncate_dim=128)

sentences = [
    "Chi ha dipinto la Gioconda?",
    "Leonardo da Vinci è l'autore della Gioconda, opera conservata al Louvre.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 128) -> 83% less memory!
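If you have already stored full 768d vectors, you do not need to re-encode to benefit from truncation: slice off the leading dimensions and re-normalize before computing cosine similarities. This is a generic NumPy sketch, not a `sentence-transformers` API:

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim=128):
    """Keep the first `dim` Matryoshka dimensions, then re-normalize so
    dot products are again valid cosine similarities."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

full = np.random.randn(2, 768).astype(np.float32)  # stand-in for stored vectors
small = truncate_and_renormalize(full, dim=128)
print(small.shape)                      # (2, 128)
print(np.linalg.norm(small, axis=1))    # ~[1. 1.]
```

Re-normalization matters: after slicing, the vectors no longer have unit length, so raw dot products would under-count similarity.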
## Evaluation Metrics
Evaluated on a 5% hold-out split of the Italian retrieval dataset (with semantic hard negatives) and the Italian STS-B dev set, using a standalone evaluation after training.
### Information Retrieval
| Metric | 768d (Full) | 128d (Truncated) |
|---|---|---|
| MAP@100 | 0.8398 | 0.8065 |
| NDCG@10 | 0.8680 | 0.8372 |
| Accuracy@1 | 0.7617 | 0.7233 |
| Accuracy@10 | 0.9584 | 0.9384 |
### Comparison with V4
| Metric | V4 (768d) | V5 (768d) | V4 (128d) | V5 (128d) |
|---|---|---|---|---|
| MAP@100 | 0.8397 | 0.8398 | 0.8002 | 0.8065 (+0.63%) |
| NDCG@10 | 0.8688 | 0.8680 | 0.8332 | 0.8372 (+0.48%) |
| Accuracy@1 | 0.7593 | 0.7617 | 0.7145 | 0.7233 (+1.23%) |
V5 trades a negligible variation at 768d for a substantial improvement at 128d, making compressed embeddings considerably more reliable.
### Semantic Textual Similarity (STS-B Italian Dev)
| Metric | V4 | V5 |
|---|---|---|
| Spearman Cosine | 0.8540 | 0.8549 |
| Pearson Cosine | 0.8574 | 0.8574 |
## Training Details

### Loss Functions
The model was trained in a multi-task setup utilizing Gradient Caching for large logical batch sizes, with each base loss wrapped inside a Matryoshka Loss:

- Information Retrieval Task: `CachedMultipleNegativesRankingLoss` with `mini_batch_size=16` and a logical `batch_size=128`.
- Semantic Similarity Task: `CoSENTLoss` (upgraded from `CosineSimilarityLoss` in V4).
Both base losses were wrapped in MatryoshkaLoss targeting dimensions [768, 512, 256, 128] with weights [1.0, 0.4, 0.2, 0.2].
### Training Datasets

- **task_retrieval**: ~45,000 synthetic Italian search queries generated via LLM (Qwen-2.5-7B) from Italian Wikipedia paragraphs. Each query is paired with 1 positive document and 2 dense hard negatives.
- **task_sts**: the Italian split of `stsb_multi_mt`, balanced to match the retrieval dataset size.
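The card does not say exactly how the STS split was balanced; a common approach is simple upsampling (repeating rows until the two datasets are the same size). The sketch below shows that approach under that assumption; the function name and toy rows are illustrative:

```python
import random

def balance_by_upsampling(sts_rows, target_size, seed=42):
    """Repeat and randomly top up STS rows until they match the retrieval
    dataset size, so neither task dominates the gradient updates."""
    rng = random.Random(seed)
    repeats, remainder = divmod(target_size, len(sts_rows))
    balanced = sts_rows * repeats + rng.sample(sts_rows, remainder)
    rng.shuffle(balanced)
    return balanced

# Toy (sentence1, sentence2, score) rows standing in for stsb_multi_mt.
sts = [("s1a", "s1b", 0.8), ("s2a", "s2b", 0.3), ("s3a", "s3b", 0.5)]
balanced = balance_by_upsampling(sts, target_size=8)
print(len(balanced))  # 8
```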
### Hyperparameters

| Parameter | Value |
|---|---|
| `per_device_train_batch_size` | 128 |
| `num_train_epochs` | 4 |
| `learning_rate` | 1e-05 |
| `lr_scheduler_type` | cosine |
| `warmup_steps` | 10% |
| `weight_decay` | 0.01 |
| `fp16` | True |
| `batch_sampler` | no_duplicates |
| `best_checkpoint` | step 1250 (epoch ~1.76) |
## Citation

### BibTeX

**MatryoshkaLoss**

```bibtex
@misc{kusupati2024matryoshka,
  title={Matryoshka Representation Learning},
  author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
  year={2024},
  eprint={2205.13147},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
**CachedMultipleNegativesRankingLoss**

```bibtex
@misc{gao2021scaling,
  title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
  author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
  year={2021},
  eprint={2101.06983},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
**CoSENTLoss**

```bibtex
@misc{su2022cosent,
  title={CoSENT: A More Efficient Sentence Vector Training Method Than Sentence-BERT},
  author={Jianlin Su},
  year={2022},
  howpublished={\url{https://kexue.fm/archives/8847}}
}
```