multi-sentence-BERTino (V5 - Matryoshka Enhanced)

This is a state-of-the-art sentence-transformers model for the Italian language. It maps sentences and paragraphs to a flexible dense vector space (up to 768 dimensions) and is highly optimized for semantic search, retrieval-augmented generation (RAG), and semantic textual similarity.

What's New in V5

V5 improves upon V4 with a focus on better embedding compression: the 128-dimension truncated vectors are significantly stronger, making this version more practical for production deployments with storage or latency constraints.

Key changes from V4:

  • CoSENTLoss replaces CosineSimilarityLoss for the STS task, yielding better ranking-aware similarity training.
  • Asymmetric Matryoshka weights revised from [1.0, 0.3, 0.15, 0.1] to [1.0, 0.4, 0.2, 0.2], placing more training pressure on the 128d subspace.
  • Cosine LR scheduler with weight_decay=0.01 for more stable optimization.
  • STS dataset balanced to match retrieval dataset size, preventing disproportionate STS gradient updates.

Model Highlights: Matryoshka Representation Learning

This model was fine-tuned using Matryoshka Representation Learning (MRL). The model has learned to hierarchically compress its semantic knowledge into the earliest dimensions of the vector. You can safely truncate the output embeddings to 512, 256, or 128 dimensions with minimal degradation in retrieval metrics.

Truncating to 128 dimensions allows you to save up to 83% of storage costs in vector databases (like Pinecone, Qdrant, or Milvus) and drastically speed up similarity searches, while still outperforming standard 128d baselines.
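The ~83% figure follows directly from the dimension ratio (1 − 128/768 ≈ 0.833). A quick back-of-the-envelope sketch, counting raw float32 storage only (real vector databases add index overhead, and the corpus size here is illustrative):

```python
# Back-of-the-envelope storage comparison for float32 embeddings.
# Illustrative arithmetic only; actual vector-DB overhead varies.
FLOAT32_BYTES = 4

def storage_bytes(num_vectors: int, dim: int) -> int:
    """Raw storage for num_vectors embeddings of the given dimension."""
    return num_vectors * dim * FLOAT32_BYTES

n = 10_000_000  # e.g. ten million indexed paragraphs
full = storage_bytes(n, 768)
truncated = storage_bytes(n, 128)
savings = 1 - truncated / full
print(f"768d: {full / 1e9:.1f} GB, 128d: {truncated / 1e9:.1f} GB, saved: {savings:.1%}")
# 768d: 30.7 GB, 128d: 5.1 GB, saved: 83.3%
```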

The model was trained exclusively on Semantic Hard Negatives (mined via dense bi-encoder self-retrieval) to prevent the "false-negative" traps commonly caused by traditional BM25 lexical mining.
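Conceptually, dense self-retrieval mining looks like the sketch below: embed the corpus with the bi-encoder itself, retrieve each query's nearest neighbors, and keep the top-ranked non-gold documents as hard negatives. Random unit vectors stand in for real embeddings, and the cutoffs are illustrative; the exact mining pipeline used for this model is not published here.

```python
# Minimal sketch of dense hard-negative mining via bi-encoder self-retrieval.
# Random vectors stand in for real model embeddings; numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 768)).astype(np.float32)
queries = rng.normal(size=(10, 768)).astype(np.float32)

# Normalize so the dot product equals cosine similarity.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

positives = np.arange(10)  # assume query i's gold passage is corpus[i]

scores = queries @ corpus.T
hard_negatives = []
for i, row in enumerate(scores):
    ranked = np.argsort(-row)  # corpus indices, most similar first
    # Top-ranked non-gold documents are semantically close, hence "hard".
    # (Real pipelines often also skip the very top ranks or filter by score
    # to avoid unlabeled true positives.)
    candidates = [j for j in ranked if j != positives[i]]
    hard_negatives.append(candidates[:2])
```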

Usage

Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Standard Usage (Full 768 Dimensions):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nickprock/multi-sentence-BERTino")

sentences = [
    "Chi ha dipinto la Gioconda?",
    "Leonardo da Vinci è l'autore della Gioconda, opera conservata al Louvre.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 768)
```

Optimized Usage (Truncated to 128 Dimensions):

```python
from sentence_transformers import SentenceTransformer

# Load the model with output embeddings truncated to the first 128 dimensions
model = SentenceTransformer("nickprock/multi-sentence-BERTino", truncate_dim=128)

sentences = [
    "Chi ha dipinto la Gioconda?",
    "Leonardo da Vinci è l'autore della Gioconda, opera conservata al Louvre.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 128) -> ~83% less memory than 768d
```
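If you already have full 768d embeddings stored, you can also truncate them yourself instead of re-encoding; just remember to re-normalize, since a sliced vector is no longer unit-norm and cosine/dot-product scores would otherwise be skewed. A sketch with stand-in vectors:

```python
# Manual truncation as an alternative to truncate_dim. Random unit vectors
# stand in for real model.encode(...) outputs.
import numpy as np

rng = np.random.default_rng(42)
full = rng.normal(size=(2, 768)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)  # unit-norm, like normalized encodings

truncated = full[:, :128].copy()
# Re-normalize: the first 128 dims alone no longer have unit length.
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (2, 128)
```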

Evaluation Metrics

Evaluation was run standalone after training, on a 5% hold-out split of the Italian retrieval dataset (with semantic hard negatives) and on the Italian STS-B dev set.

Information Retrieval

| Metric | 768d (Full) | 128d (Truncated) |
|---|---|---|
| MAP@100 | 0.8398 | 0.8065 |
| NDCG@10 | 0.8680 | 0.8372 |
| Accuracy@1 | 0.7617 | 0.7233 |
| Accuracy@10 | 0.9584 | 0.9384 |

Comparison with V4

| Metric | V4 (768d) | V5 (768d) | V4 (128d) | V5 (128d) |
|---|---|---|---|---|
| MAP@100 | 0.8397 | 0.8398 | 0.8002 | 0.8065 (+0.63%) |
| NDCG@10 | 0.8688 | 0.8680 | 0.8332 | 0.8372 (+0.48%) |
| Accuracy@1 | 0.7593 | 0.7617 | 0.7145 | 0.7233 (+1.23%) |

V5 trades a negligible variation at 768d for a substantial improvement at 128d, making compressed embeddings considerably more reliable.

Semantic Textual Similarity (STS-B Italian Dev)

| Metric | V4 | V5 |
|---|---|---|
| Spearman Cosine | 0.8540 | 0.8549 |
| Pearson Cosine | 0.8574 | 0.8574 |

Training Details

Loss Functions

The model was trained in a multi-task setup, using gradient caching to reach large logical batch sizes; both task losses were wrapped inside a Matryoshka loss:

  1. Information Retrieval Task: CachedMultipleNegativesRankingLoss with mini_batch_size=16 and a logical batch_size=128.
  2. Semantic Similarity Task: CoSENTLoss (upgraded from CosineSimilarityLoss in V4).

Both base losses were wrapped in MatryoshkaLoss targeting dimensions [768, 512, 256, 128] with weights [1.0, 0.4, 0.2, 0.2].

Training Datasets

  • task_retrieval: ~45,000 synthetic Italian search queries generated via LLM (Qwen-2.5-7B) from Italian Wikipedia paragraphs. Each query is paired with 1 positive document and 2 Dense Hard Negatives.
  • task_sts: The Italian split of stsb_multi_mt, balanced to match the retrieval dataset size.
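One simple way to balance the smaller STS split up to the retrieval dataset's size is resampling with replacement. The sketch below uses an illustrative STS size of 5,749 pairs; the exact balancing method used for this model is an assumption.

```python
# Sketch: upsample a small dataset to a target size by resampling with
# replacement. Sizes and method are illustrative, not the author's exact code.
import random

def balance(dataset: list, target_size: int, seed: int = 42) -> list:
    """Return a dataset of exactly target_size examples."""
    rng = random.Random(seed)
    if len(dataset) >= target_size:
        return dataset[:target_size]
    extra = rng.choices(dataset, k=target_size - len(dataset))
    return dataset + extra

sts = [{"s1": f"a{i}", "s2": f"b{i}", "score": 0.5} for i in range(5749)]
balanced = balance(sts, 45_000)
print(len(balanced))  # 45000
```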

Hyperparameters

| Parameter | Value |
|---|---|
| per_device_train_batch_size | 128 |
| num_train_epochs | 4 |
| learning_rate | 1e-05 |
| lr_scheduler_type | cosine |
| warmup_steps | 10% |
| weight_decay | 0.01 |
| fp16 | True |
| batch_sampler | no_duplicates |
| best_checkpoint | step 1250 (epoch ~1.76) |

Citation

BibTeX

MatryoshkaLoss

@misc{kusupati2022matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2022},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@misc{su2022cosent,
    title={CoSENT: A More Efficient Sentence Vector Training Method Than Sentence-BERT},
    author={Jianlin Su},
    year={2022},
    howpublished={\url{https://kexue.fm/archives/8847}}
}
Model size: 67.6M parameters (Safetensors, F32)