metadata
base_model: intfloat/multilingual-e5-large
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:198
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
Najčešći tipovi uključuju iznad/ispod 2.5, ukupno golova, i klađenje na
broj golova u poluvremenima.
sentences:
- Koji su najčešći tipovi klađenja na golove?
- Koje kladionice u Srbiji nude DNB opciju?
- Šta je hendikep klađenje?
- source_sentence: >-
Facebook grupe posvećene klađenju omogućavaju korisnicima da dobijaju
savete i predloge od velikih zajednica korisnika i kladioničara.
sentences:
- Šta je limit u klađenju?
- Kako se koristi Facebook za klađenje?
- Šta je cash-out opcija u uživo klađenju?
- source_sentence: >-
Najčešći tipovi uključuju klađenje na konačan ishod, broj gemova, broj
setova, i klađenje uživo.
sentences:
- Koje su prednosti praćenja utakmica uživo?
- Koji su najčešći tipovi klađenja na tenis?
- Šta je e-novčanik?
- source_sentence: >-
Premijum provizija je dodatna naknada koju berze kvota mogu naplatiti
igračima za specifične usluge ili dobitke.
sentences:
- Šta je premijum provizija?
- Koje su strategije za uspešno uživo klađenje?
- Kako funkcioniše klađenje na ukupan broj poena timova?
- source_sentence: >-
'Super Jenki' sistem uključuje pet događaja i 26 pojedinačnih opklada,
takođe poznat kao kanadski sistem.
sentences:
- Šta je 'Super Jenki' sistem klađenja?
- Šta je procena verovatnoće?
- Kako klađenje uživo funkcioniše u tenisu?
model-index:
- name: SentenceTransformer based on intfloat/multilingual-e5-large
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 768
type: dim_768
metrics:
- type: cosine_accuracy@1
value: 0.8260869565217391
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.9565217391304348
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.8260869565217391
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.31884057971014484
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000007
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000003
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.8260869565217391
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.9565217391304348
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9271072095125116
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9021739130434783
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9021739130434783
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 512
type: dim_512
metrics:
- type: cosine_accuracy@1
value: 0.8695652173913043
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.8695652173913043
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333332
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000007
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000003
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.8695652173913043
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9461678046583877
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9275362318840579
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9275362318840579
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 256
type: dim_256
metrics:
- type: cosine_accuracy@1
value: 0.8260869565217391
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 1
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.8260869565217391
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.3333333333333332
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000007
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000003
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.8260869565217391
name: Cosine Recall@1
- type: cosine_recall@3
value: 1
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9301212722049728
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.9057971014492753
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.9057971014492753
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 128
type: dim_128
metrics:
- type: cosine_accuracy@1
value: 0.782608695652174
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.9565217391304348
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 1
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.782608695652174
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.31884057971014484
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.20000000000000007
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000003
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.782608695652174
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.9565217391304348
name: Cosine Recall@3
- type: cosine_recall@5
value: 1
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9091552965878422
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.8782608695652173
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.8782608695652173
name: Cosine Map@100
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: dim 64
type: dim_64
metrics:
- type: cosine_accuracy@1
value: 0.8260869565217391
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.9565217391304348
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.9565217391304348
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.8260869565217391
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.31884057971014484
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.19130434782608702
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.10000000000000003
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.8260869565217391
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.9565217391304348
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.9565217391304348
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.9164054079968976
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.8894927536231884
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.8894927536231884
name: Cosine Map@100
SentenceTransformer based on intfloat/multilingual-e5-large
This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large on the json dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: intfloat/multilingual-e5-large
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset:
- json
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
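The Pooling and Normalize stages listed above can be illustrated with a minimal NumPy sketch. It assumes the transformer has already produced per-token vectors; the random `tokens` array and the `mean_pool_and_normalize` helper are hypothetical stand-ins, not part of the library.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Mask-aware mean pooling followed by L2 normalization."""
    # Expand the mask to (batch, seq_len, 1) so padded positions contribute nothing.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    # Scale each sentence vector to unit length, as the Normalize() module does.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Hypothetical token embeddings: 2 sentences, 5 token positions, 1024 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 1024))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])  # first sentence has 2 padding tokens

sentence_vectors = mean_pool_and_normalize(tokens, mask)
print(sentence_vectors.shape)  # (2, 1024)
```

Because the output vectors have unit length, their dot product equals their cosine similarity.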
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("luka023/proba")
# Run inference
sentences = [
"'Super Jenki' sistem uključuje pet događaja i 26 pojedinačnih opklada, takođe poznat kao kanadski sistem.",
"Šta je 'Super Jenki' sistem klađenja?",
'Kako klađenje uživo funkcioniše u tenisu?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
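Since the model was trained with MatryoshkaLoss at 768, 512, 256, 128, and 64 dimensions, the leading components of each embedding are intended to be usable on their own. The sketch below shows one way to shorten embeddings after encoding: keep the first `dim` components and re-normalize. It uses random unit vectors as a stand-in for the output of `model.encode`, and `truncate_and_renormalize` is a hypothetical helper. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which should have a similar effect.

```python
import numpy as np

def truncate_and_renormalize(embeddings, dim):
    """Keep the first `dim` components and rescale each row to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for model.encode(sentences): 3 unit-length vectors of size 1024.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
similarities = small @ small.T  # cosine similarity, since rows are unit length

print(small.shape)         # (3, 256)
print(similarities.shape)  # (3, 3)
```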
Evaluation
Metrics
Information Retrieval
- Dataset: dim_768
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8261 |
| cosine_accuracy@3 | 0.9565 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.8261 |
| cosine_precision@3 | 0.3188 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.8261 |
| cosine_recall@3 | 0.9565 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9271 |
| cosine_mrr@10 | 0.9022 |
| cosine_map@100 | 0.9022 |
Information Retrieval
- Dataset: dim_512
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8696 |
| cosine_accuracy@3 | 1.0 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.8696 |
| cosine_precision@3 | 0.3333 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.8696 |
| cosine_recall@3 | 1.0 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9462 |
| cosine_mrr@10 | 0.9275 |
| cosine_map@100 | 0.9275 |
Information Retrieval
- Dataset: dim_256
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8261 |
| cosine_accuracy@3 | 1.0 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.8261 |
| cosine_precision@3 | 0.3333 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.8261 |
| cosine_recall@3 | 1.0 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9301 |
| cosine_mrr@10 | 0.9058 |
| cosine_map@100 | 0.9058 |
Information Retrieval
- Dataset: dim_128
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.7826 |
| cosine_accuracy@3 | 0.9565 |
| cosine_accuracy@5 | 1.0 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.7826 |
| cosine_precision@3 | 0.3188 |
| cosine_precision@5 | 0.2 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.7826 |
| cosine_recall@3 | 0.9565 |
| cosine_recall@5 | 1.0 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9092 |
| cosine_mrr@10 | 0.8783 |
| cosine_map@100 | 0.8783 |
Information Retrieval
- Dataset: dim_64
- Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8261 |
| cosine_accuracy@3 | 0.9565 |
| cosine_accuracy@5 | 0.9565 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.8261 |
| cosine_precision@3 | 0.3188 |
| cosine_precision@5 | 0.1913 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.8261 |
| cosine_recall@3 | 0.9565 |
| cosine_recall@5 | 0.9565 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9164 |
| cosine_mrr@10 | 0.8895 |
| cosine_map@100 | 0.8895 |
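The relationships between the metrics in these tables can be checked with a short worked example. The reported values are consistent with 23 evaluation queries that each have exactly one relevant passage; this is an inference from the numbers (for instance 0.8261 ≈ 19/23), not something the card states. Under that assumption, accuracy@k equals recall@k, precision@k is accuracy@k divided by k, and MRR equals MAP. The `ranks` list below is hypothetical: it is one ranking outcome that reproduces the dim_768 figures, not the actual evaluation output.

```python
import math

def single_relevant_ir_metrics(ranks, k):
    """Metrics at cutoff k when every query has exactly one relevant document.

    ranks: 1-based position of the relevant document for each query.
    """
    n = len(ranks)
    hits = [r for r in ranks if r <= k]
    return {
        "accuracy": len(hits) / n,        # fraction of queries with a hit in the top k
        "precision": len(hits) / (k * n), # at most 1 of the k retrieved docs is relevant
        "recall": len(hits) / n,          # identical to accuracy with one relevant doc
        "mrr": sum(1.0 / r for r in hits) / n,
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hits) / n,  # ideal DCG is 1
    }

# Hypothetical outcome: 19 queries ranked first, 3 ranked second, 1 ranked fourth.
ranks = [1] * 19 + [2] * 3 + [4]

at_1 = single_relevant_ir_metrics(ranks, 1)
at_3 = single_relevant_ir_metrics(ranks, 3)
at_10 = single_relevant_ir_metrics(ranks, 10)

print(round(at_1["accuracy"], 4))   # 0.8261
print(round(at_3["precision"], 4))  # 0.3188
print(round(at_10["mrr"], 4))       # 0.9022
print(round(at_10["ndcg"], 4))      # 0.9271
```

This also explains why precision@5 and precision@10 top out at 0.2 and 0.1: with a single relevant passage per query, at most one of the k retrieved passages can be correct.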
Training Details
Training Dataset
json
- Dataset: json
- Size: 198 training samples
- Columns:
positive and anchor
- Approximate statistics based on the first 198 samples:
| | positive | anchor |
|---|---|---|
| type | string | string |
| details | min: 19 tokens, mean: 33.76 tokens, max: 53 tokens | min: 6 tokens, mean: 12.87 tokens, max: 21 tokens |
- Samples:
| positive | anchor |
|---|---|
| Klađenje na ukupan broj poena timova podrazumeva predviđanje da li će jedan tim postići više ili manje poena od postavljene granice, nezavisno od konačnog ishoda. | Kako funkcioniše klađenje na ukupan broj poena timova? |
| Konačan ishod podrazumeva klađenje na to ko će pobediti u utakmici, pri čemu postoje tri mogućnosti: pobeda domaćina, pobeda gosta ili nerešeno. | Šta znači klađenje na konačan ishod? |
| Patent opklada uključuje tri događaja sa ukupno sedam pojedinačnih opklada: tri singl, tri dubl i jedna trostruka opklada. | Šta je patent opklada? |
- Loss: MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
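To make the loss configuration concrete, here is a minimal NumPy sketch of the idea: a ranking loss with in-batch negatives is evaluated on the first 768, 512, 256, 128, and 64 embedding components, and the results are combined with equal weights. This is an illustration under stated assumptions, not the library's implementation: the `scale=20.0` factor on cosine similarities is assumed to match the library default, the helper names are hypothetical, and the embeddings are random stand-ins.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    Row i of `positives` is the correct match for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # true pairs sit on the diagonal

def matryoshka_loss(anchors, positives,
                    dims=(768, 512, 256, 128, 64),
                    weights=(1, 1, 1, 1, 1)):
    """Weighted sum of the inner loss over truncated embeddings."""
    return sum(
        w * multiple_negatives_ranking_loss(anchors[:, :d], positives[:, :d])
        for d, w in zip(dims, weights)
    )

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 1024))
matched = anchors + 0.1 * rng.normal(size=(8, 1024))  # positives close to their anchors
unrelated = rng.normal(size=(8, 1024))                # positives unrelated to the anchors

loss_matched = matryoshka_loss(anchors, matched)
loss_unrelated = matryoshka_loss(anchors, unrelated)
print(loss_matched < loss_unrelated)  # True
```

Because every truncation contributes to the total, the model is pushed to keep the most useful information in the leading dimensions.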
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 16
- learning_rate: 2e-05
- num_train_epochs: 4
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- bf16: True
- tf32: False
- load_best_model_at_end: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates
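Two of these settings interact in ways that are easy to work out by hand. With a per-device batch size of 32 and 16 gradient accumulation steps, each optimizer update aggregates up to 32 × 16 = 512 samples per device. The in-batch negatives used by the ranking loss still come from each batch of 32, since accumulation only sums gradients. The sketch below also shows the general shape of a cosine schedule with 10% linear warmup at a peak rate of 2e-05. The `total_steps` value is hypothetical, and the function is a simplified stand-in for the scheduler the trainer actually uses.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Samples aggregated per optimizer update on one device.
effective_batch = 32 * 16
print(effective_batch)  # 512

# Hypothetical run length, chosen only to make the curve visible.
total_steps = 100
schedule = [cosine_lr_with_warmup(s, total_steps) for s in range(total_steps + 1)]
print(max(schedule))  # 2e-05
```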
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 16
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: False
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
Training Logs
| Epoch | Step | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_64_cosine_map@100 | dim_768_cosine_map@100 |
|---|---|---|---|---|---|---|
| 1.0 | 1 | 0.6717 | 0.7663 | 0.8229 | 0.5755 | 0.8242 |
| 2.0 | 2 | 0.7779 | 0.8457 | 0.8638 | 0.7833 | 0.8635 |
| 3.0 | 4 | 0.8410 | 0.8732 | 0.8674 | 0.8167 | 0.8659 |
| 1.0 | 1 | 0.8410 | 0.8732 | 0.8674 | 0.8167 | 0.8659 |
| 2.0 | 2 | 0.8845 | 0.8732 | 0.9022 | 0.858 | 0.9022 |
| **3.0** | **4** | **0.8783** | **0.9058** | **0.9275** | **0.8895** | **0.9022** |
- The bold row denotes the saved checkpoint.
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.1.0
- Transformers: 4.44.2
- PyTorch: 2.4.0+cu121
- Accelerate: 0.33.0
- Datasets: 3.0.0
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}