SentenceTransformer based on FacebookAI/xlm-roberta-base

This is a sentence-transformers model fine-tuned from FacebookAI/xlm-roberta-base on the en-sa (English-Sanskrit) dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: FacebookAI/xlm-roberta-base
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • en-sa

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
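The Pooling module above is configured for mean pooling (`pooling_mode_mean_tokens: True`): the sentence embedding is the average of the token embeddings, with padding positions excluded via the attention mask. A minimal NumPy sketch of masked mean pooling, for illustration only (the library implements this internally in PyTorch):

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim) array of per-token vectors
    attention_mask:   (seq_len,) array of 1s for real tokens, 0s for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / np.maximum(count, 1e-9)         # avoid division by zero

# Toy example: 3 tokens (last one is padding), dim 4 instead of 768
tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                   [3.0, 4.0, 5.0, 6.0],
                   [9.0, 9.0, 9.0, 9.0]])  # padding row, must be ignored
mask = np.array([1, 1, 0])
print(mean_pooling(tokens, mask))  # [2. 3. 4. 5.]
```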

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("saikasyap/samskritam-xlm-roberata-base")
# Run inference
sentences = [
    'अनन्तरं Microbiology इति टङ्कनं करोमि ।',
    'Then I will type Microbiology.',
    'The comic image basically consists of three parts.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
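Since the similarity function is cosine similarity (see Model Details), `model.similarity(embeddings, embeddings)` is equivalent to L2-normalizing the embeddings and taking their matrix product. A hedged NumPy sketch of that computation on arbitrary vectors:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity, matching what model.similarity computes
    when the configured similarity function is cosine."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)  # guard zero vectors
    return normalized @ normalized.T

# Toy 3x4 "embeddings" (the real model outputs 3x768)
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0, 0.0],   # identical to row 0 -> similarity 1
                [0.0, 1.0, 0.0, 0.0]])  # orthogonal to row 0 -> similarity 0
sims = cosine_similarity_matrix(emb)
print(sims.shape)  # (3, 3)
```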

Training Details

Training Dataset

en-sa

  • Dataset: en-sa
  • Size: 140,359 training samples
  • Columns: non_english, english, and label
  • Approximate statistics based on the first 1000 samples:

    |         | non_english                                         | english                                             | label              |
    |:--------|:----------------------------------------------------|:----------------------------------------------------|:-------------------|
    | type    | string                                              | string                                              | list               |
    | details | min: 26 tokens, mean: 42.11 tokens, max: 128 tokens | min: 19 tokens, mean: 50.63 tokens, max: 128 tokens | size: 768 elements |
  • Samples:

    | non_english | english | label |
    |:------------|:--------|:------|
    | ॐ तपः स्वाध्यायनिरतं तपस्वी वाग्विदां वरम्। नारदं परिपप्रच्छ वाल्मीकिर्मुनिपुङ्गवम्॥ | The ascetic Vālmīki asked Nārada, the best of sages and foremost of those conversant with words, ever engaged in austerities and Vedic studies. | [0.15034635365009308, 0.35359007120132446, -0.3348075747489929, 0.15415771305561066, 0.020571526139974594, ...] |
    | कोन्वस्मिन् साम्प्रतं लोके गुणवान् कश्च वीर्यवान्। धर्मज्ञश्च कृतज्ञश्च सत्यवाक्यो दृढत्नतः॥ | Who at present in this world is like crowned with qualities, and with prowess, knowing duty, and grateful, and truthful, and firm in vow. | [-0.46556514501571655, 0.4740210175514221, -0.2033461034297943, -1.6129034757614136, -0.016881834715604782, ...] |
    | चारित्रेण च को युक्तः सर्वभूतेषु को हितः। विद्वान् कः कः समर्थश्च कश्चैकप्रियदर्शनः॥ | Who is qualified by virtue of his character, and who is engaged in the welfare of all creatures? Who is learned and capable. Who alone is ever lovely to behold? | [-0.09693514555692673, 0.4206468462944031, -0.3034357726573944, -1.2955875396728516, 0.3836270868778229, ...] |
  • Loss: MSELoss

Evaluation Dataset

en-sa

  • Dataset: en-sa
  • Size: 1,000 evaluation samples
  • Columns: non_english, english, and label
  • Approximate statistics based on the first 1000 samples:

    |         | non_english                                        | english                                            | label              |
    |:--------|:---------------------------------------------------|:---------------------------------------------------|:-------------------|
    | type    | string                                             | string                                             | list               |
    | details | min: 4 tokens, mean: 27.87 tokens, max: 91 tokens  | min: 4 tokens, mean: 21.35 tokens, max: 68 tokens  | size: 768 elements |
  • Samples:

    | non_english | english | label |
    |:------------|:--------|:------|
    | तथा दिमागी रूप से तंदुरुस्त हों । | And also be mentally fit. | [0.2053176611661911, -0.15136581659317017, -0.1492331326007843, -0.13915303349494934, -0.08056919276714325, ...] |
    | अपरञ्च युष्माकम् आनन्दो यत् सम्पूर्णो भवेद् तदर्थं वयम् एतानि लिखामः। | """And these things write we unto you, that your joy may be full.""" | [0.0013895286247134209, 0.09506042301654816, -0.3513864576816559, -0.6496815085411072, 0.7649527192115784, ...] |
    | पञ्च व्यञ्जनानां तेषां च एकम्-एकम स्वास्थ्यसम्बन्धीकार्याणां च सूचीं निर्मातु। | List five spices and one health benefits of each. | [0.37307825684547424, 0.8675527572631836, 0.6388981342315674, -0.27114301919937134, -0.30143851041793823, ...] |
  • Loss: MSELoss
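The MSELoss objective here is the multilingual knowledge-distillation setup: each `label` is a precomputed 768-dimensional teacher embedding, and the student model is trained so that its embedding of the non-English sentence matches it. A minimal NumPy sketch of the objective itself (the library computes this in PyTorch during training):

```python
import numpy as np

def mse_distillation_loss(student_embeddings, teacher_embeddings):
    """Mean squared error between student and teacher sentence embeddings."""
    diff = student_embeddings - teacher_embeddings
    return float(np.mean(diff ** 2))

# Toy batch of 2 "sentences" with dim 4 (the real labels have 768 elements)
teacher = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8]])
student = teacher + 0.1  # student is off by 0.1 in every dimension
loss = mse_distillation_loss(student, teacher)
print(loss)  # ~0.01, i.e. 0.1 squared
```

Driving this loss to zero makes the student's Sanskrit embeddings land at the same points in the vector space as the teacher's English embeddings, which is what enables cross-lingual retrieval.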

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.1
  • fp16: True
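The non-default hyperparameters above can be reproduced with the Sentence Transformers trainer. A sketch, assuming sentence-transformers >= 3.0 (the `output_dir` path is a placeholder, not from this card):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output",              # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_ratio=0.1,                 # first 10% of steps are LR warmup
    fp16=True,                        # mixed-precision training
)
```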

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch  | Step | Training Loss |
|:-------|:-----|:--------------|
| 0.1821 | 100  | 0.5076        |
| 0.3643 | 200  | 0.2330        |
| 0.5464 | 300  | 0.2172        |
| 0.7286 | 400  | 0.2033        |
| 0.9107 | 500  | 0.1943        |
| 1.0929 | 600  | 0.1878        |
| 1.2750 | 700  | 0.1813        |
| 1.4572 | 800  | 0.1754        |
| 1.6393 | 900  | 0.1721        |
| 1.8215 | 1000 | 0.1688        |
| 2.0036 | 1100 | 0.1664        |
| 2.1858 | 1200 | 0.1632        |
| 2.3679 | 1300 | 0.1606        |
| 2.5501 | 1400 | 0.1588        |
| 2.7322 | 1500 | 0.1566        |
| 2.9144 | 1600 | 0.1558        |
| 3.0965 | 1700 | 0.1540        |
| 3.2787 | 1800 | 0.1525        |
| 3.4608 | 1900 | 0.1508        |
| 3.6430 | 2000 | 0.1500        |
| 3.8251 | 2100 | 0.1493        |
| 4.0073 | 2200 | 0.1490        |
| 4.1894 | 2300 | 0.1479        |
| 4.3716 | 2400 | 0.1471        |
| 4.5537 | 2500 | 0.1466        |
| 4.7359 | 2600 | 0.1461        |
| 4.9180 | 2700 | 0.1466        |

Framework Versions

  • Python: 3.10.17
  • Sentence Transformers: 4.1.0
  • Transformers: 4.46.3
  • PyTorch: 2.2.0+cu121
  • Accelerate: 1.1.1
  • Datasets: 2.18.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MSELoss

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
Model size: 0.3B parameters (F32, safetensors)