tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:64260
  - loss:CosineSimilarityLoss
base_model: yahyaabd/allstats-search-mini-v1-1-mnrl
widget:
  - source_sentence: q-2216
    sentences:
      - Statistik Potensi Desa Provinsi Jambi 2008
      - Indeks Harga Sahsm
      - 17cb76daaeda2a9d92a30af3
  - source_sentence: q-4069
    sentences:
      - 61e74412ad7c948492537b61
      - Ihpb Indonesia Tahun 2014
      - Indeks Harga Perdagangan Besar Indonesia 2014, 2010=100
  - source_sentence: q-748
    sentences:
      - 20dac9022b69b62ab3479d37
      - Statistik Potensi Desa Provinsi Sulawesi Utara 2014
      - data potensi dpsa di Provinsi Sulawesi Utara tahun 2014
  - source_sentence: q-7475
    sentences:
      - >-
        Harga Konsumen Beberapa Barang dan Jasa Kelompok Kesehatan,
        Transportasi, dan Pendidikan 90 Kota di Indonesia 2021
      - Volume ekspor CPO Indonesia
      - b2dbf308898a6d1748629240
  - source_sentence: q-786
    sentences:
      - Statistik eCommerce 2022/2023
      - Angka Kematian Bayi oper P#rovinsi
      - f3b02f2b6706e104ea9d5b74
datasets:
  - yahyaabd/bps-pub-cosine-pairs
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: SentenceTransformer based on yahyaabd/allstats-search-mini-v1-1-mnrl
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts dev
          type: sts-dev
        metrics:
          - type: pearson_cosine
            value: 0.9040861364751858
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8334861589775715
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts test
          type: sts-test
        metrics:
          - type: pearson_cosine
            value: 0.9069041337320248
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8380868510850786
            name: Spearman Cosine

SentenceTransformer based on yahyaabd/allstats-search-mini-v1-1-mnrl

This is a sentence-transformers model fine-tuned from yahyaabd/allstats-search-mini-v1-1-mnrl on the bps-pub-cosine-pairs dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
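
The Pooling module above enables only pooling_mode_mean_tokens, so each sentence embedding is the average of its token embeddings. The following is a minimal pure-Python sketch of masked mean pooling. The function name `mean_pool` and the toy 3-dimensional vectors are illustrative assumptions; the real model works on 384-dimensional tensors.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        return [0.0] * dim  # degenerate case: no real tokens
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (hypothetical values): three real tokens followed by one padding token.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padding vector is excluded by the mask, so it does not affect the pooled result.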

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yahyaabd/allstats-search-mini-v2")
# Run inference
sentences = [
    'q-786',
    'Angka Kematian Bayi oper P#rovinsi',
    'f3b02f2b6706e104ea9d5b74',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
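
`model.similarity` returns a matrix of pairwise scores. Assuming the model uses cosine similarity (the library default, and the quantity the training loss targets), the computation can be sketched in plain Python as follows. The helper names and the toy 2-dimensional vectors are illustrative assumptions and not part of the library.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarities, one row per embedding."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


# Hypothetical toy embeddings standing in for the 384-dimensional model outputs.
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = similarity_matrix(embs)
for row in sims:
    print([round(x, 4) for x in row])
```

For a retrieval setting, the same scores can be used to rank candidate titles against a query by sorting one row of the matrix in descending order.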

Evaluation

Metrics

Semantic Similarity

Metric | sts-dev | sts-test
pearson_cosine | 0.9041 | 0.9069
spearman_cosine | 0.8335 | 0.8381
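
The two metrics are the Pearson and Spearman correlations between the cosine similarity predicted for each pair and its gold score. The sketch below implements both in plain Python to show what is being measured; the function names and the sample values are hypothetical, and this is not the evaluator's own code.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predicted cosine similarities and gold scores for four pairs.
predicted = [0.12, 0.35, 0.80, 0.95]
gold = [0.1, 0.5, 0.9, 0.9]
print(round(pearson(predicted, gold), 4), round(spearman(predicted, gold), 4))
```

Spearman depends only on the ordering of the scores, which is why it can differ noticeably from Pearson when gold scores take a few discrete values with many ties.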

Training Details

Training Dataset

bps-pub-cosine-pairs

  • Dataset: bps-pub-cosine-pairs at 038a9de
  • Size: 64,260 training samples
  • Columns: query_id, query, corpus_id, title, and score
  • Approximate statistics based on the first 1000 samples:
    • query_id (string): min 4 tokens, mean 5.18 tokens, max 6 tokens
    • query (string): min 4 tokens, mean 13.33 tokens, max 38 tokens
    • corpus_id (string): min 7 tokens, mean 17.38 tokens, max 22 tokens
    • title (string): min 5 tokens, mean 13.13 tokens, max 30 tokens
    • score (float): min 0.1, mean 0.56, max 0.9
  • Samples:
    query_id | query | corpus_id | title | score
    q-1599 | Nilai Tukar Nelayan | 0b0da8fc2b6af9329a6d9cfe | Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013 | 0.1
    q-1599 | nilai tukar nelayan | 0b0da8fc2b6af9329a6d9cfe | Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013 | 0.1
    q-1599 | NILAI TUKAR NELAYAN | 0b0da8fc2b6af9329a6d9cfe | Statistik Hotel dan Akomodasi Lainnya di Indonesia 2013 | 0.1
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
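
With loss_fct set to MSELoss, CosineSimilarityLoss penalizes the squared difference between the cosine similarity of the two embeddings in a pair and the pair's gold score. A minimal pure-Python sketch of that objective follows; the function names and toy vectors are illustrative assumptions, and the library itself operates on batched tensors.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_similarity_loss(pairs: List[Tuple[Vector, Vector]], labels: List[float]) -> float:
    """Mean squared error between cos(u, v) and the gold score of each pair."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Hypothetical embeddings: one identical pair (gold 0.9) and one orthogonal pair (gold 0.1).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [0.9, 0.1]
loss = cosine_similarity_loss(pairs, labels)
print(round(loss, 6))
```

Because the gold scores in this dataset range from 0.1 to 0.9, the objective pulls the similarity of relevant query-title pairs towards 0.9 and that of irrelevant pairs towards 0.1, rather than towards the extremes 1 and 0.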
    

Evaluation Dataset

bps-pub-cosine-pairs

  • Dataset: bps-pub-cosine-pairs at 038a9de
  • Size: 8,067 evaluation samples
  • Columns: query_id, query, corpus_id, title, and score
  • Approximate statistics based on the first 1000 samples:
    • query_id (string): min 4 tokens, mean 5.2 tokens, max 6 tokens
    • query (string): min 4 tokens, mean 12.77 tokens, max 33 tokens
    • corpus_id (string): min 13 tokens, mean 17.25 tokens, max 23 tokens
    • title (string): min 5 tokens, mean 13.37 tokens, max 38 tokens
    • score (float): min 0.1, mean 0.57, max 0.9
  • Samples:
    query_id | query | corpus_id | title | score
    q-1273 | Sosek Desember 2021 | b7890a143bc751d1d84dcf4a | Laporan Bulanan Data Sosial Ekonomi Desember 2021 | 0.9
    q-1273 | sosek desember 2021 | b7890a143bc751d1d84dcf4a | Laporan Bulanan Data Sosial Ekonomi Desember 2021 | 0.9
    q-1273 | SOSEK DESEMBER 2021 | b7890a143bc751d1d84dcf4a | Laporan Bulanan Data Sosial Ekonomi Desember 2021 | 0.9
  • Loss: CosineSimilarityLoss with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 1e-05
  • num_train_epochs: 2
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True
  • label_smoothing_factor: 0.01
  • eval_on_start: True
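
With learning_rate 1e-05, warmup_ratio 0.1, and the linear scheduler listed under All Hyperparameters, the learning rate ramps up over the first 10% of optimizer steps and then decays linearly to zero. The sketch below assumes 2,010 total steps: 64,260 samples at batch size 64 give 1,005 steps per epoch, which is consistent with step 100 landing at epoch 0.0995 in the training logs, over 2 epochs. The function name is hypothetical, and the formula follows the usual linear-with-warmup shape rather than quoting the trainer's code.

```python
import math


def linear_warmup_lr(step: int, total_steps: int, base_lr: float = 1e-5,
                     warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from base_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = math.ceil(64260 / 64)  # assumes a single device and no gradient accumulation
total_steps = steps_per_epoch * 2        # num_train_epochs = 2
for step in (0, 100, 201, 1000, total_steps):
    print(step, linear_warmup_lr(step, total_steps))
```

Under these assumptions the peak rate is reached around step 201, so the first two logged evaluations (steps 100 and 200) fall within the warmup phase.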

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.01
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch | Step | Training Loss | Validation Loss | sts-dev_spearman_cosine | sts-test_spearman_cosine
0 | 0 | - | 0.3848 | 0.8288 | -
0.0995 | 100 | 0.236 | 0.0950 | 0.8396 | -
0.1990 | 200 | 0.0655 | 0.0487 | 0.8452 | -
0.2985 | 300 | 0.0407 | 0.0342 | 0.8437 | -
0.3980 | 400 | 0.0309 | 0.0291 | 0.8427 | -
0.4975 | 500 | 0.0247 | 0.0253 | 0.8427 | -
0.5970 | 600 | 0.0211 | 0.0235 | 0.8427 | -
0.6965 | 700 | 0.0198 | 0.0224 | 0.8395 | -
0.7960 | 800 | 0.0168 | 0.0212 | 0.8405 | -
0.8955 | 900 | 0.0166 | 0.0206 | 0.8384 | -
0.9950 | 1000 | 0.0145 | 0.0195 | 0.8388 | -
1.0945 | 1100 | 0.0119 | 0.0193 | 0.8395 | -
1.1940 | 1200 | 0.0113 | 0.0190 | 0.8376 | -
1.2935 | 1300 | 0.0108 | 0.0189 | 0.8330 | -
1.3930 | 1400 | 0.0119 | 0.0180 | 0.8364 | -
1.4925 | 1500 | 0.0105 | 0.0184 | 0.8338 | -
1.5920 | 1600 | 0.0092 | 0.0180 | 0.8355 | -
1.6915 | 1700 | 0.009 | 0.0182 | 0.8319 | -
1.7910 | 1800 | 0.0096 | 0.0178 | 0.8337 | -
1.8905 | 1900 | 0.0099 | 0.0178 | 0.8326 | -
1.99 | 2000 | 0.0094 | 0.0178 | 0.8335 | -
-1 | -1 | - | - | - | 0.8381
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.0
  • Transformers: 4.48.1
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}