CLAP model fine-tuned on LibriSpeech ASR

This is a sentence-transformers model fine-tuned from laion/clap-htsat-unfused on the librispeech_asr dataset. It maps text and audio to a shared 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search (including text-to-audio retrieval), paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: laion/clap-htsat-unfused
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 512 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modalities: Text, Audio
  • Training Dataset: librispeech_asr
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': 'pooler_output'}, 'audio': {'method': 'get_audio_features', 'method_output_name': 'pooler_output'}}, 'module_output_name': 'sentence_embedding', 'message_format': 'auto', 'architecture': 'ClapModel'})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs")
# Run inference
inputs = [
    'https://huggingface.co/tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs/resolve/main/assets/audio_0.wav',
    'https://huggingface.co/tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs/resolve/main/assets/audio_1.wav',
    'https://huggingface.co/tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs/resolve/main/assets/audio_2.wav',
]
embeddings = model.encode(inputs)
print(embeddings.shape)
# [3, 512]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4362, 0.6843],
#         [0.4362, 1.0000, 0.2179],
#         [0.6843, 0.2179, 1.0000]])
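The model.similarity call above defaults to cosine similarity between the embeddings. As a rough NumPy sketch of what that computes (standing in for the library's tensor ops, not its actual implementation):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

# Toy 3 x 4 embedding matrix standing in for the real [3, 512] output
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0, 0.0]])
sims = cosine_similarity_matrix(emb)
print(np.round(sims, 4))
# The diagonal is always 1.0: each embedding is maximally similar to itself,
# matching the diagonal of the tensor shown above.
```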

Evaluation

Metrics

Information Retrieval

Metric librispeech-eval librispeech-test
cosine_accuracy@1 0.245 0.0489
cosine_accuracy@3 0.52 0.1183
cosine_accuracy@5 0.645 0.1691
cosine_accuracy@10 0.785 0.2641
cosine_precision@1 0.245 0.0489
cosine_precision@3 0.1733 0.0394
cosine_precision@5 0.129 0.0338
cosine_precision@10 0.0785 0.0264
cosine_recall@1 0.245 0.0489
cosine_recall@3 0.52 0.1183
cosine_recall@5 0.645 0.1691
cosine_recall@10 0.785 0.2641
cosine_ndcg@10 0.503 0.1402
cosine_mrr@10 0.414 0.1027
cosine_map@100 0.4253 0.1195
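Because each transcript query has exactly one matching audio clip in this evaluation, accuracy@k and recall@k coincide, as the table shows. A minimal sketch of how these metrics fall out of ranked retrieval results (hypothetical ranks, not the actual evaluation code):

```python
def metrics_at_k(rank_of_relevant: list[int], k: int = 10) -> tuple[float, float]:
    """rank_of_relevant[i] is the 1-based rank of query i's single relevant document.

    With one relevant document per query, accuracy@k equals recall@k.
    """
    n = len(rank_of_relevant)
    accuracy = sum(r <= k for r in rank_of_relevant) / n
    mrr = sum(1.0 / r if r <= k else 0.0 for r in rank_of_relevant) / n
    return accuracy, mrr

# Hypothetical ranks for 4 queries: the relevant clip appeared at these positions
ranks = [1, 3, 12, 2]
acc, mrr = metrics_at_k(ranks, k=10)
print(acc, mrr)  # 0.75 0.4583...
```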

Training Details

Training Dataset

librispeech_asr

  • Dataset: librispeech_asr at 71cacbf
  • Size: 28,539 training samples
  • Columns: audio and text
  • Approximate statistics based on the first 1000 samples:
    • audio: type: audio; min: 1.95s, mean: 12.68s, max: 17.21s, sampling_rate: 48000 Hz
    • text: type: string; min: 10 tokens, mean: 64.9 tokens, max: 101 tokens
  • Sample transcripts (the paired audio clips are omitted here):
    CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED
    MARGUERITE TO BE UNABLE TO LIVE APART FROM ME IT WAS THE DAY AFTER THE EVENING WHEN SHE CAME TO SEE ME THAT I SENT HER MANON LESCAUT FROM THAT TIME SEEING THAT I COULD NOT CHANGE MY MISTRESS'S LIFE I CHANGED MY OWN
    I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc",
            "doc_to_query"
        ],
        "partition_mode": "per_direction",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
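MultipleNegativesRankingLoss treats each (audio, transcript) pair as a positive and every other in-batch pair as a negative, applying a cross-entropy over scaled cosine similarities in both directions ("query_to_doc" and "doc_to_query"). A minimal NumPy sketch of that idea (an illustration under those assumptions, not the library's implementation):

```python
import numpy as np

def mnr_loss(a: np.ndarray, b: np.ndarray, scale: float = 20.0) -> float:
    """Symmetric in-batch multiple negatives ranking loss over cosine similarities."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = scale * (a @ b.T)  # [batch, batch]; the diagonal holds the positives

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the two directions: query_to_doc and doc_to_query
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(4, 8))
text_emb = audio_emb + 0.01 * rng.normal(size=(4, 8))  # near-matching pairs -> low loss
print(mnr_loss(audio_emb, text_emb))
```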
    

Evaluation Dataset

librispeech_asr

  • Dataset: librispeech_asr at 71cacbf
  • Size: 200 evaluation samples
  • Columns: audio and text
  • Approximate statistics based on the first 200 samples:
    • audio: type: audio; min: 1.56s, mean: 6.41s, max: 24.03s, sampling_rate: 48000 Hz
    • text: type: string; min: 6 tokens, mean: 36.31 tokens, max: 129 tokens
  • Sample transcripts (the paired audio clips are omitted here):
    HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE
    HE WOULD HAVE TO PAY HER THE MONEY WHICH SHE WOULD NOW REGULARLY DEMAND OR THERE WOULD BE TROUBLE IT DID NOT MATTER WHAT HE DID
    HURSTWOOD WALKED THE FLOOR MENTALLY ARRANGING THE CHIEF POINTS OF HIS SITUATION
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc",
            "doc_to_query"
        ],
        "partition_mode": "per_direction",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 4
  • num_train_epochs: 5
  • learning_rate: 2e-05
  • warmup_steps: 0.1
  • bf16: True
  • eval_strategy: steps
  • per_device_eval_batch_size: 4
  • batch_sampler: no_duplicates
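The non-default values above map roughly onto a SentenceTransformerTrainingArguments call as follows (a sketch, not the exact training script; output_dir is hypothetical, and the fractional warmup_steps is reproduced as logged):

```python
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

args = SentenceTransformerTrainingArguments(
    output_dir="clap-htsat-unfused-librispeech",  # hypothetical output path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-5,
    warmup_steps=0.1,  # as logged above
    bf16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate texts in a batch
)
```

The no_duplicates batch sampler matters for MultipleNegativesRankingLoss: a duplicate text in the batch would be treated as a negative for its own positive pair.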

All Hyperparameters

  • per_device_train_batch_size: 4
  • num_train_epochs: 5
  • max_steps: -1
  • learning_rate: 2e-05
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: steps
  • per_device_eval_batch_size: 4
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss librispeech-eval_cosine_ndcg@10 librispeech-test_cosine_ndcg@10
-1 -1 - - 0.0279 0.0037
0.1001 714 1.4538 1.1503 0.0727 -
0.2001 1428 0.9953 0.8749 0.0841 -
0.3002 2142 0.9557 0.7760 0.1252 -
0.4003 2856 0.9621 2.4026 0.0353 -
0.5004 3570 0.9721 0.9326 0.0720 -
0.6004 4284 0.8931 0.8454 0.0934 -
0.7005 4998 0.8368 0.5494 0.1741 -
0.8006 5712 0.8001 0.4935 0.2170 -
0.9006 6426 0.7817 0.7168 0.1476 -
1.0007 7140 0.7235 0.6410 0.1809 -
1.1008 7854 0.6620 0.6527 0.1726 -
1.2008 8568 0.6492 0.4146 0.2116 -
1.3009 9282 0.6342 0.7536 0.1695 -
1.4010 9996 0.6438 0.6872 0.1873 -
1.5011 10710 0.6103 0.4385 0.2767 -
1.6011 11424 0.6052 0.8028 0.1805 -
1.7012 12138 0.5950 0.3628 0.2891 -
1.8013 12852 0.5672 0.6978 0.2120 -
1.9013 13566 0.5611 0.5946 0.1965 -
2.0014 14280 0.5546 0.2659 0.3589 -
2.1015 14994 0.5133 0.4273 0.2806 -
2.2015 15708 0.4588 0.4356 0.2929 -
2.3016 16422 0.4629 0.5123 0.2538 -
2.4017 17136 0.4429 0.3757 0.3092 -
2.5018 17850 0.5000 0.4237 0.3297 -
2.6018 18564 0.4328 0.5146 0.3291 -
2.7019 19278 0.4284 0.3348 0.3483 -
2.8020 19992 0.4598 0.3768 0.3865 -
2.9020 20706 0.4183 0.3908 0.2594 -
3.0021 21420 0.4180 0.3240 0.3470 -
3.1022 22134 0.3624 0.3487 0.4205 -
3.2022 22848 0.3627 0.3124 0.3650 -
3.3023 23562 0.3651 0.3025 0.3046 -
3.4024 24276 0.3644 0.3708 0.4050 -
3.5025 24990 0.3480 0.3458 0.3998 -
3.6025 25704 0.3542 0.2936 0.4141 -
3.7026 26418 0.2954 0.2692 0.3876 -
3.8027 27132 0.3336 0.2221 0.3915 -
3.9027 27846 0.3255 0.3140 0.4253 -
4.0028 28560 0.3093 0.2278 0.4607 -
4.1029 29274 0.2715 0.3176 0.4261 -
4.2029 29988 0.2812 0.2814 0.4590 -
4.3030 30702 0.2690 0.2390 0.4997 -
4.4031 31416 0.2697 0.2575 0.4720 -
4.5032 32130 0.2616 0.3054 0.4863 -
4.6032 32844 0.2437 0.2467 0.4852 -
4.7033 33558 0.2532 0.2505 0.5196 -
4.8034 34272 0.2640 0.2242 0.4926 -
4.9034 34986 0.2245 0.2345 0.4999 -
-1 -1 - - 0.5030 0.1402

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 2.161 kWh
  • Carbon Emitted: 0.578 kg of CO2
  • Hours Used: 7.59 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 5.4.0.dev0
  • Transformers: 5.3.0.dev0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0.dev0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{günther2024jinaembeddings28192token,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2024},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.19923},
}