CLAP model fine-tuned on LibriSpeech ASR

This is a sentence-transformers model fine-tuned from laion/clap-htsat-unfused on the librispeech_asr dataset. It maps text and audio to a shared 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search (including text-to-audio retrieval), paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: laion/clap-htsat-unfused
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 512 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modalities: Text, Audio
  • Training Dataset: librispeech_asr
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': 'pooler_output'}, 'audio': {'method': 'get_audio_features', 'method_output_name': 'pooler_output'}}, 'module_output_name': 'sentence_embedding', 'message_format': 'auto', 'architecture': 'ClapModel'})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs")
# Run inference
inputs = [
    'https://huggingface.co/tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs/resolve/main/assets/audio_0.wav',
    'https://huggingface.co/tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs/resolve/main/assets/audio_1.wav',
    'https://huggingface.co/tomaarsen/clap-htsat-unfused-librispeech-5-epochs-128bs/resolve/main/assets/audio_2.wav',
]
embeddings = model.encode(inputs)
print(embeddings.shape)
# [3, 512]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4362, 0.6843],
#         [0.4362, 1.0000, 0.2179],
#         [0.6843, 0.2179, 1.0000]])
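The model.similarity call above defaults to cosine similarity between the embeddings. As a rough NumPy sketch of what that computes (standing in for the library's tensor ops, not its actual implementation):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

# Toy 3 x 4 embedding matrix standing in for the real [3, 512] output
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0, 0.0]])
sims = cosine_similarity_matrix(emb)
print(np.round(sims, 4))
# The diagonal is always 1.0: each embedding is maximally similar to itself,
# matching the diagonal of the tensor shown above.
```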

Evaluation

Metrics

Information Retrieval

Metric librispeech-eval librispeech-test
cosine_accuracy@1 0.245 0.0489
cosine_accuracy@3 0.52 0.1183
cosine_accuracy@5 0.645 0.1691
cosine_accuracy@10 0.785 0.2641
cosine_precision@1 0.245 0.0489
cosine_precision@3 0.1733 0.0394
cosine_precision@5 0.129 0.0338
cosine_precision@10 0.0785 0.0264
cosine_recall@1 0.245 0.0489
cosine_recall@3 0.52 0.1183
cosine_recall@5 0.645 0.1691
cosine_recall@10 0.785 0.2641
cosine_ndcg@10 0.503 0.1402
cosine_mrr@10 0.414 0.1027
cosine_map@100 0.4253 0.1195
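Because each transcript query has exactly one matching audio clip in this evaluation, accuracy@k and recall@k coincide, as the table shows. A minimal sketch of how these metrics fall out of ranked retrieval results (hypothetical ranks, not the actual evaluation code):

```python
def metrics_at_k(rank_of_relevant: list[int], k: int = 10) -> tuple[float, float]:
    """rank_of_relevant[i] is the 1-based rank of query i's single relevant document.

    With one relevant document per query, accuracy@k equals recall@k.
    """
    n = len(rank_of_relevant)
    accuracy = sum(r <= k for r in rank_of_relevant) / n
    mrr = sum(1.0 / r if r <= k else 0.0 for r in rank_of_relevant) / n
    return accuracy, mrr

# Hypothetical ranks for 4 queries: the relevant clip appeared at these positions
ranks = [1, 3, 12, 2]
acc, mrr = metrics_at_k(ranks, k=10)
print(acc, mrr)  # 0.75 0.4583...
```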

Training Details

Training Dataset

librispeech_asr

  • Dataset: librispeech_asr at 71cacbf
  • Size: 28,539 training samples
  • Columns: audio and text
  • Approximate statistics based on the first 1000 samples:
    • audio: type: audio; min: 1.95s, mean: 12.68s, max: 17.21s, sampling_rate: 48000 Hz
    • text: type: string; min: 10 tokens, mean: 64.9 tokens, max: 101 tokens
  • Sample transcripts (the paired audio clips are omitted here):
    CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED
    MARGUERITE TO BE UNABLE TO LIVE APART FROM ME IT WAS THE DAY AFTER THE EVENING WHEN SHE CAME TO SEE ME THAT I SENT HER MANON LESCAUT FROM THAT TIME SEEING THAT I COULD NOT CHANGE MY MISTRESS'S LIFE I CHANGED MY OWN
    I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc",
            "doc_to_query"
        ],
        "partition_mode": "per_direction",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
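MultipleNegativesRankingLoss treats each (audio, transcript) pair as a positive and every other in-batch pair as a negative, applying a cross-entropy over scaled cosine similarities in both directions ("query_to_doc" and "doc_to_query"). A minimal NumPy sketch of that idea (an illustration under those assumptions, not the library's implementation):

```python
import numpy as np

def mnr_loss(a: np.ndarray, b: np.ndarray, scale: float = 20.0) -> float:
    """Symmetric in-batch multiple negatives ranking loss over cosine similarities."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = scale * (a @ b.T)  # [batch, batch]; the diagonal holds the positives

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the two directions: query_to_doc and doc_to_query
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(4, 8))
text_emb = audio_emb + 0.01 * rng.normal(size=(4, 8))  # near-matching pairs -> low loss
print(mnr_loss(audio_emb, text_emb))
```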
    

Evaluation Dataset

librispeech_asr

  • Dataset: librispeech_asr at 71cacbf
  • Size: 200 evaluation samples
  • Columns: audio and text
  • Approximate statistics based on the first 200 samples:
    • audio: type: audio; min: 1.56s, mean: 6.41s, max: 24.03s, sampling_rate: 48000 Hz
    • text: type: string; min: 6 tokens, mean: 36.31 tokens, max: 129 tokens
  • Sample transcripts (the paired audio clips are omitted here):
    HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE
    HE WOULD HAVE TO PAY HER THE MONEY WHICH SHE WOULD NOW REGULARLY DEMAND OR THERE WOULD BE TROUBLE IT DID NOT MATTER WHAT HE DID
    HURSTWOOD WALKED THE FLOOR MENTALLY ARRANGING THE CHIEF POINTS OF HIS SITUATION
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc",
            "doc_to_query"
        ],
        "partition_mode": "per_direction",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 4
  • num_train_epochs: 5
  • learning_rate: 2e-05
  • warmup_steps: 0.1
  • bf16: True
  • eval_strategy: steps
  • per_device_eval_batch_size: 4
  • batch_sampler: no_duplicates
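The non-default values above map roughly onto a SentenceTransformerTrainingArguments call as follows (a sketch, not the exact training script; output_dir is hypothetical, and the fractional warmup_steps is reproduced as logged):

```python
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

args = SentenceTransformerTrainingArguments(
    output_dir="clap-htsat-unfused-librispeech",  # hypothetical output path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-5,
    warmup_steps=0.1,  # as logged above
    bf16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate texts in a batch
)
```

The no_duplicates batch sampler matters for MultipleNegativesRankingLoss: a duplicate text in the batch would be treated as a negative for its own positive pair.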

All Hyperparameters

  • per_device_train_batch_size: 4
  • num_train_epochs: 5
  • max_steps: -1
  • learning_rate: 2e-05
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: steps
  • per_device_eval_batch_size: 4
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss librispeech-eval_cosine_ndcg@10 librispeech-test_cosine_ndcg@10
-1 -1 - - 0.0279 0.0037
0.1001 714 1.4538 1.1503 0.0727 -
0.2001 1428 0.9953 0.8749 0.0841 -
0.3002 2142 0.9557 0.7760 0.1252 -
0.4003 2856 0.9621 2.4026 0.0353 -
0.5004 3570 0.9721 0.9326 0.0720 -
0.6004 4284 0.8931 0.8454 0.0934 -
0.7005 4998 0.8368 0.5494 0.1741 -
0.8006 5712 0.8001 0.4935 0.2170 -
0.9006 6426 0.7817 0.7168 0.1476 -
1.0007 7140 0.7235 0.6410 0.1809 -
1.1008 7854 0.6620 0.6527 0.1726 -
1.2008 8568 0.6492 0.4146 0.2116 -
1.3009 9282 0.6342 0.7536 0.1695 -
1.4010 9996 0.6438 0.6872 0.1873 -
1.5011 10710 0.6103 0.4385 0.2767 -
1.6011 11424 0.6052 0.8028 0.1805 -
1.7012 12138 0.5950 0.3628 0.2891 -
1.8013 12852 0.5672 0.6978 0.2120 -
1.9013 13566 0.5611 0.5946 0.1965 -
2.0014 14280 0.5546 0.2659 0.3589 -
2.1015 14994 0.5133 0.4273 0.2806 -
2.2015 15708 0.4588 0.4356 0.2929 -
2.3016 16422 0.4629 0.5123 0.2538 -
2.4017 17136 0.4429 0.3757 0.3092 -
2.5018 17850 0.5000 0.4237 0.3297 -
2.6018 18564 0.4328 0.5146 0.3291 -
2.7019 19278 0.4284 0.3348 0.3483 -
2.8020 19992 0.4598 0.3768 0.3865 -
2.9020 20706 0.4183 0.3908 0.2594 -
3.0021 21420 0.4180 0.3240 0.3470 -
3.1022 22134 0.3624 0.3487 0.4205 -
3.2022 22848 0.3627 0.3124 0.3650 -
3.3023 23562 0.3651 0.3025 0.3046 -
3.4024 24276 0.3644 0.3708 0.4050 -
3.5025 24990 0.3480 0.3458 0.3998 -
3.6025 25704 0.3542 0.2936 0.4141 -
3.7026 26418 0.2954 0.2692 0.3876 -
3.8027 27132 0.3336 0.2221 0.3915 -
3.9027 27846 0.3255 0.3140 0.4253 -
4.0028 28560 0.3093 0.2278 0.4607 -
4.1029 29274 0.2715 0.3176 0.4261 -
4.2029 29988 0.2812 0.2814 0.4590 -
4.3030 30702 0.2690 0.2390 0.4997 -
4.4031 31416 0.2697 0.2575 0.4720 -
4.5032 32130 0.2616 0.3054 0.4863 -
4.6032 32844 0.2437 0.2467 0.4852 -
4.7033 33558 0.2532 0.2505 0.5196 -
4.8034 34272 0.2640 0.2242 0.4926 -
4.9034 34986 0.2245 0.2345 0.4999 -
-1 -1 - - 0.5030 0.1402

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 2.161 kWh
  • Carbon Emitted: 0.578 kg of CO2
  • Hours Used: 7.59 hours

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 5.4.0.dev0
  • Transformers: 5.3.0.dev0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0.dev0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{günther2024jinaembeddings28192token,
      title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
      author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
      year={2024},
      eprint={2310.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.19923},
}