---
license: mit
library_name: transformers
tags:
- audio grounding
- audio-text retrieval
- sound-event-detection
- multimodal
- clap
pipeline_tag: feature-extraction
---

# FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2604.01155)
[![Hugging Face Model](https://img.shields.io/badge/Model-HuggingFace-yellow?logo=huggingface)](https://huggingface.co/AndreasXi/FineLAP)
[![Hugging Face Dataset](https://img.shields.io/badge/Dataset-HuggingFace-blue?logo=huggingface)](https://huggingface.co/datasets/AndreasXi/FineLAP-100k)

FineLAP is a contrastively pre-trained audio-language model that performs strongly on both clip-level and frame-level audio understanding tasks.

You can use the script below to extract frame-level and clip-level features or to compute audio-text similarity:

```python
import torch
from transformers import AutoModel

# Inputs: B audio files, B captions, and N candidate sound-event phrases
audio_path = ['resources/1.wav', 'resources/2.wav']  # (B,)
caption = ["A woman speaks, dishes clanking, food frying, and music plays", 'A power tool is heard with male speech.']  # (B,)
phrases = ['Speech', 'Dog', 'Cat', 'Frying', 'Dishes', 'Music', 'Vacuum', 'Type', 'Power tool']  # (N,)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("AndreasXi/FineLAP", trust_remote_code=True).to(device)
model.eval()

with torch.no_grad():
    # Clip-level (global) embeddings for text and audio
    global_text_embeds = model.get_global_text_embeds(caption)  # (B, d)
    print(global_text_embeds.shape)
    global_audio_embeds = model.get_global_audio_embeds(audio_path)  # (B, d)
    print(global_audio_embeds.shape)

    # Frame-level (dense) audio embeddings
    dense_audio_embeds = model.get_dense_audio_embeds(audio_path)  # (B, T, d)
    print(dense_audio_embeds.shape)

    # Audio-caption similarity at clip level
    clip_scores = model.get_clip_level_score(audio_path, caption)  # (B, B)
    print(clip_scores.shape)

    # Audio-phrase similarity for every frame
    frame_scores = model.get_frame_level_score(audio_path, phrases)  # (B, N, T)
    print(frame_scores.shape)

    ## (Optional) Plot frame-level similarity; only supports a single audio file
    model.plot_frame_level_score(audio_path[1], phrases, output_path="output/output_plot.png")
```
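
The frame-level scores above have shape `(B, N, T)`, so each phrase gets one score per frame. For sound event detection, a common post-processing step is to threshold those scores and merge consecutive active frames into onset/offset segments. The sketch below illustrates that idea in plain Python. It is not part of FineLAP: the function name `scores_to_segments`, the `threshold` default of 0.5, and the `frame_seconds` default of 0.04 are all hypothetical, since this page does not state the score range or the frame duration of the model.

```python
from typing import List, Tuple


def scores_to_segments(
    scores: List[float],
    threshold: float = 0.5,       # assumed decision threshold, not taken from FineLAP
    frame_seconds: float = 0.04,  # assumed frame duration, not taken from FineLAP
) -> List[Tuple[float, float]]:
    """Merge consecutive frames with score >= threshold into (onset, offset) pairs in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, score in enumerate(scores):
        active = score >= threshold
        if active and start is None:
            start = i  # a new segment begins at this frame
        elif not active and start is not None:
            # the segment ended on the previous frame; offset is the start of frame i
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # the last segment runs to the end of the clip
        segments.append((start * frame_seconds, len(scores) * frame_seconds))
    return segments


# Toy per-frame scores for a single phrase (illustrative values, not model output)
toy_scores = [0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.6]
segments = scores_to_segments(toy_scores, threshold=0.5, frame_seconds=1.0)
print(segments)  # [(1.0, 4.0), (6.0, 8.0)]
```

With real model output, the per-frame list for audio `b` and phrase `n` could be obtained with `frame_scores[b, n].tolist()`, assuming `frame_scores` is a tensor. The threshold and frame duration would need to be chosen to match the model's actual score range and frame rate.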