update readme

Files changed (9)

.gitattributes +2 -0
README.md +15 -0
resources/2.wav +3 -0
resources/performance_classification_results.png +3 -0
resources/performance_main_results.png +3 -0
resources/performance_sed_results.png +3 -0
resources/radar_performance.png +3 -0
resources/sed_result_Y5J603SAj7QM_210.000_220.000.png +3 -0
resources/training.md +83 -0

.gitattributes CHANGED

@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 resources/1.wav filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -12,8 +12,23 @@ pipeline_tag: feature-extraction
 # FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
+  [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2604.01155)
+  [![Hugging Face Model](https://img.shields.io/badge/Model-HuggingFace-yellow?logo=huggingface)](https://huggingface.co/AndreasXi/FineLAP)
+  [![Hugging Face Dataset](https://img.shields.io/badge/Dataset-HuggingFace-blue?logo=huggingface)](https://huggingface.co/datasets/AndreasXi/FineLAP-100k)
 FineLAP is a strong contrastively pre-trained audio-language model that excels in both clip- and frame-level audio understanding tasks
+<div align="center">
+  <img src="resources/radar_performance.png" alt="Radar performance" width="46%">
+  <img src="resources/sed_result_Y5J603SAj7QM_210.000_220.000.png" alt="SED result" width="50.5%">
+</div>
+<br>
+You can use the script below to extract frame- and clip-level features or calculate similarity:
 ```python
 import torch
 from transformers import AutoModel

resources/2.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:71ea85026e6487ee3839cd0b1185c4e95d165a8d21f2c7a7b288b13d625dc5a1
+size 562391

resources/performance_classification_results.png ADDED

Git LFS Details

SHA256: e7e454f38da058220eca5f6901ab5653e1679d3e9f7f08dc0e1a18da0da71bd3
Pointer size: 130 Bytes
Size of remote file: 73.9 kB

resources/performance_main_results.png ADDED

Git LFS Details

SHA256: aa931d6d44d5cc6f45775001cf1236fa0c963fd8e901a48fc57b083a49023353
Pointer size: 131 Bytes
Size of remote file: 145 kB

resources/performance_sed_results.png ADDED

Git LFS Details

SHA256: 99a015793b20445454ce41813f719ae28baaee541138438b793782580a25a1e2
Pointer size: 130 Bytes
Size of remote file: 70.7 kB

resources/radar_performance.png ADDED

Git LFS Details

SHA256: 636fc97c72e59152b34e96b2da7cbb96368478c659e77889d571003b90d3ed39
Pointer size: 131 Bytes
Size of remote file: 878 kB

resources/sed_result_Y5J603SAj7QM_210.000_220.000.png ADDED

Git LFS Details

SHA256: dac080f591f266b2a70085def5b2f230bbd49b63c8aa6f378829f704b1663eaf
Pointer size: 132 Bytes
Size of remote file: 2.14 MB

resources/training.md ADDED

	@@ -0,0 +1,83 @@

+# FineLAP Training & Fine-tuning
+Before training, make sure that all files from [AndreasXi/FineLAP_Pytorch](https://huggingface.co/AndreasXi/FineLAP_Pytorch) have been downloaded to `./weights/`.
+## Environment Setup
+```bash
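+conda create -n finelap python=3.9
+conda activate finelap
+git clone https://github.com/facebookresearch/fairseq.git
+pip install "pip<24.1" -U
+cd fairseq && pip install -e ./ && cd ..
+# Assumes requirements_train.txt sits in the directory you started from
+pip install -r requirements_train.txt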
+```
+## Data Setup
+To train FineLAP, we format the data in a JSONL structure as follows:
+```json
+{
+  "audio_id": "Ycq6bqC_AsO4.flac",
+  "audio_path": "path/to/audio.wav",
+  "caption": "Birds are chirping with background noise.",
+  "phrases": [
+    {
+      "phrase": "Background noise",
+      "segments": [
+        [0.498, 10.0]
+      ]
+    },
+    {
+      "phrase": "Bird vocalization, bird call, bird song",
+      "segments": [
+        [0.629, 4.114],
+        [4.313, 10.0]
+      ]
+    }
+  ]
+}
+```
+Each entry contains:
+- `audio_id`: Unique identifier of the audio sample.
+- `audio_path`: Path to the audio file.
+- `caption`: A clip-level description of the audio content.
+- `phrases` (optional): A list of sound events, where each includes:
+  - `phrase`: Textual phrase describing the event.
+  - `segments`: Time intervals (in seconds) indicating when the event occurs.
+For data without frame-level annotations, the `phrases` field can be omitted. The dataset detects this automatically and skips the frame-level loss for those samples.
+An example training metadata file with 10 samples is provided at `data/training_metadata_example.jsonl`.
+The current training pipeline uses the phrase bank `data/phrase_bank_new_with_FSDLabel_UrbanSED.jsonl`.
+Once the dataset metadata JSONL is ready, include it in the `train_data_args.metadata_files` list defined in `config/data_config/data_eat.yaml` or `config/data_config/data_htsat.yaml`.
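A minimal sketch, not part of this commit, of how a metadata file in the format above could be sanity-checked before it is added to `metadata_files`. The helper names `validate_entry` and `parse_metadata` and the `has_frame_labels` key are hypothetical, and the checks (string fields, `start < end`) are assumptions drawn from the field descriptions rather than from the FineLAP data loader. It also assumes the usual JSONL convention of one JSON object per line, whereas the example above is pretty-printed for readability.

```python
import json

REQUIRED_KEYS = ("audio_id", "audio_path", "caption")


def validate_entry(entry):
    """Check one metadata entry; return True if it has frame-level annotations."""
    for key in REQUIRED_KEYS:
        if not isinstance(entry.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    phrases = entry.get("phrases")
    if not phrases:
        # Clip-level sample: `phrases` is optional per the format description.
        return False
    for item in phrases:
        if not isinstance(item.get("phrase"), str):
            raise ValueError("each phrases item needs a 'phrase' string")
        for start, end in item.get("segments", []):
            # Assumption: segments are [start, end] in seconds with start < end.
            if not 0 <= start < end:
                raise ValueError(f"bad segment [{start}, {end}]")
    return True


def parse_metadata(lines):
    """Parse JSONL lines (one JSON object per line), skipping blank lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Hypothetical flag marking whether a frame-level loss could apply.
        entry["has_frame_labels"] = validate_entry(entry)
        entries.append(entry)
    return entries
```

Because `parse_metadata` accepts any iterable of lines, an open file handle for `data/training_metadata_example.jsonl` can be passed to it directly.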
+## Start Training
+To start training, run:
+```bash
+bash scripts/train.sh
+```
+This uses the config `config/finelap_eat_config.yaml`. The output is saved in `exps/${exp_name}`.
+## Fine-tuning From a FineLAP Checkpoint
+The training code now supports loading an existing FineLAP checkpoint before training starts. This is useful when you want to fine-tune from a previously trained model such as `weights/finelap_fixed.pt`.
+In `config/finelap_eat_config.yaml`, set:
+```yaml
+model_args:
+  ckpt_path: './weights/finelap_fixed.pt'
+```
+If `ckpt_path` is an empty string:
+```yaml
+model_args:
+  ckpt_path: ''
+```
+then no FineLAP checkpoint will be loaded, and training will start from the encoder initialization defined by `audio_encoder_ckpt` and `text_encoder_ckpt`.
+This fine-tuning path loads model weights only. It does not restore the optimizer state or resume the previous epoch count.
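The `ckpt_path` behaviour described above can be summarised as a small decision function. This is an illustrative sketch only, not the FineLAP training code: `plan_initialization` and the keys of the returned dictionary are hypothetical, while `ckpt_path`, `audio_encoder_ckpt`, and `text_encoder_ckpt` are the config names mentioned in the text.

```python
def plan_initialization(model_args):
    """Describe where initial weights would come from for a `model_args` mapping."""
    ckpt_path = model_args.get("ckpt_path") or ""
    if ckpt_path:
        # Non-empty ckpt_path: start from an existing FineLAP checkpoint.
        return {
            "source": "finelap_checkpoint",
            "weights": [ckpt_path],
            # Per the text: weights only, no optimizer state or epoch count.
            "restore_optimizer": False,
            "resume_epoch": False,
        }
    # Empty ckpt_path: fall back to the encoder initialization checkpoints.
    return {
        "source": "encoder_init",
        "weights": [
            model_args.get("audio_encoder_ckpt"),
            model_args.get("text_encoder_ckpt"),
        ],
        "restore_optimizer": False,
        "resume_epoch": False,
    }
```

In both branches the optimizer and epoch counter start fresh, which is why this path suits fine-tuning rather than resuming an interrupted run.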