AndreasXi committed
Commit ef0907b · 1 Parent(s): fd006ca

update readme
.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 resources/1.wav filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -12,8 +12,23 @@ pipeline_tag: feature-extraction
 
 
 # FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
+
+[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2604.01155)
+[![Hugging Face Model](https://img.shields.io/badge/Model-HuggingFace-yellow?logo=huggingface)](https://huggingface.co/AndreasXi/FineLAP)
+[![Hugging Face Dataset](https://img.shields.io/badge/Dataset-HuggingFace-blue?logo=huggingface)](https://huggingface.co/datasets/AndreasXi/FineLAP-100k)
+
 FineLAP is a strong contrastively pre-trained audio-language model that excels in both clip- and frame-level audio understanding tasks
 
+
+<div align="center">
+  <img src="resources/radar_performance.png" alt="Radar performance" width="46%">
+  <img src="resources/sed_result_Y5J603SAj7QM_210.000_220.000.png" alt="SED result" width="50.5%">
+</div>
+
+
+<br>
+You can use the script below to extract frame- and clip-level features or calculate similarity:
+
 ```python
 import torch
 from transformers import AutoModel
resources/2.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:71ea85026e6487ee3839cd0b1185c4e95d165a8d21f2c7a7b288b13d625dc5a1
+size 562391
resources/performance_classification_results.png ADDED

Git LFS Details

  • SHA256: e7e454f38da058220eca5f6901ab5653e1679d3e9f7f08dc0e1a18da0da71bd3
  • Pointer size: 130 Bytes
  • Size of remote file: 73.9 kB
resources/performance_main_results.png ADDED

Git LFS Details

  • SHA256: aa931d6d44d5cc6f45775001cf1236fa0c963fd8e901a48fc57b083a49023353
  • Pointer size: 131 Bytes
  • Size of remote file: 145 kB
resources/performance_sed_results.png ADDED

Git LFS Details

  • SHA256: 99a015793b20445454ce41813f719ae28baaee541138438b793782580a25a1e2
  • Pointer size: 130 Bytes
  • Size of remote file: 70.7 kB
resources/radar_performance.png ADDED

Git LFS Details

  • SHA256: 636fc97c72e59152b34e96b2da7cbb96368478c659e77889d571003b90d3ed39
  • Pointer size: 131 Bytes
  • Size of remote file: 878 kB
resources/sed_result_Y5J603SAj7QM_210.000_220.000.png ADDED

Git LFS Details

  • SHA256: dac080f591f266b2a70085def5b2f230bbd49b63c8aa6f378829f704b1663eaf
  • Pointer size: 132 Bytes
  • Size of remote file: 2.14 MB
resources/training.md ADDED
@@ -0,0 +1,83 @@
+# FineLAP Training & Fine-tuning
+
+Before training, make sure that all files from [here](https://huggingface.co/AndreasXi/FineLAP_Pytorch) have been downloaded to `./weights/`.
+
+## Environment Setup
+```bash
+conda create -n finelap python=3.9
+conda activate finelap
+
+git clone https://github.com/facebookresearch/fairseq.git
+pip install "pip<24.1" -U; cd fairseq; pip install -e ./
+
+pip install -r requirements_train.txt
+```
+
+## Data Setup
+To train FineLAP, we format the data in a JSONL structure as follows:
+
+```json
+{
+  "audio_id": "Ycq6bqC_AsO4.flac",
+  "audio_path": "path/to/audio.wav",
+  "caption": "Birds are chirping with background noise.",
+  "phrases": [
+    {
+      "phrase": "Background noise",
+      "segments": [
+        [0.498, 10.0]
+      ]
+    },
+    {
+      "phrase": "Bird vocalization, bird call, bird song",
+      "segments": [
+        [0.629, 4.114],
+        [4.313, 10.0]
+      ]
+    }
+  ]
+}
+```
+
+Each entry contains:
+
+- `audio_id`: Unique identifier of the audio sample.
+- `audio_path`: Path to the audio file.
+- `caption`: A clip-level description of the audio content.
+- `phrases` (optional): A list of sound events, where each includes:
+  - `phrase`: Textual phrase describing the event
+  - `segments`: Time intervals (in seconds) indicating when the event occurs
+
+For data without frame-level annotations, the `phrases` field can be omitted. The dataset will automatically detect this and skip the frame-level loss for such samples.
+An example training metadata file with 10 samples is provided at `data/training_metadata_example.jsonl`.
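For illustration, here is a minimal sketch of how such an entry can be parsed and its second-level `segments` rasterized into a frame-level target mask. This is not code from the repository; the helper name and the frame rate of 25 frames per second are assumptions made for the example.

```python
import json

# Hypothetical helper: rasterize [start, end] segments (in seconds)
# into a binary frame-level mask at a given frame rate.
def segments_to_frame_mask(segments, duration, frames_per_sec=25):
    num_frames = int(round(duration * frames_per_sec))
    mask = [0] * num_frames
    for start, end in segments:
        first = int(start * frames_per_sec)
        last = min(num_frames, int(round(end * frames_per_sec)))
        for i in range(first, last):
            mask[i] = 1
    return mask

# One metadata entry in the JSONL format described above.
entry = json.loads("""
{"audio_id": "Ycq6bqC_AsO4.flac",
 "audio_path": "path/to/audio.wav",
 "caption": "Birds are chirping with background noise.",
 "phrases": [{"phrase": "Background noise", "segments": [[0.498, 10.0]]}]}
""")

# Entries without "phrases" fall back to clip-level supervision only.
phrases = entry.get("phrases", [])
masks = {p["phrase"]: segments_to_frame_mask(p["segments"], duration=10.0)
         for p in phrases}
print(len(masks["Background noise"]))  # 250 frames for a 10 s clip at 25 fps
```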
+
+The current training pipeline uses the phrase bank `data/phrase_bank_new_with_FSDLabel_UrbanSED.jsonl`.
+
+Once the dataset metadata JSONL is ready, include it in the `train_data_args.metadata_files` list defined in `config/data_config/data_eat.yaml` or `config/data_config/data_htsat.yaml`.
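As a sketch, that entry might look like the following. Only `train_data_args.metadata_files` is named in the text above; the placement within the file and the paths are placeholders.

```yaml
# Hypothetical fragment of config/data_config/data_eat.yaml
train_data_args:
  metadata_files:
    - data/training_metadata_example.jsonl
    - path/to/your_dataset.jsonl
```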
+
+## Start Training
+Run
+```bash
+bash scripts/train.sh
+```
+to start training. This will use the config `config/finelap_eat_config.yaml`. The output will be saved in `exps/${exp_name}`.
+
+## Fine-tuning From a FineLAP Checkpoint
+The training code now supports loading an existing FineLAP checkpoint before training starts. This is useful when you want to fine-tune from a previously trained model such as `weights/finelap_fixed.pt`.
+
+In `config/finelap_eat_config.yaml`, set:
+
+```yaml
+model_args:
+  ckpt_path: './weights/finelap_fixed.pt'
+```
+
+If `ckpt_path` is an empty string:
+
+```yaml
+model_args:
+  ckpt_path: ''
+```
+
+then no FineLAP checkpoint will be loaded, and training will start from the encoder initialization defined by `audio_encoder_ckpt` and `text_encoder_ckpt`.
+
+This fine-tuning path loads model weights only. It does not restore the optimizer state or resume the previous epoch count.
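The checkpoint-selection rule described above can be sketched as follows. This is illustrative pseudologic, not the repository's actual code; the function name and the shape of the returned dict are assumptions, and only the config keys (`ckpt_path`, `audio_encoder_ckpt`, `text_encoder_ckpt`) come from the text.

```python
# Illustrative sketch of the checkpoint-selection rule described above.
# resolve_init and the returned dict are hypothetical, not repo code.
def resolve_init(model_args):
    ckpt_path = model_args.get("ckpt_path", "")
    if ckpt_path:
        # Full FineLAP checkpoint: model weights only -- the optimizer
        # state and epoch counter are NOT restored from it.
        return {"weights": ckpt_path,
                "optimizer_state": None,
                "start_epoch": 0}
    # Empty ckpt_path: fall back to the separate encoder initializations.
    return {"weights": (model_args["audio_encoder_ckpt"],
                        model_args["text_encoder_ckpt"]),
            "optimizer_state": None,
            "start_epoch": 0}

print(resolve_init({"ckpt_path": "./weights/finelap_fixed.pt"})["weights"])
# ./weights/finelap_fixed.pt
```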