# FineLAP Training & Fine-tuning

Before training, make sure that all files from here have been downloaded to `./weights/`.
## Environment Setup

```bash
conda create -n finelap python=3.9
conda activate finelap
git clone https://github.com/facebookresearch/fairseq.git
pip install "pip<24.1" -U; cd fairseq; pip install -e ./
pip install -r requirements_train.txt
```
## Data Setup

To train FineLAP, we format the data in a JSONL structure as follows:

```json
{
  "audio_id": "Ycq6bqC_AsO4.flac",
  "audio_path": "path/to/audio.wav",
  "caption": "Birds are chirping with background noise.",
  "phrases": [
    {
      "phrase": "Background noise",
      "segments": [
        [0.498, 10.0]
      ]
    },
    {
      "phrase": "Bird vocalization, bird call, bird song",
      "segments": [
        [0.629, 4.114],
        [4.313, 10.0]
      ]
    }
  ]
}
```
Each entry contains:

- `audio_id`: Unique identifier of the audio sample.
- `audio_path`: Path to the audio file.
- `caption`: A clip-level description of the audio content.
- `phrases` (optional): A list of sound events, where each includes:
  - `phrase`: Textual phrase describing the event.
  - `segments`: Time intervals (in seconds) indicating when the event occurs.
For data without frame-level annotations, the `phrases` field can be omitted. The dataset will automatically detect this and skip the frame-level loss for such samples.
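To make the metadata format concrete, here is a minimal sketch that parses one such entry and turns its `segments` into per-frame targets. The 25 fps frame rate and the helper name `segments_to_frame_mask` are illustrative assumptions, not the repo's actual code:

```python
import json

# One entry in the JSONL format described above (taken from the example).
entry = json.loads("""{
  "audio_id": "Ycq6bqC_AsO4.flac",
  "audio_path": "path/to/audio.wav",
  "caption": "Birds are chirping with background noise.",
  "phrases": [
    {"phrase": "Background noise", "segments": [[0.498, 10.0]]},
    {"phrase": "Bird vocalization, bird call, bird song",
     "segments": [[0.629, 4.114], [4.313, 10.0]]}
  ]
}""")

def segments_to_frame_mask(segments, duration, fps=25):
    """Convert [start, end] second intervals into a binary per-frame mask.

    The 25 fps frame rate is an illustrative assumption, not FineLAP's
    actual temporal resolution.
    """
    n_frames = int(round(duration * fps))
    mask = [0] * n_frames
    for start, end in segments:
        lo = int(start * fps)
        hi = min(int(round(end * fps)), n_frames)
        for i in range(lo, hi):
            mask[i] = 1
    return mask

# Samples without "phrases" would simply skip frame-level targets.
for phrase in entry.get("phrases", []):
    mask = segments_to_frame_mask(phrase["segments"], duration=10.0)
```

An actual loader would read one `json.loads` per line of the JSONL file; the dict-access pattern stays the same.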
An example training metadata file with 10 samples is provided at `data/training_metadata_example.jsonl`.

The current training pipeline uses the phrase bank `data/phrase_bank_new_with_FSDLabel_UrbanSED.jsonl`.
Once the dataset metadata JSONL is ready, include it in the `train_data_args.metadata_files` list defined in `config/data_config/data_eat.yaml` or `config/data_config/data_htsat.yaml`.
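For example, the relevant section of `config/data_config/data_eat.yaml` might look like this; the exact indentation and the second file name are assumptions (the placeholder stands in for your own metadata file):

```yaml
train_data_args:
  metadata_files:
    - data/training_metadata_example.jsonl   # provided example
    - data/my_dataset_metadata.jsonl         # hypothetical: your new dataset
```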
## Start Training

Run

```bash
bash scripts/train.sh
```

to start training. This will use the config `config/finelap_eat_config.yaml`. The output will be saved in `exps/${exp_name}`.
## Fine-tuning From a FineLAP Checkpoint

The training code supports loading an existing FineLAP checkpoint before training starts. This is useful when you want to fine-tune from a previously trained model such as `weights/finelap_fixed.pt`.
In `config/finelap_eat_config.yaml`, set:

```yaml
model_args:
  ckpt_path: './weights/finelap_fixed.pt'
```
If `ckpt_path` is an empty string:

```yaml
model_args:
  ckpt_path: ''
```

then no FineLAP checkpoint will be loaded, and training will start from the encoder initialization defined by `audio_encoder_ckpt` and `text_encoder_ckpt`.
This fine-tuning path loads model weights only; it does not restore the optimizer state or resume the previous epoch count.
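The weights-only behavior can be sketched in plain Python. Both `load_for_finetune` and the checkpoint layout (a dict holding `model`, `optimizer`, and `epoch` entries) are illustrative assumptions, not the repo's actual code:

```python
def load_for_finetune(model_state, ckpt):
    """Copy model weights from a checkpoint, ignoring optimizer/epoch state.

    Returns the updated weights and the starting epoch (always 0, since
    this path does not resume the previous training run).
    """
    if not ckpt:  # ckpt_path == '' -> nothing to load
        return model_state, 0
    model_state.update(ckpt["model"])
    # Optimizer state and epoch counter are deliberately NOT restored.
    return model_state, 0

# Usage: fresh model weights plus a previously saved checkpoint.
fresh = {"encoder.weight": [0.0], "head.weight": [0.0]}
ckpt = {
    "model": {"encoder.weight": [1.5], "head.weight": [-0.3]},
    "optimizer": {"step": 12000},
    "epoch": 7,
}

weights, start_epoch = load_for_finetune(fresh, ckpt)
```

In the real pipeline the same idea applies to `torch` state dicts: only the model's parameters are loaded from `ckpt_path`, and a new optimizer is built from the training config.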