# FineLAP Training & Fine-tuning

Before training, make sure that all files from here have been downloaded to `./weights/`.
## Environment Setup

```bash
conda create -n finelap python=3.9
conda activate finelap
git clone https://github.com/facebookresearch/fairseq.git
pip install "pip<24.1" -U; cd fairseq; pip install -e ./
pip install -r requirements_train.txt
```
## Data Setup

To train FineLAP, we format the data in a JSONL structure as follows:

```json
{
  "audio_id": "Ycq6bqC_AsO4.flac",
  "audio_path": "path/to/audio.wav",
  "caption": "Birds are chirping with background noise.",
  "phrases": [
    {
      "phrase": "Background noise",
      "segments": [
        [0.498, 10.0]
      ]
    },
    {
      "phrase": "Bird vocalization, bird call, bird song",
      "segments": [
        [0.629, 4.114],
        [4.313, 10.0]
      ]
    }
  ]
}
```
Each entry contains:

- `audio_id`: Unique identifier of the audio sample.
- `audio_path`: Path to the audio file.
- `caption`: A clip-level description of the audio content.
- `phrases` (optional): A list of sound events, where each includes:
  - `phrase`: Textual phrase describing the event.
  - `segments`: Time intervals (in seconds) indicating when the event occurs.
For data without frame-level annotations, the `phrases` field can be omitted. The dataset will automatically detect this and skip the frame-level loss for such samples.
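To make the metadata format concrete, here is a minimal sketch that parses one such entry and turns its `segments` into per-frame targets. The 25 fps frame rate and the helper name `segments_to_frame_mask` are illustrative assumptions, not the repo's actual code:

```python
import json

# One entry in the JSONL format described above (taken from the example).
entry = json.loads("""{
  "audio_id": "Ycq6bqC_AsO4.flac",
  "audio_path": "path/to/audio.wav",
  "caption": "Birds are chirping with background noise.",
  "phrases": [
    {"phrase": "Background noise", "segments": [[0.498, 10.0]]},
    {"phrase": "Bird vocalization, bird call, bird song",
     "segments": [[0.629, 4.114], [4.313, 10.0]]}
  ]
}""")

def segments_to_frame_mask(segments, duration, fps=25):
    """Convert [start, end] second intervals into a binary per-frame mask.

    The 25 fps frame rate is an illustrative assumption, not FineLAP's
    actual temporal resolution.
    """
    n_frames = int(round(duration * fps))
    mask = [0] * n_frames
    for start, end in segments:
        lo = int(start * fps)
        hi = min(int(round(end * fps)), n_frames)
        for i in range(lo, hi):
            mask[i] = 1
    return mask

# Samples without "phrases" would simply skip frame-level targets.
for phrase in entry.get("phrases", []):
    mask = segments_to_frame_mask(phrase["segments"], duration=10.0)
```

An actual loader would read one `json.loads` per line of the JSONL file; the dict-access pattern stays the same.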
An example training metadata file with 10 samples is provided at `data/training_metadata_example.jsonl`.

The current training pipeline uses the phrase bank `data/phrase_bank_new_with_FSDLabel_UrbanSED.jsonl`.
Once the dataset metadata JSONL is ready, include it in the `train_data_args.metadata_files` list defined in `config/data_config/data_eat.yaml` or `config/data_config/data_htsat.yaml`.
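For example, the relevant section of `config/data_config/data_eat.yaml` might look like this; the exact indentation and the second file name are assumptions (the placeholder stands in for your own metadata file):

```yaml
train_data_args:
  metadata_files:
    - data/training_metadata_example.jsonl   # provided example
    - data/my_dataset_metadata.jsonl         # hypothetical: your new dataset
```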
## Start Training

Run

```bash
bash scripts/train.sh
```

to start training. This will use the config `config/finelap_eat_config.yaml`. The output will be saved in `exps/${exp_name}`.
## Fine-tuning From a FineLAP Checkpoint

The training code supports loading an existing FineLAP checkpoint before training starts. This is useful when you want to fine-tune from a previously trained model such as `weights/finelap_fixed.pt`.
In `config/finelap_eat_config.yaml`, set:

```yaml
model_args:
  ckpt_path: './weights/finelap_fixed.pt'
```
If `ckpt_path` is an empty string:

```yaml
model_args:
  ckpt_path: ''
```

then no FineLAP checkpoint will be loaded, and training will start from the encoder initialization defined by `audio_encoder_ckpt` and `text_encoder_ckpt`.
This fine-tuning path loads model weights only; it does not restore the optimizer state or resume the previous epoch count.
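The weights-only behavior can be sketched in plain Python. Both `load_for_finetune` and the checkpoint layout (a dict holding `model`, `optimizer`, and `epoch` entries) are illustrative assumptions, not the repo's actual code:

```python
def load_for_finetune(model_state, ckpt):
    """Copy model weights from a checkpoint, ignoring optimizer/epoch state.

    Returns the updated weights and the starting epoch (always 0, since
    this path does not resume the previous training run).
    """
    if not ckpt:  # ckpt_path == '' -> nothing to load
        return model_state, 0
    model_state.update(ckpt["model"])
    # Optimizer state and epoch counter are deliberately NOT restored.
    return model_state, 0

# Usage: fresh model weights plus a previously saved checkpoint.
fresh = {"encoder.weight": [0.0], "head.weight": [0.0]}
ckpt = {
    "model": {"encoder.weight": [1.5], "head.weight": [-0.3]},
    "optimizer": {"step": 12000},
    "epoch": 7,
}

weights, start_epoch = load_for_finetune(fresh, ckpt)
```

In the real pipeline the same idea applies to `torch` state dicts: only the model's parameters are loaded from `ckpt_path`, and a new optimizer is built from the training config.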