Installation
The project is based on Python and PyTorch. We usually run experiments with multi-GPU training.
Tested runtime:
- Python 3.12.3
- PyTorch 2.8.0+cu128
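To confirm that a local environment matches the tested runtime, a small check along the following lines may help. This is only a sketch and not part of the repository: the helper name `check_runtime` is hypothetical, and comparing Python at the major.minor level is an assumption about how strict the match needs to be.

```python
import sys
from importlib import metadata

# Versions listed under "Tested runtime" in this guide.
TESTED_PYTHON = (3, 12)
TESTED_TORCH = "2.8.0"


def check_runtime(torch_version=None):
    """Return a list of human-readable mismatches (empty if everything matches).

    `torch_version` can be passed explicitly; otherwise it is read from the
    installed package metadata without importing torch.
    """
    problems = []
    if sys.version_info[:2] != TESTED_PYTHON:
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            f"differs from tested {TESTED_PYTHON[0]}.{TESTED_PYTHON[1]}"
        )
    if torch_version is None:
        try:
            torch_version = metadata.version("torch")
        except metadata.PackageNotFoundError:
            problems.append("torch is not installed")
            return problems
    # Drop a local build tag such as "+cu128" before comparing.
    if torch_version.split("+")[0] != TESTED_TORCH:
        problems.append(f"torch {torch_version} differs from tested {TESTED_TORCH}")
    return problems


if __name__ == "__main__":
    for line in check_runtime() or ["runtime matches the tested versions"]:
        print(line)
```

A mismatch does not necessarily mean the code will fail; it only flags that the environment differs from the one that was tested.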
Clone the Git repo
$ git clone https://github.com/yyliu01/AuralSAM2
$ cd AuralSAM2
Install dependencies
- create conda env from yaml
$ conda env create -f docs/auralsam2.yml
- activate env
$ conda activate auralsam2
- install PyTorch (recommended: match tested runtime)
# CUDA 12.8 (tested):
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- install Python packages (if needed)
$ pip install -r docs/requirements.txt
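Since experiments are usually run with multi-GPU training, it can be useful to check which CUDA devices PyTorch sees once the dependencies are installed. The sketch below uses only standard `torch.cuda` calls; the function name `describe_gpus` is a hypothetical helper, not something the repository provides.

```python
def describe_gpus():
    """Return a one-line summary of the CUDA devices PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: no CUDA device visible"
    # Collect the name of every visible device, in index order.
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"PyTorch {torch.__version__}: {len(names)} GPU(s): " + ", ".join(names)


print(describe_gpus())
```

If fewer GPUs are reported than expected, the `CUDA_VISIBLE_DEVICES` environment variable is a common thing to check first.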
Prepare dataset
AVSBench (avs.code)
- download and prepare AVSBench under repository root.
- ensure the dataset root path is AVSBench/, with AVSBench/avss_index/metadata.csv in place (and subset folders v1s/, v1m/, v2/)
Ref-AVS (ref-avs.code)
- download and prepare the Ref-AVS (REFAVS) dataset under repository root.
- ensure the dataset root path is REFAVS/, with REFAVS/metadata.csv in place (splits: train, test_s, test_u, test_n)
Checkpoints (shared)
Prepare under repository root:
- ckpts/sam_ckpts/sam2_hiera_large.pt
- ckpts/vggish-10086976.pth
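Before starting a run, the dataset and checkpoint locations described here can be verified with a short script. This is a minimal sketch: the paths are taken from this guide and its workspace structure, while the function name `missing_paths` and the decision to check only these entries are assumptions.

```python
from pathlib import Path

# Paths from the dataset and checkpoint notes, relative to the repository root.
REQUIRED_FILES = [
    "AVSBench/avss_index/metadata.csv",
    "REFAVS/metadata.csv",
    "ckpts/sam_ckpts/sam2_hiera_large.pt",
    "ckpts/vggish-10086976.pth",
]
REQUIRED_DIRS = ["AVSBench/v1s", "AVSBench/v1m", "AVSBench/v2"]


def missing_paths(repo_root="."):
    """Return the required files and folders that are absent under repo_root."""
    root = Path(repo_root)
    missing = [p for p in REQUIRED_FILES if not (root / p).is_file()]
    missing += [p for p in REQUIRED_DIRS if not (root / p).is_dir()]
    return missing


if __name__ == "__main__":
    gaps = missing_paths()
    print("layout looks complete" if not gaps else "missing: " + ", ".join(gaps))
```

Run it from the repository root (the AuralSAM2/ folder), or pass that folder as `repo_root`. It checks only that the paths exist, not that the files are complete or valid.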
Workspace structure
AuralSAM2/
├── avs.code/
│   ├── v1s.code/
│   ├── v1m.code/
│   └── v2.code/
├── ref-avs.code/
├── scripts/
│   ├── run_avs_train.sh
│   └── run_ref_train.sh
├── AVSBench/
│   ├── avss_index
│   │   ├── metadata.csv
│   │   ├── metadata_v1m_man.csv
│   │   └── metadata_v2_man.csv
│   ├── v1m
│   │   ├── 01uIJMwnUvA_0
│   │   ├── 0WxgIKuetYI_0
│   │   ... (419 more)
│   ├── v1s
│   │   ├── --FenyW2i_4_5000_10000
│   │   ├── --ZHUMfueO0_5000_10000
│   │   ... (4927 more)
│   └── v2
│       ├── --KCIeTv6PM_14000_24000
│       ├── --iSerV5DbY_68000_78000
│       ... (5995 more)
├── REFAVS/
│   ├── gt_mask
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4000 more)
│   ├── media
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4300 more)
│   └── metadata.csv
├── ckpts/
│   ├── sam_ckpts/
│   │   └── sam2_hiera_large.pt
│   └── vggish-10086976.pth
└── docs/
    ├── installation.md
    ├── before_start.md
    ├── requirements.txt
    └── auralsam2.yml
Notes
- use docs/before_start.md for training and inference commands.
- if wandb is not needed, disable online logging in your config.
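The config key that controls wandb logging is not documented here. Independently of the project config, wandb itself honors the `WANDB_MODE` environment variable, so one possible way to avoid online syncing is sketched below. The helper name `use_offline_wandb` is hypothetical, and whether the training scripts need any further change is an assumption to verify against docs/before_start.md.

```python
import os


def use_offline_wandb(mode="offline"):
    """Set WANDB_MODE so wandb does not sync runs to the server.

    "offline" keeps logs on disk for a later `wandb sync`;
    "disabled" turns wandb calls into no-ops.
    """
    if mode not in ("offline", "disabled"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # Must be set before wandb.init() is called in the same process.
    os.environ["WANDB_MODE"] = mode
    return mode


if __name__ == "__main__":
    print("WANDB_MODE =", use_offline_wandb())
```

The same effect can be had by exporting `WANDB_MODE=offline` in the shell before launching a training script.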