# Installation
The project is based on Python and PyTorch. We usually run experiments with multi-GPU training.
Tested runtime:
- Python `3.12.3`
- PyTorch `2.8.0+cu128`
## 📥 Clone the Git repo
``` shell
$ git clone https://github.com/yyliu01/AuralSAM2
$ cd AuralSAM2
```
## 🧩 Install dependencies
1) create conda env from yaml
```shell
$ conda env create -f docs/auralsam2.yml
```
2) activate env
```shell
$ conda activate auralsam2
```
3) install PyTorch (recommended: match tested runtime)
```shell
# CUDA 12.8 (tested):
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```
4) install python packages (if needed)
```shell
$ pip install -r docs/requirements.txt
```
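After the steps above, it can help to confirm that the active environment matches the tested runtime. The following is a minimal sketch, not part of the repository: `check_runtime` is a hypothetical helper name, and the `PYTHON` variable is an assumption that lets you point it at a specific interpreter.

```shell
# Hypothetical helper (not shipped with the repo): print the Python and
# PyTorch versions of the active environment and whether CUDA is visible.
check_runtime() {
    "${PYTHON:-python}" - <<'EOF'
import sys

print("python", sys.version.split()[0])
try:
    import torch
except ImportError:
    # PyTorch is missing from this environment; report it and fail.
    print("torch not installed")
    sys.exit(1)

print("torch", torch.__version__)
print("cuda build", torch.version.cuda)
print("cuda available", torch.cuda.is_available())
print("gpu count", torch.cuda.device_count())
EOF
}
```

Run `check_runtime` after `conda activate auralsam2`. On a setup matching the tested runtime, the reported versions should line up with Python `3.12.3` and PyTorch `2.8.0+cu128`.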
## πŸ—‚οΈ Prepare dataset
### AVSBench (`avs.code`)
1) download and prepare AVSBench under repository root.
2) ensure the following paths exist:
- `AVSBench/`
- `AVSBench/avss_index/metadata.csv`
- the subset folders `AVSBench/v1s/`, `AVSBench/v1m/`, `AVSBench/v2/`
### Ref-AVS (`ref-avs.code`)
1) download and prepare the Ref-AVS (REFAVS) dataset under repository root.
2) ensure the following paths exist:
- `REFAVS/`
- `REFAVS/metadata.csv` (splits: `train`, `test_s`, `test_u`, `test_n`)
### Checkpoints (shared)
Prepare under repository root:
- `ckpts/sam_ckpts/sam2_hiera_large.pt`
- `ckpts/vggish-10086976.pth`
## πŸ—οΈ Workspace structure
```shell
AuralSAM2/
├── avs.code/
│   ├── v1s.code/
│   ├── v1m.code/
│   └── v2.code/
├── ref-avs.code/
├── scripts/
│   ├── run_avs_train.sh
│   └── run_ref_train.sh
├── AVSBench/
│   ├── avss_index
│   │   ├── metadata.csv
│   │   ├── metadata_v1m_man.csv
│   │   └── metadata_v2_man.csv
│   ├── v1m
│   │   ├── 01uIJMwnUvA_0
│   │   ├── 0WxgIKuetYI_0
│   │   ... (419 more)
│   ├── v1s
│   │   ├── --FenyW2i_4_5000_10000
│   │   ├── --ZHUMfueO0_5000_10000
│   │   ... (4927 more)
│   └── v2
│       ├── --KCIeTv6PM_14000_24000
│       ├── --iSerV5DbY_68000_78000
│       ... (5995 more)
├── REFAVS/
│   ├── gt_mask
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4000 more)
│   ├── media
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4300 more)
│   └── metadata.csv
├── ckpts/
│   ├── sam_ckpts/
│   │   └── sam2_hiera_large.pt
│   └── vggish-10086976.pth
└── docs/
    ├── installation.md
    ├── before_start.md
    ├── requirements.txt
    └── auralsam2.yml
```
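Before training, the layout above can be checked with a short script. This is a minimal sketch, not part of the repository: `check_layout` is a hypothetical helper, and the list of paths is taken from the dataset and checkpoint sections of this page.

```shell
# Hypothetical helper (not shipped with the repo): report every expected
# dataset or checkpoint path that is missing under the given root.
check_layout() {
    local root="${1:-.}"   # repository root, defaults to the current directory
    local missing=0
    local path
    for path in \
        AVSBench/avss_index/metadata.csv \
        AVSBench/v1s AVSBench/v1m AVSBench/v2 \
        REFAVS/metadata.csv REFAVS/media REFAVS/gt_mask \
        ckpts/sam_ckpts/sam2_hiera_large.pt \
        ckpts/vggish-10086976.pth
    do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # Non-zero exit status when at least one path is absent.
    return "$missing"
}
```

Calling `check_layout .` from the repository root prints one `missing: ...` line per absent path and returns a non-zero status if anything is missing, so it can gate a training script.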
## πŸ“ Notes
- use `docs/before_start.md` for training and inference commands.
- if wandb is not needed, disable online logging in your config.
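One way to keep wandb from syncing without editing the project config is the `WANDB_MODE` environment variable that wandb itself reads. This is a general wandb setting rather than something this repository documents, and an explicit mode set in the training code or config may take precedence.

```shell
# Assumption: the training scripts do not override the wandb mode explicitly.
# "offline" keeps logs on local disk only; "disabled" turns wandb into a no-op.
export WANDB_MODE=offline
```

Set the variable in the same shell session before launching training.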