# Installation

The project is based on Python and PyTorch. We usually run experiments with multi-GPU training.

Tested runtime:
- Python `3.12.3`
- PyTorch `2.8.0+cu128`
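
To compare the active environment against the tested runtime, a short script along these lines may help. It is a minimal sketch: the helper name `runtime_report` is hypothetical and not part of the repository, and it only reports versions rather than enforcing them.

```python
import sys

# Versions listed under "Tested runtime" above.
TESTED_PYTHON = "3.12.3"
TESTED_TORCH = "2.8.0+cu128"


def runtime_report():
    """Return human-readable lines comparing this environment to the tested one.

    Hypothetical helper for illustration; not shipped with the repository.
    """
    found_python = ".".join(str(part) for part in sys.version_info[:3])
    lines = [f"Python {found_python} (tested: {TESTED_PYTHON})"]
    try:
        # Imported lazily so the script also runs before PyTorch is installed.
        import torch
    except ImportError:
        lines.append(f"PyTorch not installed (tested: {TESTED_TORCH})")
        return lines
    lines.append(f"PyTorch {torch.__version__} (tested: {TESTED_TORCH})")
    lines.append(
        f"CUDA available: {torch.cuda.is_available()}, "
        f"GPUs visible: {torch.cuda.device_count()}"
    )
    return lines


if __name__ == "__main__":
    print("\n".join(runtime_report()))
```

A version mismatch does not necessarily mean the code will fail; the listed versions are simply the ones the project was tested with.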

## 📥 Clone the Git repo

``` shell
$ git clone https://github.com/yyliu01/AuralSAM2
$ cd AuralSAM2
```

## 🧩 Install dependencies

1) create conda env from yaml
```shell
$ conda env create -f docs/auralsam2.yml
```

2) activate env
```shell
$ conda activate auralsam2
```

3) install PyTorch (recommended: match tested runtime)
```shell
# CUDA 12.8 (tested):
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

4) install python packages (if needed)
```shell
$ pip install -r docs/requirements.txt
```

## 🗂️ Prepare dataset

### AVSBench (`avs.code`)

1) download AVSBench and place it under the repository root.
2) ensure the dataset root `AVSBench/` contains:
   - `avss_index/metadata.csv`
   - the subset folders `v1s/`, `v1m/`, `v2/`

### Ref-AVS (`ref-avs.code`)

1) download the Ref-AVS (REFAVS) dataset and place it under the repository root.
2) ensure the dataset root `REFAVS/` contains:
   - `metadata.csv` (splits: `train`, `test_s`, `test_u`, `test_n`)
   - the folders `gt_mask/` and `media/`
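
To confirm that the expected splits are present in a metadata file, the rows per split can be counted. The sketch below assumes that the split name lives in a CSV column called `split`; this guide does not state the column name, so treat it as an assumption and adjust `split_column` if the header differs. The function name `count_splits` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_splits(metadata_csv, split_column="split"):
    """Count rows per split value in a metadata CSV.

    ASSUMPTION: the split name is stored in a column called `split`;
    pass a different `split_column` if the file uses another header.
    """
    with Path(metadata_csv).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or split_column not in reader.fieldnames:
            raise KeyError(f"column {split_column!r} not found in {metadata_csv}")
        return Counter(row[split_column] for row in reader)


if __name__ == "__main__":
    path = Path("REFAVS/metadata.csv")
    if path.is_file():
        # Prints one count per split value found in the file.
        print(count_splits(path))
    else:
        print(f"{path} not found")
```

For Ref-AVS, the printed counter would be expected to contain the keys `train`, `test_s`, `test_u`, and `test_n` if the column assumption holds.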


### Checkpoints (shared)

Place the following checkpoints under the repository root:

- `ckpts/sam_ckpts/sam2_hiera_large.pt`
- `ckpts/vggish-10086976.pth`
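
The dataset and checkpoint locations listed in this section can be checked before training with a small script. This is a sketch only: `missing_paths` is a hypothetical helper, not part of the repository, and it checks that the paths exist without validating their contents.

```python
from pathlib import Path

# Paths named in this guide, relative to the repository root.
EXPECTED_PATHS = [
    "AVSBench/avss_index/metadata.csv",
    "AVSBench/v1s",
    "AVSBench/v1m",
    "AVSBench/v2",
    "REFAVS/metadata.csv",
    "ckpts/sam_ckpts/sam2_hiera_large.pt",
    "ckpts/vggish-10086976.pth",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under `repo_root`.

    Hypothetical helper for illustration; it tests existence only.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing paths:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected dataset and checkpoint paths were found.")
```

Run it from the repository root, or pass the root explicitly, e.g. `missing_paths("/path/to/AuralSAM2")`.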

## 🏗️ Workspace structure

```shell
AuralSAM2/
├── avs.code/
│   ├── v1s.code/
│   ├── v1m.code/
│   └── v2.code/
├── ref-avs.code/
├── scripts/
│   ├── run_avs_train.sh
│   └── run_ref_train.sh
├── AVSBench/
│   ├── avss_index
│   │   ├── metadata.csv
│   │   ├── metadata_v1m_man.csv
│   │   └── metadata_v2_man.csv
│   ├── v1m
│   │   ├── 01uIJMwnUvA_0
│   │   ├── 0WxgIKuetYI_0
│   │   ... (419 more)
│   ├── v1s
│   │   ├── --FenyW2i_4_5000_10000
│   │   ├── --ZHUMfueO0_5000_10000
│   │   ... (4927 more)
│   └── v2
│       ├── --KCIeTv6PM_14000_24000
│       ├── --iSerV5DbY_68000_78000
│       ... (5995 more)
├── REFAVS/
│   ├── gt_mask
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4000 more)
│   ├── media
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4300 more)
│   └── metadata.csv
├── ckpts/
│   ├── sam_ckpts/
│   │   └── sam2_hiera_large.pt
│   └── vggish-10086976.pth
└── docs/
    ├── installation.md
    ├── before_start.md
    ├── requirements.txt
    └── auralsam2.yml
```

## 📝 Notes

- use `docs/before_start.md` for training and inference commands.
- if wandb is not needed, disable online logging in your config.
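
The name of the config option that controls logging is not given here. As a generic alternative that does not depend on the repository's config, wandb reads the `WANDB_MODE` environment variable. The sketch below sets it from Python before any run is initialised; the helper name `set_wandb_mode` is hypothetical.

```python
import os


def set_wandb_mode(mode="offline"):
    """Set WANDB_MODE so that wandb does not sync runs online.

    "offline" keeps logs on local disk only; "disabled" turns wandb
    calls into no-ops. Call this before `wandb.init` is executed.
    Hypothetical helper; it is not part of the repository's config.
    """
    allowed = {"online", "offline", "disabled"}
    if mode not in allowed:
        raise ValueError(f"mode must be one of {sorted(allowed)}, got {mode!r}")
    os.environ["WANDB_MODE"] = mode
    return mode


if __name__ == "__main__":
    set_wandb_mode("offline")
    print(f"WANDB_MODE={os.environ['WANDB_MODE']}")
```

The same effect can be had by exporting `WANDB_MODE=offline` in the shell before launching a training script.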