metadata
pipeline_tag: image-segmentation
AuralSAM2
This repository contains the weights for AuralSAM2, as presented in the paper AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting.
AuralSAM2 integrates audio into the Segment Anything Model 2 (SAM2) while preserving its promptable segmentation capability. It introduces the AuralFuser module, which fuses audio and visual features to generate sparse and dense prompts. These prompts propagate auditory cues across SAM2's feature pyramid, enabling audio-guided object segmentation.
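As a rough illustration of the idea of turning fused audio-visual features into sparse and dense prompts at each pyramid level, here is a minimal toy sketch in plain Python. It is not the AuralFuser implementation and does not use SAM2. The function names (`fuse_level`, `fuse_pyramid`), the dot-product attention, and the way the prompts are formed are all assumptions made for illustration only; see the paper and official repository for the actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_level(audio_vec, visual_tokens):
    """Toy fusion for one pyramid level (hypothetical, not AuralFuser).

    audio_vec:     one audio embedding, a list of floats.
    visual_tokens: visual feature vectors, one per flattened spatial position.
    Returns (sparse_prompt, dense_prompt).
    """
    dim = len(audio_vec)
    scale = 1.0 / math.sqrt(dim)
    # Audio-to-visual attention: how strongly each position matches the audio.
    weights = softmax([dot(audio_vec, tok) * scale for tok in visual_tokens])
    # Assumed sparse prompt: a single attention-pooled visual token.
    sparse = [
        sum(w * tok[i] for w, tok in zip(weights, visual_tokens))
        for i in range(dim)
    ]
    # Assumed dense prompt: the audio embedding weighted per spatial position.
    dense = [[w * a for a in audio_vec] for w in weights]
    return sparse, dense


def fuse_pyramid(audio_vec, pyramid):
    """Apply the toy fusion independently at every pyramid level."""
    return [fuse_level(audio_vec, level) for level in pyramid]


# Two-level toy pyramid: 4 positions at the fine level, 1 at the coarse level.
audio = [1.0, 0.0]
pyramid = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]],
    [[1.0, 0.0]],
]
prompts = fuse_pyramid(audio, pyramid)
```

In this sketch each level yields one sparse token plus a per-position dense map, so the audio cue is present at every scale of the pyramid. The real model learns these fusions; the fixed dot-product attention here only stands in for that.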
Installation
Please install the dependencies and prepare the datasets by following the installation document in the official repository.
Getting started
To reproduce the results, please follow the instruction document.
Citation
If you find this work helpful for your research, please consider citing:
@article{liu2025auralsam2,
title={AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting},
author={Liu, Yuyuan and Chen, Yuanhong and Wang, Chong and Han, Junlin and Wu, Junde and Peng, Can and Chen, Jingkun and Tian, Yu and Carneiro, Gustavo},
journal={arXiv preprint arXiv:2506.01015},
year={2025}
}