---
pipeline_tag: image-segmentation
---

# AuralSAM2

This repository contains the weights for **AuralSAM2**, as presented in the paper [AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting](https://huggingface.co/papers/2506.01015).

AuralSAM2 integrates audio into the Segment Anything Model 2 (SAM2) while preserving its promptable segmentation capability. It introduces the **AuralFuser** module, which fuses audio and visual features to generate sparse and dense prompts. These prompts propagate auditory cues across SAM2's feature pyramid, enabling audio-guided object segmentation.

[**Paper**](https://huggingface.co/papers/2506.01015) | [**GitHub Code**](https://github.com/yyliu01/AuralSAM2)

<img src="./docs/overview.png" width="850" alt="AuralSAM2 overview" />

## Installation

Please install the dependencies and prepare the dataset by following the [***installation***](./docs/installation.md) document in the official repository.

## Getting started

Please follow the [***instructions***](./docs/before_start.md) document to reproduce the results.

## Citation

If you find this work helpful for your research, please consider citing:

```bibtex
@article{liu2025auralsam2,
  title={AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting},
  author={Liu, Yuyuan and Chen, Yuanhong and Wang, Chong and Han, Junlin and Wu, Junde and Peng, Can and Chen, Jingkun and Tian, Yu and Carneiro, Gustavo},
  journal={arXiv preprint arXiv:2506.01015},
  year={2025}
}
```