---
pipeline_tag: image-segmentation
---

# AuralSAM2

This repository contains the weights for **AuralSAM2**, as presented in the paper [AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting](https://huggingface.co/papers/2506.01015).

AuralSAM2 integrates audio into the Segment Anything Model 2 (SAM2) while preserving its promptable segmentation capability. It introduces the **AuralFuser** module, which fuses audio and visual features to generate sparse and dense prompts. These prompts propagate auditory cues across SAM2's feature pyramid, enabling audio-guided object segmentation.

[**Paper**](https://huggingface.co/papers/2506.01015) | [**GitHub Code**](https://github.com/yyliu01/AuralSAM2)

*Figure: AuralSAM2 overview.*

## Installation

Please install the dependencies and prepare the datasets by following the [***installation***](./docs/installation.md) document in the official repository.

## Getting started

Please follow the [***instructions***](./docs/before_start.md) document to reproduce the results.

## Citation

If you find this work helpful for your research, please consider citing:

```bibtex
@article{liu2025auralsam2,
  title={AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting},
  author={Liu, Yuyuan and Chen, Yuanhong and Wang, Chong and Han, Junlin and Wu, Junde and Peng, Can and Chen, Jingkun and Tian, Yu and Carneiro, Gustavo},
  journal={arXiv preprint arXiv:2506.01015},
  year={2025}
}
```