---
pipeline_tag: image-segmentation
---
# AuralSAM2
This repository contains the weights for **AuralSAM2**, as presented in the paper [AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting](https://huggingface.co/papers/2506.01015).

AuralSAM2 integrates audio into the Segment Anything Model 2 (SAM2) while preserving its promptable segmentation capability. It introduces the **AuralFuser** module, which fuses audio and visual features to generate sparse and dense prompts. These prompts propagate auditory cues across SAM2's feature pyramid, enabling audio-guided object segmentation.

[**Paper**](https://huggingface.co/papers/2506.01015) | [**GitHub Code**](https://github.com/yyliu01/AuralSAM2)

<img src="./docs/overview.png" width="850" alt="AuralSAM2 overview" />
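
To make the prompting idea above more concrete, here is a toy sketch in plain Python. It is **not** the AuralFuser implementation and does not use the released weights. It assumes, purely for illustration, that fusion can be approximated by two-way cross-attention between audio tokens and the flattened visual features of each pyramid level. The names `cross_attend` and `toy_fuse` are hypothetical and do not come from the official repository.

```python
import math


def cross_attend(queries, keys):
    """Scaled dot-product attention where the keys also serve as values."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append(
            [sum((w / total) * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
        )
    return attended


def toy_fuse(audio_tokens, visual_pyramid):
    """Return (sparse_prompts, dense_prompts), one entry per pyramid level.

    audio_tokens:   list of D-dimensional vectors (hypothetical audio embedding).
    visual_pyramid: list of levels, each a list of D-dimensional vectors
                    (the spatial grid of that level, flattened).
    """
    sparse_prompts, dense_prompts = [], []
    for level in visual_pyramid:
        # Sparse prompt: audio tokens gather visual context (residual add).
        visual_context = cross_attend(audio_tokens, level)
        sparse_prompts.append(
            [[a + c for a, c in zip(tok, ctx)]
             for tok, ctx in zip(audio_tokens, visual_context)]
        )
        # Dense prompt: every visual location gathers audio context (residual add).
        audio_context = cross_attend(level, audio_tokens)
        dense_prompts.append(
            [[v + c for v, c in zip(pix, ctx)]
             for pix, ctx in zip(level, audio_context)]
        )
    return sparse_prompts, dense_prompts
```

In this sketch the sparse prompts keep the shape of the audio tokens, while the dense prompts keep the shape of each pyramid level, so auditory cues are injected at every scale. For the real architecture and training details, refer to the paper and the GitHub code linked above.
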
## Installation
Please install the dependencies and prepare the datasets by following the [***installation***](./docs/installation.md) document in the official repository.
## Getting started
Please follow the [***instructions***](./docs/before_start.md) document to reproduce the results.
## Citation
If you find this work helpful for your research, please consider citing:
```bibtex
@article{liu2025auralsam2,
title={AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting},
author={Liu, Yuyuan and Chen, Yuanhong and Wang, Chong and Han, Junlin and Wu, Junde and Peng, Can and Chen, Jingkun and Tian, Yu and Carneiro, Gustavo},
journal={arXiv preprint arXiv:2506.01015},
year={2025}
}
```