---
pipeline_tag: image-segmentation
---

# AuralSAM2

This repository contains the weights for **AuralSAM2**, as presented in the paper [AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting](https://huggingface.co/papers/2506.01015).

AuralSAM2 integrates audio into the Segment Anything Model 2 (SAM2) while preserving its promptable segmentation capability. It introduces the **AuralFuser** module, which fuses audio and visual features to generate sparse and dense prompts. These prompts propagate auditory cues across SAM2's feature pyramid, enabling audio-guided object segmentation.
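
The idea of turning one audio embedding into per-level sparse and dense prompts can be pictured with a small toy example. This is only an illustrative sketch under simplifying assumptions, not the AuralFuser module: the function name `toy_audio_visual_prompts`, the single-vector audio embedding, the dot-product attention, and the additive dense prompt are all hypothetical choices made for illustration. The actual design is defined in the paper and the official code.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def toy_audio_visual_prompts(audio, pyramid):
    """Toy audio-visual fusion (hypothetical, NOT the AuralFuser module).

    audio:   (d,) audio embedding.
    pyramid: list of visual feature maps, each of shape (d, H, W).
    Returns one sparse prompt token per level, shape (levels, d), and a
    list of dense prompts with the same shapes as the input feature maps.
    """
    d = audio.shape[0]
    sparse, dense = [], []
    for feat in pyramid:
        _, h, w = feat.shape
        tokens = feat.reshape(d, h * w)            # flatten spatial positions
        scores = audio @ tokens / np.sqrt(d)       # audio-to-pixel similarity
        attn = softmax(scores)                     # attention over positions
        sparse.append(tokens @ attn)               # attention-pooled visual token
        fused = tokens + np.outer(audio, attn)     # inject audio where it attends
        dense.append(fused.reshape(d, h, w))
    return np.stack(sparse), dense


# Random stand-ins for an audio embedding and a three-level feature pyramid.
rng = np.random.default_rng(0)
audio = rng.standard_normal(8)
pyramid = [rng.standard_normal((8, s, s)) for s in (16, 8, 4)]
sparse, dense = toy_audio_visual_prompts(audio, pyramid)
print(sparse.shape)               # (3, 8)
print([p.shape for p in dense])   # [(8, 16, 16), (8, 8, 8), (8, 4, 4)]
```

In this toy version, the sparse prompt is a single token per pyramid level and the dense prompt keeps the spatial layout of each level, which mirrors the sparse/dense distinction described above without claiming anything about the real architecture.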

[**Paper**](https://huggingface.co/papers/2506.01015) | [**GitHub Code**](https://github.com/yyliu01/AuralSAM2)

<img src="./docs/overview.png" width="850" alt="AuralSAM2 overview" />

## Installation
Please install the dependencies and prepare the datasets by following the [***installation***](./docs/installation.md) document in the official repository.

## Getting started
Please follow the [***instructions***](./docs/before_start.md) to reproduce the results.

## Citation
If you find this work helpful for your research, please consider citing:

```bibtex
@article{liu2025auralsam2,
  title={AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting},
  author={Liu, Yuyuan and Chen, Yuanhong and Wang, Chong and Han, Junlin and Wu, Junde and Peng, Can and Chen, Jingkun and Tian, Yu and Carneiro, Gustavo},
  journal={arXiv preprint arXiv:2506.01015},
  year={2025}
}
```