Papers
arxiv:2506.18843

USAD: Universal Speech and Audio Representation via Distillation

Published on Jun 23, 2025
· Submitted by
Heng-Jui Chang
on Jun 25, 2025
Authors:
,

Abstract

USAD integrates diverse audio types using efficient layer-to-layer distillation from domain-specific models, achieving competitive performance across various benchmarks with a single encoder.

AI-generated summary

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.

Community

Paper author Paper submitter

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2506.18843
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.18843 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.18843 in a Space README.md to link it from this page.

Collections including this paper 3