arxiv:2605.20266

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Published on May 18 · Submitted by Yang Xiao on May 21

Abstract

AI-generated summary: Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies.

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. The project repository is available on GitHub at https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

Community

This survey provides a timely and comprehensive overview of trustworthiness issues in Large Audio Language Models. It clearly identifies the unique risks introduced by continuous acoustic inputs, including cross-modal attacks, acoustic backdoors, privacy leakage, hallucination, and fairness concerns. The proposed roadmap toward defense-in-depth architectures and intrinsic representation engineering is valuable. A stronger empirical comparison of existing LALMs and their defense coverage would further improve the survey. Overall, this is a useful reference for researchers working on trustworthy audio-language intelligence.

Paper submitter

Most conversations about Multimodal LLMs and universal auditory intelligence focus purely on model capabilities and performance scaling. In our new comprehensive survey, "A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook", we make a critical argument: for real-world deployment, empirical performance means nothing without intrinsic trustworthiness. The evidence is hard to ignore. Recent benchmarks reveal that the transition to unified end-to-end audio frameworks has dramatically expanded the attack surface.

We evaluate the state-of-the-art landscape across six analytical pillars: Hallucination, Robustness, Safety, Privacy, Fairness, and Authentication. The survey systematically uncovers a profound imbalance between a mature offensive ecosystem and fragmented, reactive defenses. To bridge this chasm, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering.

If you're building real-time full-duplex conversational agents, voice assistants, speech security systems, or anything that interacts with live acoustic data, we hope you'll find something vital here.

📄 Paper: https://arxiv.org/abs/2605.20266
💻 Project: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs
😊 Hugging Face: https://huggingface.co/papers/2605.20266

Huge thanks to my incredible co-authors
