arxiv:2605.20266

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Published on May 18

Submitted by Yang Xiao on May 21

Nanyang Technological University

Abstract

AI-generated summary: Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies.

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

Community

aidawang

about 4 hours ago

This survey provides a timely and comprehensive overview of trustworthiness issues in Large Audio Language Models. It clearly identifies the unique risks introduced by continuous acoustic inputs, including cross-modal attacks, acoustic backdoors, privacy leakage, hallucination, and fairness concerns. The proposed roadmap toward defense-in-depth architectures and intrinsic representation engineering is valuable. A stronger empirical comparison of existing LALMs and their defense coverage would further improve the survey. Overall, this is a useful reference for researchers working on trustworthy audio-language intelligence.

AustinXiao

Paper submitter about 3 hours ago

Most conversations about Multimodal LLMs and universal auditory intelligence focus purely on model capabilities and performance scaling. In our new comprehensive survey, "A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook", we make a critical argument: for real-world deployment, empirical performance means nothing without intrinsic trustworthiness. The evidence is hard to ignore. Recent benchmarks reveal that the transition to unified end-to-end audio frameworks has dramatically expanded the attack surface.

We evaluate the state-of-the-art landscape across six analytical pillars: Hallucination, Robustness, Safety, Privacy, Fairness, and Authentication. The survey systematically uncovers a profound imbalance between a mature offensive ecosystem and fragmented, reactive defenses. To bridge this chasm, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering.

If you're building real-time full-duplex conversational agents, voice assistants, speech security systems, or anything that interacts with live acoustic data, we hope you'll find something vital here.

📄 Paper: https://arxiv.org/abs/2605.20266
💻 Project: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs
😊 Hugging Face: https://huggingface.co/papers/2605.20266

Huge thanks to my incredible co-authors