[arXiv:2604.22245](https://arxiv.org/abs/2604.22245)

# LAT-Audio

## Overview

LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**. Unlike existing models that degrade on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm**, enabling models to maintain temporal consistency over audio up to 30 minutes.

The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded on this timeline. During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces:

- temporal hallucination (invalid timestamps)
- timestamp drift (progressive misalignment over time)

## Model Description

LAT-Audio formulates long-form audio understanding as a structured reasoning process:

1. **Global Timeline Construction**
   The model summarizes the audio into a coarse temporal structure.
2. **Global-to-Local Reasoning**
   Downstream tasks are performed conditioned on the global timeline.
3. **Think-With-Audio Chain-of-Thought (TWA-CoT)**
   The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.

This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.
Figure 1: Overall framework of LAT-Audio.
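
The three stages in the Model Description can be read as a simple control loop. The sketch below is only an illustration of that loop under stated assumptions and is not the LAT-Audio implementation or its API: `TimelineEntry`, `is_valid_span`, `clip`, and `global_to_local` are hypothetical names, the three callables (`build_timeline`, `pick_span`, `refine`) stand in for model calls that this README does not specify, and the explicit timestamp check is an assumed way to make "invalid timestamps" concrete.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class TimelineEntry:
    """One coarse segment of the global timeline (hypothetical structure)."""
    start: float   # seconds from the beginning of the audio
    end: float     # seconds from the beginning of the audio
    summary: str   # short description of what happens in the segment


def is_valid_span(start: float, end: float, duration: float) -> bool:
    """A span is valid if it is non-empty and lies inside the audio."""
    return 0.0 <= start < end <= duration


def clip(audio: List[float], sample_rate: int, start: float, end: float) -> List[float]:
    """Return the samples between start and end (both in seconds)."""
    return audio[int(start * sample_rate):int(end * sample_rate)]


def global_to_local(
    audio: List[float],
    sample_rate: int,
    build_timeline: Callable[[List[float], int], List[TimelineEntry]],
    pick_span: Callable[[List[TimelineEntry], Any], Optional[Span]],
    refine: Callable[[Any, List[float], Span], Any],
    max_steps: int = 4,
) -> Tuple[List[TimelineEntry], Any]:
    """Sketch of the loop: global timeline first, then iterative local evidence."""
    duration = len(audio) / sample_rate

    # Stage 1: coarse global timeline. Dropping out-of-range entries is an
    # assumption of this sketch, not something the README describes.
    timeline = [
        entry for entry in build_timeline(audio, sample_rate)
        if is_valid_span(entry.start, entry.end, duration)
    ]

    # Stages 2 and 3: reason over the timeline, fetching local audio as evidence.
    answer: Any = None
    for _ in range(max_steps):
        span = pick_span(timeline, answer)
        if span is None:  # the (stand-in) model decides it has enough evidence
            break
        start, end = span
        if not is_valid_span(start, end, duration):
            continue      # skip a span that falls outside the audio
        answer = refine(answer, clip(audio, sample_rate, start, end), span)

    return timeline, answer
```

In this sketch the callables are passed in so the loop can be exercised with simple stubs; in practice each would presumably be a call into the model, which is outside what this README documents.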