[arXiv:2604.22245](https://arxiv.org/abs/2604.22245)
|
|
| # LAT-Audio |
|
|
| ## Overview |
|
|
| LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**. |
|
|
Unlike existing models that degrade on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm**, enabling the model to maintain temporal consistency over audio up to 30 minutes long.
|
|
| The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded on this timeline. |
| During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces: |
|
|
| - temporal hallucination (invalid timestamps) |
| - timestamp drift (progressive misalignment over time) |
|
|
|
|
|
|
| ## Model Description |
|
|
| LAT-Audio formulates long-form audio understanding as a structured reasoning process: |
|
|
| 1. **Global Timeline Construction** |
| The model summarizes the audio into a coarse temporal structure. |
|
|
| 2. **Global-to-Local Reasoning** |
| Downstream tasks are performed conditioned on the global timeline. |
|
|
| 3. **Think-With-Audio Chain-of-Thought (TWA-CoT)** |
| The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence. |
|
|
| This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail. |
|
|
| <p align="center"> |
| <img src="./Figures/LAT-Audio.png" width="1200"/> |
| <em>Figure 1: Overall framework of LAT-Audio.</em> |
| </p> |
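

As a rough illustration of this control flow only, the sketch below shows how a global timeline could ground an iterative, evidence-retrieving reasoning loop. All names here (`TimelineEntry`, `reason_step`, `retrieve_segment`, etc.) are hypothetical placeholders, not the actual LAT-Audio API; refer to the official repository for the real inference code.

```python
"""Hypothetical sketch of a global-to-local, Think-With-Audio reasoning loop.

Every type and function below is an illustrative stub, not the LAT-Audio API.
"""
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    start: float   # seconds
    end: float     # seconds
    summary: str   # coarse description of this span


@dataclass
class ReasoningStep:
    is_final: bool
    answer: str = ""
    start: float = 0.0   # segment to listen to next, if not final
    end: float = 0.0


def build_global_timeline(audio) -> list[TimelineEntry]:
    """Step 1: summarize the audio into a coarse temporal structure (stub)."""
    return [TimelineEntry(0.0, 60.0, "opening remarks"),
            TimelineEntry(60.0, 300.0, "main discussion")]


def reason_step(trace: list[str]) -> ReasoningStep:
    """One tool-augmented reasoning step over the current trace (stub)."""
    if any(line.startswith("Evidence") for line in trace):
        return ReasoningStep(is_final=True, answer="(answer grounded in evidence)")
    return ReasoningStep(is_final=False, start=60.0, end=90.0)


def retrieve_segment(audio, start: float, end: float) -> str:
    """Step 3 helper: fetch and describe a local audio segment (stub)."""
    return f"description of audio[{start:.0f}s-{end:.0f}s]"


def answer_with_twa_cot(audio, question: str, max_steps: int = 5) -> str:
    timeline = build_global_timeline(audio)                      # global timeline construction
    trace = [f"Timeline: {timeline}", f"Question: {question}"]   # global-to-local grounding
    for _ in range(max_steps):
        step = reason_step(trace)
        if step.is_final:
            return step.answer
        evidence = retrieve_segment(audio, step.start, step.end)  # think *with* audio
        trace.append(f"Evidence [{step.start:.0f}s-{step.end:.0f}s]: {evidence}")
    return "(no answer within step budget)"


print(answer_with_twa_cot(audio=None, question="When does the main discussion start?"))
```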
|
|
| ## Model Variants |
|
|
We provide two model variants:

| | Model | Reasoning | Training Data | Description | |
| |------|----------|--------------|-------------| |
| | **LAT-Audio** | Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference | |
| **LAT-Audio-Base** | No | LAT-Chronicle + in-house | Direct modeling baseline fine-tuned from Qwen3-Omni with additional in-house data; faster and simpler inference |
|
|
|
|
|
|
| ## Quick Start |
|
|
|
|
Download the models from the Hugging Face Hub:

```bash
| pip install -U "huggingface_hub[cli]" |
| huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio |
| huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base |
| ``` |
|
|
| For detailed inference methods and examples, please refer to the official repository: |
| 👉 https://github.com/alanshaoTT/LAT-Audio-Repo |
|
|
| ## Citation |
|
|
| If you find this work useful, please cite: |
|
|
| ```bibtex |
| @article{shao2026lataudio, |
| title={Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding}, |
| author={Shao, Mingchen and Su, Hang and Tian, Wenjie and Mu, Bingshen and Lin, Zhennan and Fan, Lichun and Luo, Zhenbo and Luan, Jian and Xie, Lei}, |
| journal={arXiv preprint arXiv:2604.22245}, |
| year={2026} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| For questions, feedback, or collaboration: |
|
|
| 📧 mcshao@mail.nwpu.edu.cn |