[arXiv:2604.22245](https://arxiv.org/abs/2604.22245)
# LAT-Audio-Base
## Overview
LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**.
Unlike existing models, whose performance degrades on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm** that maintains temporal consistency over audio up to 30 minutes long.
The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded on this timeline.
During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces:
- temporal hallucination (invalid timestamps)
- timestamp drift (progressive misalignment over time)
## Model Description
LAT-Audio formulates long-form audio understanding as a structured reasoning process:
1. **Global Timeline Construction**
The model summarizes the audio into a coarse temporal structure.
2. **Global-to-Local Reasoning**
Downstream tasks are performed conditioned on the global timeline.
3. **Think-With-Audio Chain-of-Thought (TWA-CoT)**
The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.
This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.
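The three stages above can be sketched as a toy loop. Everything below is an illustrative placeholder, not the released API: the function names, the `(start, end, summary)` event format, and the mock data are all assumptions made for exposition.

```python
# Illustrative sketch of LAT-Audio's three-stage global-to-local reasoning.
# All names and data here are hypothetical, chosen only to show the flow.

def build_global_timeline(audio_events):
    """Stage 1: compress the audio into a coarse, time-ordered timeline."""
    return sorted(audio_events, key=lambda e: e[0])

def select_relevant_spans(timeline, keyword):
    """Stage 2: ground the query on the global timeline to pick candidate spans."""
    return [e for e in timeline if keyword in e[2]]

def twa_cot(timeline, keyword, retrieve):
    """Stage 3 (TWA-CoT): iteratively re-listen to local segments and keep
    only spans backed by actual audio evidence, which is what suppresses
    temporal hallucination and timestamp drift."""
    answer = []
    for start, end, _ in select_relevant_spans(timeline, keyword):
        evidence = retrieve(start, end)   # fetch the local audio segment
        if evidence:                      # accept only grounded timestamps
            answer.append((start, end))
    return answer

# Toy usage: mock audio described by (start_sec, end_sec, summary) events.
events = [(0, 60, "music intro"), (60, 300, "interview"), (300, 360, "music outro")]
timeline = build_global_timeline(events)
grounded = twa_cot(timeline, "music", retrieve=lambda s, e: f"clip[{s}:{e}]")
# grounded == [(0, 60), (300, 360)]
```

In the real model the retrieval step returns audio features rather than strings, and the refinement is a chain-of-thought over the model's own outputs; the stub only mirrors the control flow.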
<p align="center">
<img src="./Figures/LAT-Audio.png" width="1200"/>
<em>Figure 1: Overall framework of LAT-Audio.</em>
</p>
## Model Variants
We provide two model variants:
| Model | Reasoning | Training Data | Description |
|------|----------|--------------|-------------|
| **LAT-Audio** | Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference |
| **LAT-Audio-Base** | No | LAT-Chronicle + in-house | Direct-modeling baseline fine-tuned from Qwen3-Omni with additional in-house data; faster and simpler inference |
## Quick Start
Download the models from Hugging Face:
```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio
huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base
```
For detailed inference methods and examples, please refer to the official repository:
👉 https://github.com/alanshaoTT/LAT-Audio-Repo
## Citation
If you find this work useful, please cite:
```bibtex
@article{shao2026lataudio,
  title={Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding},
  author={Shao, Mingchen and Su, Hang and Tian, Wenjie and Mu, Bingshen and Lin, Zhennan and Fan, Lichun and Luo, Zhenbo and Luan, Jian and Xie, Lei},
  journal={arXiv preprint arXiv:2604.22245},
  year={2026}
}
```
## Contact
For questions, feedback, or collaboration:
📧 mcshao@mail.nwpu.edu.cn |