[![arXiv](https://img.shields.io/badge/arXiv-2604.22245-b31b1b.svg)](https://arxiv.org/abs/2604.22245)

# LAT-Audio

## Overview

LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**.

Unlike existing models that degrade on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm** that maintains temporal consistency over audio up to 30 minutes long.

The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded in this timeline.  
During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces:

- temporal hallucination (invalid timestamps)
- timestamp drift (progressive misalignment over time)



## Model Description

LAT-Audio formulates long-form audio understanding as a structured reasoning process:

1. **Global Timeline Construction**  
   The model summarizes the audio into a coarse temporal structure.

2. **Global-to-Local Reasoning**  
   Downstream tasks are performed conditioned on the global timeline.

3. **Think-With-Audio Chain-of-Thought (TWA-CoT)**  
   The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.

This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.

<p align="center">
  <img src="./Figures/LAT-Audio.png" width="1200"/>
  <em>Figure 1: Overall framework of LAT-Audio.</em>
</p>
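The three-step process above can be pictured with a minimal sketch. Everything in the snippet below is a hypothetical stand-in (the `Event` structure, `build_global_timeline`, `retrieve_segment`, and the loop itself are illustrative assumptions, not the actual LAT-Audio API); see the official repository linked under Quick Start for the real inference code.

```python
# Purely illustrative sketch of the global-to-local reasoning loop.
# All names and data structures are hypothetical, NOT the LAT-Audio API.

from dataclasses import dataclass

@dataclass
class Event:
    start: float      # segment start, in seconds
    end: float        # segment end, in seconds
    description: str  # coarse summary of what happens in this span

def build_global_timeline(audio) -> list[Event]:
    """Step 1: summarize the full audio into a coarse temporal structure (stub)."""
    return [Event(0.0, 60.0, "opening remarks"), Event(60.0, 180.0, "main discussion")]

def retrieve_segment(audio, start: float, end: float):
    """Step 3 helper: crop a local audio span for grounded verification (stub)."""
    return ("segment", start, end)

def answer(audio, question: str, max_steps: int = 4) -> str:
    timeline = build_global_timeline(audio)          # step 1: global timeline
    evidence = []
    # Steps 2-3: reason conditioned on the timeline, revisiting at most
    # max_steps local spans and collecting grounded evidence (TWA-CoT).
    for event in timeline[:max_steps]:
        evidence.append(retrieve_segment(audio, event.start, event.end))
    spans = ", ".join(f"{e.description} [{e.start:.0f}-{e.end:.0f}s]" for e in timeline)
    return f"Answer to '{question}' grounded in {len(evidence)} segments over: {spans}"
```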

## Model Variants

We provide two model variants:

| Model | Reasoning | Training Data | Description |
|-------|-----------|---------------|-------------|
| **LAT-Audio** | Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference |
| **LAT-Audio-Base** | No | LAT-Chronicle + in-house | Direct modeling baseline fine-tuned from Qwen3-Omni with more in-house data, offering faster and simpler inference |



## Quick Start


Download the models from Hugging Face:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio
huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base
```
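
If you prefer downloading from Python instead of the CLI, the same checkpoints can be fetched with `huggingface_hub.snapshot_download` (shown here as an alternative; the CLI commands above are the documented path):

```python
# Alternative to the CLI: download both checkpoints from Python.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="mcshao/LAT-Audio", local_dir="./LAT-Audio")
snapshot_download(repo_id="mcshao/LAT-Audio-Base", local_dir="./LAT-Audio-Base")
```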

For detailed inference methods and examples, please refer to the official repository:
👉 https://github.com/alanshaoTT/LAT-Audio-Repo

## Citation

If you find this work useful, please cite:

```bibtex
@article{shao2026lataudio,
  title={Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding},
  author={Shao, Mingchen and Su, Hang and Tian, Wenjie and Mu, Bingshen and Lin, Zhennan and Fan, Lichun and Luo, Zhenbo and Luan, Jian and Xie, Lei},
  journal={arXiv preprint arXiv:2604.22245},
  year={2026}
}
```

## Contact

For questions, feedback, or collaboration:

📧 mcshao@mail.nwpu.edu.cn