[arXiv:2604.22245](https://arxiv.org/abs/2604.22245)
|
|
| # LAT-Audio |
|
|
| ## Overview |
|
|
| LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**. |
|
|
Unlike existing models that degrade on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm**, enabling the model to maintain temporal consistency over audio up to 30 minutes long.
|
|
| The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded on this timeline. |
| During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces: |
|
|
| - temporal hallucination (invalid timestamps) |
| - timestamp drift (progressive misalignment over time) |
|
|
|
|
|
|
| ## Model Description |
|
|
| LAT-Audio formulates long-form audio understanding as a structured reasoning process: |
|
|
| 1. **Global Timeline Construction** |
| The model summarizes the audio into a coarse temporal structure. |
|
|
| 2. **Global-to-Local Reasoning** |
| Downstream tasks are performed conditioned on the global timeline. |
|
|
| 3. **Think-With-Audio Chain-of-Thought (TWA-CoT)** |
| The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence. |
|
|
| This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail. |
|
|
| <p align="center"> |
| <img src="./Figures/LAT-Audio.png" width="1200"/> |
| <em>Figure 1: Overall framework of LAT-Audio.</em> |
| </p> |
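

As a rough illustration of this control flow only, the sketch below shows how a global timeline could ground an iterative, evidence-retrieving reasoning loop. All names here (`TimelineEntry`, `reason_step`, `retrieve_segment`, etc.) are hypothetical placeholders, not the actual LAT-Audio API; refer to the official repository for the real inference code.

```python
"""Hypothetical sketch of a global-to-local, Think-With-Audio reasoning loop.

Every type and function below is an illustrative stub, not the LAT-Audio API.
"""
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    start: float   # seconds
    end: float     # seconds
    summary: str   # coarse description of this span


@dataclass
class ReasoningStep:
    is_final: bool
    answer: str = ""
    start: float = 0.0   # segment to listen to next, if not final
    end: float = 0.0


def build_global_timeline(audio) -> list[TimelineEntry]:
    """Step 1: summarize the audio into a coarse temporal structure (stub)."""
    return [TimelineEntry(0.0, 60.0, "opening remarks"),
            TimelineEntry(60.0, 300.0, "main discussion")]


def reason_step(trace: list[str]) -> ReasoningStep:
    """One tool-augmented reasoning step over the current trace (stub)."""
    if any(line.startswith("Evidence") for line in trace):
        return ReasoningStep(is_final=True, answer="(answer grounded in evidence)")
    return ReasoningStep(is_final=False, start=60.0, end=90.0)


def retrieve_segment(audio, start: float, end: float) -> str:
    """Step 3 helper: fetch and describe a local audio segment (stub)."""
    return f"description of audio[{start:.0f}s-{end:.0f}s]"


def answer_with_twa_cot(audio, question: str, max_steps: int = 5) -> str:
    timeline = build_global_timeline(audio)                      # global timeline construction
    trace = [f"Timeline: {timeline}", f"Question: {question}"]   # global-to-local grounding
    for _ in range(max_steps):
        step = reason_step(trace)
        if step.is_final:
            return step.answer
        evidence = retrieve_segment(audio, step.start, step.end)  # think *with* audio
        trace.append(f"Evidence [{step.start:.0f}s-{step.end:.0f}s]: {evidence}")
    return "(no answer within step budget)"


print(answer_with_twa_cot(audio=None, question="When does the main discussion start?"))
```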
|
|
| ## Model Variants |
|
|
We provide two model variants:

| | Model | Reasoning | Training Data | Description | |
| |------|----------|--------------|-------------| |
| | **LAT-Audio** | Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference | |
| **LAT-Audio-Base** | No | LAT-Chronicle + in-house | Direct modeling baseline fine-tuned from Qwen3-Omni with additional in-house data; faster and simpler inference |
|
|
|
|
|
|
| ## Quick Start |
|
|
|
|
Download the models from the Hugging Face Hub:

```bash
| pip install -U "huggingface_hub[cli]" |
| huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio |
| huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base |
| ``` |
|
|
| For detailed inference methods and examples, please refer to the official repository: |
| 👉 https://github.com/alanshaoTT/LAT-Audio-Repo |
|
|
| ## Citation |
|
|
| If you find this work useful, please cite: |
|
|
| ```bibtex |
| @article{shao2026lataudio, |
| title={Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding}, |
| author={Shao, Mingchen and Su, Hang and Tian, Wenjie and Mu, Bingshen and Lin, Zhennan and Fan, Lichun and Luo, Zhenbo and Luan, Jian and Xie, Lei}, |
| journal={arXiv preprint arXiv:2604.22245}, |
| year={2026} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| For questions, feedback, or collaboration: |
|
|
| 📧 mcshao@mail.nwpu.edu.cn |