# LAT-Audio-Base

## Overview

LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**.

Unlike existing models that degrade on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm**, enabling the model to maintain temporal consistency over audio up to 30 minutes long.

The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded on this timeline. During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces:

- temporal hallucination (invalid timestamps)
- timestamp drift (progressive misalignment over time)
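
Both failure modes can be checked mechanically. The sketch below is illustrative only: the function names and the `(start, end)`-in-seconds span representation are assumptions for this example, not part of the released evaluation code.

```python
def count_invalid_timestamps(preds, duration):
    """Temporal hallucination: spans that fall outside the audio or are reversed."""
    return sum(1 for start, end in preds
               if start < 0 or end > duration or start >= end)

def mean_start_drift(preds, refs):
    """Timestamp drift: average absolute offset (seconds) between predicted
    and reference span start times."""
    return sum(abs(p - r) for (p, _), (r, _) in zip(preds, refs)) / len(preds)
```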

## Model Description

LAT-Audio formulates long-form audio understanding as a structured reasoning process:

1. **Global Timeline Construction**
   The model summarizes the audio into a coarse temporal structure.

2. **Global-to-Local Reasoning**
   Downstream tasks are performed conditioned on the global timeline.

3. **Think-With-Audio Chain-of-Thought (TWA-CoT)**
   The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.

This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.
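
As a rough illustration of the global-to-local flow, the snippet below sketches the kind of timeline structure step 1 might produce and how steps 2-3 could query it. The data structure and function names are hypothetical, inferred only from the description above; refer to the official repository for the actual interface.

```python
from dataclasses import dataclass

@dataclass
class TimelineEvent:
    start: float    # seconds
    end: float
    summary: str    # coarse description produced in the global pass

def overlapping_events(timeline, t0, t1):
    """Global-to-local step: select timeline events overlapping [t0, t1],
    so the TWA-CoT loop can re-listen to just those local segments."""
    return [ev for ev in timeline if ev.start < t1 and ev.end > t0]
```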

<p align="center">
  <img src="./Figures/LAT-Audio.png" width="1200"/>
  <em>Figure 1: Overall framework of LAT-Audio.</em>
</p>

## Model Variants

We provide two model variants:

| Model | Reasoning | Training Data | Description |
|-------|-----------|---------------|-------------|
| **LAT-Audio** | ✅ Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference |
| **LAT-Audio-Base** | ❌ No | LAT-Chronicle + in-house data | Direct modeling baseline fine-tuned from Qwen3-Omni on additional in-house data, offering faster and simpler inference |

## Quick Start

Download the models from Hugging Face:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio
huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base
```

For detailed inference methods and examples, please refer to the official repository:

👉 https://github.com/alanshaoTT/LAT-Audio-Repo

## Contact

For questions, feedback, or collaboration:

📧 mcshao@mail.nwpu.edu.cn