mcshao committed on
Commit
a0cd61a
·
verified ·
1 Parent(s): 4ca5bf1

Update README.md

  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
- ---
- license: cc-by-nc-4.0
- ---
+ # LAT-Audio-Base
+
+ ## Overview
+
+ LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**.
+
+ Unlike existing models, whose performance degrades on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm** that maintains temporal consistency over audio up to 30 minutes long.
+
+ The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded in this timeline.
+ During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces:
+
+ - temporal hallucination (invalid timestamps)
+ - timestamp drift (progressive misalignment over time)
+
+
+
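To make the two failure modes listed above concrete, here is a minimal sketch of how they could be checked on predicted timestamps. The `Segment` type and the helpers `is_valid`, `hallucination_rate`, and `start_offsets` are illustrative assumptions, not metrics or code from the LAT-Audio repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A predicted or reference time span, in seconds (hypothetical type)."""
    start: float
    end: float


def is_valid(seg, audio_duration):
    """A segment is valid if it has positive length and lies inside the audio."""
    return 0.0 <= seg.start < seg.end <= audio_duration


def hallucination_rate(predicted, audio_duration):
    """Fraction of predicted segments with invalid timestamps (illustrative)."""
    if not predicted:
        return 0.0
    invalid = sum(not is_valid(seg, audio_duration) for seg in predicted)
    return invalid / len(predicted)


def start_offsets(predicted, reference):
    """Signed start-time error per aligned pair.

    Offsets whose magnitude grows along the list suggest timestamp drift.
    Assumes the two lists are already aligned one-to-one.
    """
    return [p.start - r.start for p, r in zip(predicted, reference)]
```

For a 30-minute clip (`audio_duration=1800.0`), a prediction ending at 1810 seconds would be counted as invalid, while a steadily increasing list of offsets would point to drift rather than isolated errors.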
+ ## Model Description
+
+ LAT-Audio formulates long-form audio understanding as a structured reasoning process:
+
+ 1. **Global Timeline Construction**
+ The model summarizes the audio into a coarse temporal structure.
+
+ 2. **Global-to-Local Reasoning**
+ Downstream tasks are performed conditioned on the global timeline.
+
+ 3. **Think-With-Audio Chain-of-Thought (TWA-CoT)**
+ The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.
+
+ This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.
+
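The three stages above can be illustrated with a toy sketch. It is not the model's implementation: the audio is replaced by a list of pre-annotated `Event` objects, and the names `build_global_timeline`, `candidate_windows`, and `locate`, as well as the 300-second window, are assumptions chosen only to show the coarse-then-fine control flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Toy stand-in for something audible in the recording (hypothetical)."""
    start: float  # seconds
    end: float    # seconds
    label: str


def build_global_timeline(events, window=300.0):
    """Stage 1 (toy): summarize the audio as coarse windows -> set of labels."""
    timeline = {}
    for ev in events:
        window_start = (ev.start // window) * window
        timeline.setdefault(window_start, set()).add(ev.label)
    return timeline


def candidate_windows(timeline, query):
    """Stage 2 (toy): use the coarse timeline to decide where to look."""
    return sorted(start for start, labels in timeline.items() if query in labels)


def locate(events, query, window=300.0):
    """Stage 3 (toy): revisit only the candidate windows and return exact spans."""
    timeline = build_global_timeline(events, window)
    hits = []
    for w_start in candidate_windows(timeline, query):
        # "Retrieve" the local segment, then refine the answer with its evidence.
        local = [ev for ev in events if w_start <= ev.start < w_start + window]
        hits.extend((ev.start, ev.end) for ev in local if ev.label == query)
    return hits
```

The point of the sketch is the ordering: the fine-grained search only ever touches windows that the coarse timeline has already marked as relevant, so the final timestamps stay anchored to a global structure.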
+ <p align="center">
+ <img src="./Figures/LAT-Audio.png" width="1200"/>
+ <em>Figure 1: Overall framework of LAT-Audio.</em>
+ </p>
+
+ ## Model Variants
+
+ We provide two model variants:
+
+ | Model | Reasoning | Training Data | Description |
+ |------|----------|--------------|-------------|
+ | **LAT-Audio** | ✅ Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference |
+ | **LAT-Audio-Base** | ❌ No | LAT-Chronicle + in-house | Direct modeling baseline fine-tuned from Qwen3-Omni with more in-house data, offering faster and simpler inference |
+
+
+
+ ## Quick Start
+
+
+ Download the models from Hugging Face:
+ ```
+ pip install -U "huggingface_hub[cli]"
+ huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio
+ huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base
+ ```
+
+ For detailed inference methods and examples, please refer to the official repository:
+ 👉 https://github.com/alanshaoTT/LAT-Audio-Repo
+
+
+
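The same download can also be scripted from Python. This is a minimal sketch, assuming the `huggingface_hub` package and its `snapshot_download` function; the helper name `download_models` and its `downloader` hook are illustrative and not part of the official repository.

```python
from pathlib import Path

# Repository IDs taken from the CLI commands in this README.
REPO_IDS = ("mcshao/LAT-Audio", "mcshao/LAT-Audio-Base")


def download_models(repo_ids=REPO_IDS, root=".", downloader=None):
    """Download each repo into <root>/<repo name> and return the local paths.

    `downloader` is a hypothetical hook so the logic can be exercised offline;
    by default it is huggingface_hub.snapshot_download.
    """
    if downloader is None:
        # Imported lazily so the helper can be tested without the package.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download

    paths = {}
    for repo_id in repo_ids:
        local_dir = Path(root) / repo_id.split("/")[-1]
        paths[repo_id] = str(downloader(repo_id=repo_id, local_dir=str(local_dir)))
    return paths


# Example (requires network access and enough disk space):
# download_models()
```

Called with no arguments, the helper should place the two checkpoints in `./LAT-Audio` and `./LAT-Audio-Base`, mirroring the `--local-dir` values used in the CLI commands.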
+ ## Contact
+
+ For questions, feedback, or collaboration:
+
+ 📧 mcshao@mail.nwpu.edu.cn