# LAT-Audio-Base

## Overview

LAT-Audio is a large audio-language model designed for **precise temporal awareness in long-form audio understanding**.

Unlike existing models that degrade on long audio, LAT-Audio introduces a **progressive global-to-local reasoning paradigm**, enabling the model to maintain temporal consistency over audio up to 30 minutes long.

The core idea is to first construct a **global timeline** that captures the temporal-semantic structure of the audio, and then perform **task-specific reasoning** grounded on this timeline. During reasoning, LAT-Audio iteratively incorporates audio evidence through a **Think-With-Audio Chain-of-Thought (TWA-CoT)** process, which significantly reduces:

- temporal hallucination (invalid timestamps)
- timestamp drift (progressive misalignment over time)
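
Both failure modes can be checked mechanically. The sketch below is illustrative only: the function names and the `(start, end)`-in-seconds span representation are assumptions for this example, not part of the released evaluation code.

```python
def count_invalid_timestamps(preds, duration):
    """Temporal hallucination: spans that fall outside the audio or are reversed."""
    return sum(1 for start, end in preds
               if start < 0 or end > duration or start >= end)

def mean_start_drift(preds, refs):
    """Timestamp drift: average absolute offset (seconds) between predicted
    and reference span start times."""
    return sum(abs(p - r) for (p, _), (r, _) in zip(preds, refs)) / len(preds)
```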

## Model Description

LAT-Audio formulates long-form audio understanding as a structured reasoning process:

1. **Global Timeline Construction**
   The model summarizes the audio into a coarse temporal structure.

2. **Global-to-Local Reasoning**
   Downstream tasks are performed conditioned on the global timeline.

3. **Think-With-Audio Chain-of-Thought (TWA-CoT)**
   The model iteratively retrieves local audio segments and refines its reasoning with grounded evidence.

This design enables robust temporal reasoning under long-context settings, where conventional direct modeling approaches often fail.
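
As a rough illustration of the global-to-local flow, the snippet below sketches the kind of timeline structure step 1 might produce and how steps 2-3 could query it. The data structure and function names are hypothetical, inferred only from the description above; refer to the official repository for the actual interface.

```python
from dataclasses import dataclass

@dataclass
class TimelineEvent:
    start: float    # seconds
    end: float
    summary: str    # coarse description produced in the global pass

def overlapping_events(timeline, t0, t1):
    """Global-to-local step: select timeline events overlapping [t0, t1],
    so the TWA-CoT loop can re-listen to just those local segments."""
    return [ev for ev in timeline if ev.start < t1 and ev.end > t0]
```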

<p align="center">
  <img src="./Figures/LAT-Audio.png" width="1200"/>
  <em>Figure 1: Overall framework of LAT-Audio.</em>
</p>

## Model Variants

We provide two model variants:

| Model | Reasoning | Training Data | Description |
|-------|-----------|---------------|-------------|
| **LAT-Audio** | ✅ Yes | LAT-Chronicle | Tool-augmented multi-step reasoning model with global-to-local temporal inference |
| **LAT-Audio-Base** | ❌ No | LAT-Chronicle + in-house data | Direct modeling baseline fine-tuned from Qwen3-Omni on additional in-house data, offering faster and simpler inference |

## Quick Start

Download the models from Hugging Face:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download mcshao/LAT-Audio --local-dir ./LAT-Audio
huggingface-cli download mcshao/LAT-Audio-Base --local-dir ./LAT-Audio-Base
```

For detailed inference methods and examples, please refer to the official repository:

👉 https://github.com/alanshaoTT/LAT-Audio-Repo

## Contact

For questions, feedback, or collaboration:

📧 mcshao@mail.nwpu.edu.cn